[Web App] WaniKani History + Timemachine

Glad you could fix it :smile:

1 Like

Interesting. I’ll try to keep that in mind the next time someone has this issue.

(I don’t know if it’s just Deja Vu, but now I feel like I remember someone else having a similar issue earlier this year, where they fixed it by generating a new API key. :thinking: )

2 Likes

That also worked for me actually, like I said. I hope that’s always a solution but I don’t know if it was just a fluke? Anyway, at least I know to give that advice when somebody else asks again!

2 Likes

Thanks for your answer! As far as I know, Morphman uses a library called MeCab. Essentially, you can input a string of text (e.g. a novel) and it decomposes it into morphemes, with which it would be possible to create a deck (e.g. for Anki) or measure the overlap between two corpora (what you call ‘coverage’ as the overlap between WK and e.g. the Genji corpus). There are other ways of decomposing Japanese text that don’t involve morphemes, but they’re pretty hard (and in my view not necessary for measuring corpora overlap).

I hope the links might be of some help but I’m sure you’re much more competent in figuring out this problem, if you set out to do so :slight_smile:

1 Like

I don’t mean to spam the topic with this word coverage thing, but I’ve been thinking and looking a bit more and e.g. in the top post in this reddit thread, they summarize the following:

Natural Speech Corpus: Subtitles from 70+ hours of Japanese reality TV compiled by me using MeCab parser; 10,953 unique words of 322,090 total words

Twitter Corpus: Japanese Twitter posts compiled by Worldlex in 2011 (http://worldlex.lexique.org/); 186,631 unique words of 14,358,005 total words

[…]

Books Corpus: From 3,297 books made publicly available through Japanese digital library called Aozora Bunko (https://www.aozora.gr.jp/) compiled by me using MeCab parser; 209,625 unique words of 30,761,634 total words

They used MeCab as well, as I mentioned in the previous post. I think what ‘startled’ me a bit is the sheer number of words detected in these corpora (or rather the variance between the corpora). The reality TV corpus has around 11k unique words, which is of the same order of magnitude as the WK vocabulary list (6000). But if you look at Twitter (which, nota bene, has a lot of noise and neologisms) or the book corpus, the unique word count is a good order of magnitude higher. What surprised me is that we talk a lot about how many kanji you need to cover such and such a corpus or literary text like Genji, but it might be that even though you understand the kanji, you might not get anything at all, because you’re not covering around 90% of the meaning in the text but only like 10% (in the worst case).

I guess what I’m asking is how much do you really know after WK Level 60. 95% kanji covering is nice, but if it means only 10% of words (I’m understating it for emphasis) it would sound pretty dire!

1 Like

No problem :smile: I’m always open to new requests!

Something important to mention (because you mentioned the Twitter corpus having around 180,000 words) is that although there are many many words that can be used - especially if you look at a huge corpus like Twitter - most of these are very rare and you wouldn’t actually use them in daily conversation or even see them in texts.
Also, I don’t know if in the analysis they conducted they differentiated between the hiragana, katakana and kanji versions of words, as well as different spellings. That could definitely bloat the number of “unique words” even though there aren’t even that many.
Roughly speaking I heard the number 5000 words for fluency thrown around often. Obviously this is for speech and just a very rough estimate but I don’t think you need to know more than 10000 words to read tweets even if they use a lot of slang you have to get used to.
So I guess what I wanted to say is that WaniKani has a good basis for you in terms of vocab coverage (6000 words) even though it’s debatable if they’re all relevant but after reaching level 60 I think you will definitely be able to read almost anything with a bit of looking up here and there. Though this is not from experience obviously, as I’m just level 25, but I noticed my understanding of words in Japanese progress rapidly while learning with WaniKani.

1 Like

I completely agree with your assessment that sometimes the collection of these corpora can be a bit dubious, either because it’s done by non-linguists or because it’s a hard-to-quantify corpus like Twitter. But the number 100,000 seems to be roughly what many languages have as a volume of unique lemmas, so even though I am also a bit sceptical of the Twitter corpus, the Aozora one is likely not only made by smart people but perhaps also a good approximation of the actual number of lemmas.

Of course, as you point out, it’s not necessary to know all words. But a plot that shows what percentage of Aozora words (as opposed to kanji) one understands at WK level x could be cool (and I don’t have a hunch how much that could be, since most coverage analyses are done with kanji, not words).
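To make the difference between "unique words" and "total words" concrete, here is a minimal Python sketch of how word coverage could be measured. It assumes the corpus has already been split into words by a morphological analyser such as MeCab; the function names, the toy corpus and the per-level word lists are all made up for illustration and are not taken from the web app or from any real corpus.

```python
from collections import Counter


def coverage(corpus_tokens, known_words):
    """Return (type_coverage, token_coverage) for a tokenized corpus.

    type_coverage:  share of *unique* words in the corpus that are known.
    token_coverage: share of *all* word occurrences that are known.
    """
    counts = Counter(corpus_tokens)
    if not counts:
        raise ValueError("corpus_tokens must not be empty")
    known = set(known_words)
    known_types = sum(1 for word in counts if word in known)
    known_tokens = sum(n for word, n in counts.items() if word in known)
    return known_types / len(counts), known_tokens / sum(counts.values())


def coverage_by_level(corpus_tokens, words_by_level):
    """Cumulative token coverage after each level (hypothetical word lists)."""
    known = set()
    result = []
    for level in sorted(words_by_level):
        known |= set(words_by_level[level])
        result.append((level, coverage(corpus_tokens, known)[1]))
    return result


# Toy, hand-tokenized corpus (assumption: a real one would come from MeCab).
corpus = ["猫", "は", "魚", "を", "食べる", "猫", "は", "寝る", "犬", "は", "走る", "猫"]
type_cov, token_cov = coverage(corpus, {"猫", "は", "食べる"})
per_level = coverage_by_level(corpus, {1: ["猫", "は"], 2: ["食べる"]})
```

In this toy example only 3 of the 8 unique words are known, yet they account for 7 of the 12 word occurrences, because the known words happen to be the frequent ones. That is the effect described above: a corpus can contain a huge number of rare unique words while a much smaller vocabulary still covers most of the running text.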

Thanks for indulging in my rambly musings :slight_smile:

2 Likes

Hello, it’s me again ◔_◔

I was thinking about this figure (and its analog on WKstats):

What ‘bothered’ me a bit is that each WK level has a different number of items to study. I think a neat and statistically sound way to make this a bit less ‘misleading’ would be for the ‘level info’ option at the top left of the plot to have a check box that normalises each level by its number of items (i.e. 2 items per kanji and vocab, 1 item per radical).

The first level of WK has only 136 items while the 3rd for instance has 220 (and the 7th has 307!). The plot as is creates the impression that one gets ‘slower’ after the first few lessons but it’s just because they are shorter. Since there is some variability within the levels (especially the shorter ones at the end of REALITY again) it might be good to normalise.

In my case, instead of 4 days and 15 hours for level 1, the y-axis would read ‘time per item’ and be 0.82 h per item. The third level would be 0.84 h per item, which is now much closer to my speed at level 1 and reassures me that I didn’t slow down.
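The normalisation itself is just a division. A small Python sketch, using only the level 1 numbers quoted above (the function name is made up for illustration):

```python
def hours_per_item(days, hours, item_count):
    """Convert a level duration into average hours per item."""
    total_hours = days * 24 + hours
    return total_hours / item_count


# Level 1: 4 days 15 hours spread over 136 items.
level_1 = hours_per_item(4, 15, 136)
print(round(level_1, 2))  # 0.82
```

By the same arithmetic, 0.84 h per item over the 220 items of level 3 corresponds to roughly 185 hours, or a bit under 7 days 17 hours.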

It’s just a suggestion, if it makes sense to you :slight_smile:

2 Likes

(just a 3rd-party observation, but…)
I think as you get farther along, you’ll find that dividing by the number of items really won’t give you a sense of speed, because the number of items per level doesn’t have much to do with when you level up. For example, no matter how many items are in a level, you can miss up to 10% of them and still level up in the same amount of time, but once you reach 10.01% wrong answers, your time on the level suddenly increases because you won’t level up until 90% of your kanji reach Guru. (And that’s a simplification, because it also matters, to some extent, which items you miss)
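A rough Python sketch of that threshold, assuming only the 90% rule as described here (the function names and the kanji counts in the example are made up, and this is not WaniKani’s actual code):

```python
def kanji_needed_to_level_up(kanji_in_level, threshold_percent=90):
    """Smallest number of Guru'd kanji that reaches the threshold (rounded up)."""
    return (kanji_in_level * threshold_percent + 99) // 100


def can_level_up(guru_kanji, kanji_in_level, threshold_percent=90):
    """True once enough of the level's kanji have reached Guru."""
    return guru_kanji >= kanji_needed_to_level_up(kanji_in_level, threshold_percent)


# Hypothetical level with 30 kanji: 27 at Guru is enough, 26 is not.
print(kanji_needed_to_level_up(30))  # 27
print(can_level_up(27, 30))          # True
print(can_level_up(26, 30))          # False
```

So the level time behaves like a step function: missing one kanji more than the allowance delays the level-up by at least one more review interval, regardless of how many items the level contains.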

3 Likes

I believe what rfindley said to be true.
Nevertheless I can add a sort of “learn speed” chart (or whatever you wanna call it) that shows you how many items you “learned” per day/week/whatever if you want. Although that is kinda similar to the SRS items info; it’s just not a total amount learned.
I don’t know if that is exactly what you were looking for but I think soon a “weighted level graph” with the weights being the amount of items for a level will not be that useful to you. Definitely tell me though if I missed something or misunderstood what you meant.

1 Like

Thanks for both of your answers, @saraqael and @rfindley. I think your argument is in a way true and I thought about it as well. The variance in how fast you beat the level is dominated by some other effects (e.g. skip one day, typo, cat walks over keyboard, etc., and your time might be increased by a bunch [as you illustrate with the 10.01%]) rather than by the number of cards.

The most accurate way of doing it would be to measure the time spent per card, but of course we have a much more coarse-grained scenario at hand. Maybe normalizing the graph doesn’t matter much, but it also wouldn’t hurt much. I think it would be a tiny bit more accurate but still not very accurate, and maybe not worth implementing.

Thanks for entertaining the idea though :slight_smile:

EDIT: Oh, rfindley, I just realized that you’re the wkstats-person! It’s a great site, thanks a bunch!

1 Like

@saraqael any chance of a setting to customize when a day’s “start time” is for the streak counter? I recently lost my streak because I did my reviews just after midnight - but I’d like those to be counted in the previous day, and not break my streak. The WaniKani heatmap script for example lets you select a “day start time” and then counts anything before that (so for example 2am) as belonging to the previous day.
I’d do it myself but am really busy right now :sweat_smile:

1 Like

It’s really frustrating that you lost your streak but yeah, I think it’s definitely possible to make that a setting! I’m pretty busy currently, too. But I should get around to it soon!

1 Like

@Hubbit200 This commit in theory should force an offset of +2 hours (e.g. day starts at 2 a.m.)
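For anyone curious how a “day start time” offset can work in principle, here is a minimal Python sketch. It is not the code from the commit; the function names and the example timestamps are made up, and it assumes review times are already in the user’s local time.

```python
from datetime import datetime, timedelta


def review_day(timestamp, day_start_hour=2):
    """Calendar day a review counts towards, with the day starting at day_start_hour."""
    return (timestamp - timedelta(hours=day_start_hour)).date()


def current_streak(timestamps, day_start_hour=2):
    """Number of consecutive review days ending at the most recent review."""
    days = sorted({review_day(t, day_start_hour) for t in timestamps}, reverse=True)
    if not days:
        return 0
    streak = 1
    for newer, older in zip(days, days[1:]):
        if newer - older == timedelta(days=1):
            streak += 1
        else:
            break
    return streak


# Hypothetical reviews: the second one is done just after midnight.
reviews = [
    datetime(2021, 3, 1, 20, 0),
    datetime(2021, 3, 3, 0, 30),
    datetime(2021, 3, 3, 21, 0),
]
print(current_streak(reviews, day_start_hour=2))  # 3
print(current_streak(reviews, day_start_hour=0))  # 1
```

With a 2 a.m. day start, the 00:30 review is counted towards the previous day and the streak survives; with a midnight day start, the same reviews leave a gap and the streak resets.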

If you’d like, you can click browse files → code → download ZIP and test this out to see if it works:

1 Like

That does seem to fix my streak, so I assume it’s working correctly! Line 73 does need a “return” at the start though. Thanks! :slight_smile: