So, is the thought then that you have an idea about geometric scaling that will result in a more bendy distribution? Are you still intending to rescale the data?
I can’t precisely answer that because I don’t know how you’re defining “ranked”. If you have an item whose “score” is higher than 90% of all the other scores, then that item is in the 10th percentile. If its score is only higher than 1% of the other scores, then it’s in the 99th percentile.
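In code, the two readings differ only in which direction you count. A minimal sketch (the function names and the 1-to-100 toy scores are mine, not anything from the cruncher):

```python
def percentile_top(score, scores):
    """'Top N%' reading: what fraction of scores beat this one."""
    return 100.0 * sum(s > score for s in scores) / len(scores)

def percentile_below(score, scores):
    """Textbook reading: what fraction of scores this one beats."""
    return 100.0 * sum(s < score for s in scores) / len(scores)

scores = list(range(1, 101))            # toy data: scores 1..100
print(percentile_top(91, scores))       # 9.0  -> "top 9%"
print(percentile_below(91, scores))     # 90.0 -> "90th percentile"
```

Same item, same scores, but one convention calls it roughly the 9th percentile and the other the 90th, which is exactly the ambiguity here.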
I am not rescaling anything at present, I am still getting a feel for how the data looks now that I myself am correctly calculating the percentile.
ETA: I’m pulling in your data now to see how it looks with my cruncher output as well. For future files can you do them as file attachments? Getting all that text selected was a little tedious.
I just realized that the example you provided is just a subsection of your data, right? Can you send me the whole file? I’ve written something to suck it in as a ‘raw’ then spit out my intermediate format before going on to the final format, so if you don’t change that format I’ll be able to easily handle it from here on. If you do want to change something in it just let me know.
Absolutely, sorry, didn’t realize you were going to use it. Here’s the unabridged data set, obviously still not wrapped as it will be in the over-blob, nor filtered by WK words yet. Here it is.
I’ll start working with other corpora later today. For the moment, time to get back to WK itself! :-p
While I think and wait on that file, I thought I’d take another lap around what this data “means”. Right now, with no ‘massaging’ or anything else, just (correctly) calculating raw percentiles from the corpus, the wikipedia data would result in a star ranking that looks like this:
5 stars: 211 items. Only 1% of the corpus is covered, but 56.55% (by volume) of all of wikipedia is covered.
4.5 stars: +424 items (635 total). 2% more of the corpus. 83.81% of all ‘volume’.
4 stars: +425 items (1060 total), 2% more of the corpus. 93.34% of all occurrences.
Etc.
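For reference, cumulative coverage numbers like these fall straight out of a running sum over the frequency counts, sorted most-frequent first. A minimal sketch with toy counts (not the real cruncher or the real corpus):

```python
def cumulative_coverage(counts):
    """counts: occurrence counts, sorted descending.
    Returns the running share of total volume covered after each item."""
    total = sum(counts)
    running, out = 0, []
    for c in counts:
        running += c
        out.append(running / total)
    return out

# Toy corpus: a few very common items cover most of the volume.
print(cumulative_coverage([50, 25, 10, 10, 5]))
# -> [0.5, 0.75, 0.85, 0.95, 1.0]
```

Same shape as the wiki numbers above: a small slice of the items covers the bulk of the volume.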
To me, again, this “feels” right. If you learned all 4-star kanji from the wiki corpus, you would know 93% of all the kanji you encounter when reading wikipedia, even though you only had to learn about 5% of the corpus. Do you not agree with this?
As for the file, I had to do “something” with it – I couldn’t say if it looked good or bad by just looking at a few hundred lines of json with human eyeballs. I’ll try the new file now!
Totally. I am 100% onboard through 4.5 stars; those numbers look great and are pretty near the ones I came up with. My problem with the breakdown is 3, 2.5, 2, 1.5 and 1 stars. My gut tells me that each of those groups should be something like twice as large as the one before it, just as 4.5 stars was roughly double 5 stars. Instead, each of the ranks has about 450 items, except for 10, 1 and 0. What I think that will lead to is a high number of rank-0 items in WaniKani, and basically no 3, 2.5, 2, 1.5 or 1 star words.
To clarify, when I say rank, I mean position in the original frequencies document. There is only one rank 1, and it’s almost always の. I’ve been using the word “score” to represent something abstract, and in particular, a star rating, but now with this new algorithm it’s reasonable for me to also talk about scores more organically (as decimals, rather than integers).
I think primarily our disagreement comes down to whether we base our divisions on “rank” (as in, position within the document) or on volume/coverage/percentile. For me, rank is just kind of an artifact of the way the data is presented, and not valuable information in and of itself, whereas volume-based percentile actually represents (regardless of other items in the document) the prevalence of the item in the language. Do you agree with my characterization? Do you feel like rank is a more appropriate way of determining usefulness than volume?
Ok, the data from that file definitely looks different from the wiki data – it’s closer to the NHK easy data. To me this indicates that the vocabulary is fairly specialized, and you’ll need all or most of it in order to be fluent within that particular arena. Here are all three of them. Again, raw percentile-based ranking: I calculate it as “round((1 - percentile) * 10)”, which means the top 5% end up at rank 10, then it’s 10% chunks on down to the bottom. This is probably what you were asking about before and it wasn’t clicking with me – the rounding at either “end” only gets half a band, since only 9.5 and up rounds to 10, but 8.5 to 9.4 all round to 9. I can change this to a floor() and see how it looks.
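For concreteness, the round() bucketing as quoted, sketched in Python. I’m assuming percentile is a value in [0, 1] with 0.0 for the most frequent item; how that percentile gets computed upstream isn’t shown here:

```python
def rank_round(percentile):
    """round((1 - percentile) * 10): the top ~5% land on rank 10,
    because only values of 9.5 and up round to 10."""
    return round((1 - percentile) * 10)

print(rank_round(0.03))   # 10 (inside the top 5%)
print(rank_round(0.12))   # 9
print(rank_round(0.97))   # 0
```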
Here it is with the round:
======= wikipedia
rank items cum.items corpus cum.corpus coverage cum.coverage
10 211 211 1.01% 1.01% 56.54846% 56.55%
9 424 635 2.03% 3.03% 27.26210% 83.81%
8 425 1060 2.03% 5.06% 9.52575% 93.34%
7 425 1485 2.03% 7.09% 3.75061% 97.09%
6 429 1914 2.05% 9.14% 1.57918% 98.67%
5 429 2343 2.05% 11.19% 0.68835% 99.35%
4 444 2787 2.12% 13.31% 0.32421% 99.68%
3 480 3267 2.29% 15.61% 0.15725% 99.84%
2 554 3821 2.65% 18.25% 0.07999% 99.92%
1 1027 4848 4.91% 23.16% 0.05313% 99.97%
0 16084 20932 76.84% 100.00% 0.03096% 100.00%
======= kanji_easy
rank items cum.items corpus cum.corpus coverage cum.coverage
10 16 16 0.91% 0.91% 19.63407% 19.63%
9 35 51 2.00% 2.91% 17.69248% 37.33%
8 37 88 2.11% 5.03% 11.20935% 48.54%
7 38 126 2.17% 7.20% 8.03893% 56.57%
6 49 175 2.80% 10.00% 7.55132% 64.13%
5 66 241 3.77% 13.77% 7.71259% 71.84%
4 65 306 3.71% 17.49% 5.86145% 77.70%
3 90 396 5.14% 22.63% 5.89180% 83.59%
2 157 553 8.97% 31.60% 6.65516% 90.25%
1 337 890 19.26% 50.86% 6.63429% 96.88%
0 860 1750 49.14% 100.00% 3.11856% 100.00%
======= stella
rank items cum.items corpus cum.corpus coverage cum.coverage
10 201 201 0.23% 0.23% 48.29940% 48.30%
9 410 611 0.46% 0.68% 11.56216% 59.86%
8 421 1032 0.47% 1.16% 6.18457% 66.05%
7 477 1509 0.53% 1.69% 4.56944% 70.62%
6 520 2029 0.58% 2.27% 3.57885% 74.19%
5 608 2637 0.68% 2.95% 3.08846% 77.28%
4 859 3496 0.96% 3.91% 3.22108% 80.50%
3 1278 4774 1.43% 5.35% 3.44624% 83.95%
2 2462 7236 2.76% 8.10% 4.30774% 88.26%
1 8066 15302 9.03% 17.13% 6.36061% 94.62%
0 74013 89315 82.87% 100.00% 5.38145% 100.00%
Here it is with the floor() (9.x goes to 9, 8.x goes to 8, etc.). Note that this is multiplying the percentile by 11 instead of 10, so I can get 11 rankings, but that doesn’t really impact things much by itself.
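The floor() variant described here, sketched under the same assumption (percentile in [0, 1], 0.0 for the most frequent item); the clamp to 10 at percentile 0.0 exactly is my guess at an edge case, not something stated here:

```python
import math

def rank_floor(percentile):
    """floor((1 - percentile) * 11), clamped so that percentile 0.0
    doesn't produce an out-of-range rank of 11."""
    return min(math.floor((1 - percentile) * 11), 10)

print(rank_floor(0.02))   # 10
print(rank_floor(0.12))   # 9
print(rank_floor(0.95))   # 0
```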
======= wikipedia
rank items cum.items corpus cum.corpus coverage cum.coverage
10 384 384 1.83% 1.83% 71.80601% 71.81%
9 385 769 1.84% 3.67% 16.04815% 87.85%
8 387 1156 1.85% 5.52% 6.63060% 94.48%
7 388 1544 1.85% 7.38% 2.91512% 97.40%
6 389 1933 1.86% 9.23% 1.31031% 98.71%
5 390 2323 1.86% 11.10% 0.62271% 99.33%
4 403 2726 1.93% 13.02% 0.31526% 99.65%
3 427 3153 2.04% 15.06% 0.16066% 99.81%
2 472 3625 2.25% 17.32% 0.08600% 99.89%
1 669 4294 3.20% 20.51% 0.05370% 99.95%
0 16638 20932 79.49% 100.00% 0.05148% 100.00%
======= kanji_easy
rank items cum.items corpus cum.corpus coverage cum.coverage
10 30 30 1.71% 1.71% 27.96203% 27.96%
9 32 62 1.83% 3.54% 13.32865% 41.29%
8 34 96 1.94% 5.49% 9.13495% 50.43%
7 35 131 2.00% 7.49% 7.04664% 57.47%
6 45 176 2.57% 10.06% 6.78670% 64.26%
5 62 238 3.54% 13.60% 7.26672% 71.53%
4 57 295 3.26% 16.86% 5.28403% 76.81%
3 76 371 4.34% 21.20% 5.34980% 82.16%
2 127 498 7.26% 28.46% 6.10557% 88.27%
1 201 699 11.49% 39.94% 5.73432% 94.00%
0 1051 1750 60.06% 100.00% 6.00058% 100.00%
======= stella
rank items cum.items corpus cum.corpus coverage cum.coverage
10 367 367 0.41% 0.41% 54.25751% 54.26%
9 375 742 0.42% 0.83% 7.86507% 62.12%
8 387 1129 0.43% 1.26% 5.00073% 67.12%
7 449 1578 0.50% 1.77% 4.03590% 71.16%
6 474 2052 0.53% 2.30% 3.16995% 74.33%
5 559 2611 0.63% 2.92% 2.83934% 77.17%
4 753 3364 0.84% 3.77% 2.89696% 80.07%
3 1048 4412 1.17% 4.94% 3.02697% 83.09%
2 1822 6234 2.04% 6.98% 3.67054% 86.76%
1 4155 10389 4.65% 11.63% 4.83944% 91.60%
0 78926 89315 88.37% 100.00% 8.39759% 100.00%
Why does this concern you? I fully expect this to happen with some of the corpora, NHK easy especially. WK is going to teach you way more kanji and vocab than you need in order to read those articles: by level 48, WK (supposedly) has you at 95% of N2, while NHK easy is geared towards N4 – meaning all the N3 and N2 vocab that WK teaches you is indeed “useless” for reading NHK easy.
Crap I think I misunderstood. When you say “no rank 0” you mean no 5 star, or no 0 star? What we have here is a failure to communicate!
It’s just that it’s less descriptive and helpful. I mean, it’s like anything with diminishing returns: the more you go in, the progressively less you get out. The point of a rating system is to give you an idea what sort of… quantile you’re in. It seems (gut feel) like the groups should just be getting larger as they go, so that you have a greater chance of encountering a variety of designations. If there are any 0-star items in WaniKani it should pretty much be damning to Koichi’s efforts, but it would be realistic for there to be a few 1-star, and a few dozen 2-star items. Basically, a 0-star item is like… in this field there is literally no reason to learn that word, so I definitely don’t think 9% of the text should be made up of 0-star items…
Gaah… maybe we should start by writing a glossary?
It’s 9% (or whatever) of that particular corpus that’s 0-star… we don’t know how many WK items are 0-star, unless you’ve already tested your corpus against all the WK vocab? The wikipedia corpus is far larger than the WK corpus, so the chances of anything on wk being looked up and you getting a 0 star back is exceedingly small.
Am I explaining this clearly? The number of 0-star (or 1 or even 2 star) items in a particular corpus doesn’t tell us anything about how the items in WK are going to rank against that corpus until we actually check it.
Your always-doubling feeling sounds like Zipf’s law, which I linked earlier, but that law doesn’t apply to corpora that are intentionally made up of a small subset of the language with a much different distribution.
In any case my point is this: You can’t draw any conclusions about how WK’s items are going to rank against any corpus by only looking at the ranks within that corpus. We have to actually look up the WK items and see how they rank.
I mean, the items in WK don’t actually have a bearing on the ratings, right? A good word to know in a particular corpus is a good word to know. I guess more than anything I don’t think it’s a good approach if it results in any non-trivial corpus having 9% of its content (by volume) rated at 0 stars. 0 stars should be less than a percent of any corpus (on a gut check), so as a heuristic reality check: count the number of items off the bottom it takes to reach 1% of coverage, and that is the number it would be inappropriate to exceed for 0 stars.
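That gut-check heuristic is easy to make mechanical. A sketch with toy numbers (the 1% threshold is the one suggested above; the function name and the counts are illustrative):

```python
def zero_star_budget(counts, threshold=0.01):
    """counts: occurrence counts sorted descending.  Walks up from the
    rarest items and counts how many fit within `threshold` of total
    volume -- the proposed ceiling on how many items may rate 0 stars."""
    total = sum(counts)
    covered, budget = 0, 0
    for c in reversed(counts):          # start from the rarest items
        if covered + c > total * threshold:
            break
        covered += c
        budget += 1
    return budget

# Toy tail-heavy corpus: 1000 total occurrences, twenty singletons.
print(zero_star_budget([500, 300, 100, 50, 20, 10] + [1] * 20))  # -> 10
```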
They aren’t used in creating the corpora nor anything derived from them, of course. Is that what you mean? The corpora are there to look up WK tokens, so WK data itself cannot be used in constructing them or else there would be bias built into the system.
A good word to know in a field is a high ranking word in the corpus for that field. So for example, 5 star words on wikipedia are good to know if you want to read wikipedia, 5 star words from a corpus made of medical terms are essential words to know if you want to read medical journals. In both cases there are going to be words that are 5 star in one corpus and 1 star in the other.
In both cases, there are possibly going to be some low-ranking words from WK. I’ll pick some of the later ones and see how they rank in the sets I have now.
Here are four examples. I went to the kanji list for WK, went to the last set (Level 60) and took the first two. I then looked up both of them in all three of the post-processed corpora I have here.
The first one, 煩 (annoy), appears in all three. In wiki it’s rank 4 (out of 11), making it 1.5 stars; it appears only twice in all of NHK EZ, giving it 0 stars there; and it occurs only 369 times in the file you sent, making it 0.5 stars.
The next one, 蛮 (barbarian), does not appear at all in your corpus or the NHK EZ corpus, making it rank 0 in both of those as well. It shows up as a rank 4 in wiki though, just like the other one.
Both of those are JLPT N1 kanji.
I did the same thing with the first two level 60 WK vocab words, 娘婿 (adopted son in law) and 解剖学 (anatomy). Your corpus is the only one I have with vocab right now, and neither of these words appears in it at all – so, right now, they would be 0-star vocab words.
Four more, these are the first two from level 1 for kanji and vocab.
上. above. JLPT N5. wiki:5star, nhk:5star, stella:5star
下. below. JLPT N5. wiki:5star, nhk:3.5star, stella:5star
大人. adult. JLPT N5. wiki:never, nhk:never, stella:3.5star
一人. alone. JLPT N5. wiki:never, nhk:never, stella:0star (my rank… it is rating 1 in your file. It appears only 244 times.)
Hey, cool, thanks for running this down. I’m assuming when you say “stella” you mean the novel corpus? And those stars are based on your algorithm, right? My ratings were included in that JSON, I’d be curious how they compare…
- 上 - 5 stars
- 下 - 5 stars
- 大人 - 4 stars
- 一人 - 1 star
Yeah, well, I mean, it’s trivial for me to include these in the files so I will. You’re free to ignore them, but I think I’m pretty convinced. At least for demonstration purposes, try including both and see how it looks.
In the meantime I’ll work towards paring down that list to only the ones included in WK, and putting more than one corpus into the blob. Thanks so much for digging into this with me!
Right, the stars are my ranking, ‘stella’ is the file you sent me (the full one, not the small one).
For the files, you can put what you want in there, just don’t change the structure please (adding more fields is fine of course), and keep it to one corpus per file for now if you would. If there are existing vocab lists you want to use, you can point me to them or run them through your own processing first then send them my way.
I’m interested in solving the puzzle, no thanks required!
Alright! Trucking right along. Here’s a link to a google drive directory I’ll be dumping these per-corpus json blobs into. I trimmed down the novels one considerably, so I recommend re-downloading. Additionally, I put up the full wikipedia one as well. I expect I’ll need to keep toying with these, but they look pretty solid right now. I’m going to keep going with the words, and then I’ll double back and retool for Kanji probably later.