Extension Suggestion: Corpora Rankings

So, is the thought then that you have an idea about geometric scaling that will result in a more bendy distribution? Are you still intending to rescale the data?

I can’t precisely answer that because I don’t know how you’re defining “ranked”. If you have an item whose “score” is higher than 90% of all the other scores, then that item is in the 10th percentile. If its score is only higher than 1% of the other scores, then it’s in the 99th percentile. (Note I’m counting percentiles from the top here, so a lower percentile means a more frequent item.)
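To make that definition concrete, here’s a minimal sketch; it assumes the “score” is a raw occurrence count, which is my assumption since we haven’t pinned that down:

```python
def percentile_from_top(score: float, all_scores: list[float]) -> float:
    """Top-down percentile: the fraction of scores NOT beaten by `score`.

    An item that scores higher than 90% of the others lands near 0.10
    (the 10th percentile); the single highest score lands near 0.
    """
    beaten = sum(1 for s in all_scores if s < score)
    return 1.0 - beaten / len(all_scores)
```

(In practice you’d sort once and binary-search instead of rescanning per item, but the definition is the point here.)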

I am not rescaling anything at present; I am still getting a feel for how the data looks now that I myself am correctly calculating the percentile.

ETA: I’m pulling in your data now to see how it looks with my cruncher output as well. For future files, can you send them as file attachments? Getting all that text selected was a little tedious. :wink:

I just realized that the example you provided is just a subsection of your data, right? Can you send me the whole file? I’ve written something to suck it in as a ‘raw’ then spit out my intermediate format before going on to the final format, so if you don’t change that format I’ll be able to easily handle it from here on. If you do want to change something in it just let me know.

Absolutely, sorry, didn’t realize you were going to use it. Here’s the unabridged data set, obviously still not wrapped as it will be in the over-blob, nor filtered by WK words yet.

I’ll start working with other corpora later today. For the moment, time to get back to WK itself! :-p

While I think and wait on that file, I thought I’d take another lap around what this data “means”. Right now, with no ‘massaging’ or anything else, just (correctly) calculating raw percentiles from the corpus, the Wikipedia data would result in a star ranking that looks like this:

5 stars: 211 items. Only 1% of the corpus is covered, but 56.55% (by volume) of all of Wikipedia is covered.
4.5 stars: +424 items (635 total). 2% more of the corpus; 83.81% of all ‘volume’.
4 stars: +425 items (1060 total). 2% more of the corpus; 93.34% of all occurrences.

Etc.
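For what it’s worth, those star ratings look like they’re just the 0–10 rank halved (rank 10 → 5 stars, rank 9 → 4.5, and so on); that’s my inference from the bucket counts above, not something we’ve formally agreed on, and the low end may be mapped differently:

```python
def rank_to_stars(rank: int) -> float:
    # Inferred mapping only: 0-10 integer rank onto 0-5 stars in half-star steps.
    return rank / 2.0
```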

To me, again, this “feels” right. If you learned every kanji rated 4 stars and up from the wiki corpus, you would know 93% of all the kanji you encounter when reading Wikipedia, even though you only had to learn about 5% of the corpus. Do you not agree with this?

As for the file, I had to do “something” with it; I couldn’t say if it looked good or bad just by looking at a few hundred lines of JSON with human eyeballs. :wink: I’ll try the new file now!

Totally. I am 100% on board through 4.5 stars; those numbers look great and are pretty near to the ones I came up with. My problem with the breakdown is 3, 2.5, 2, 1.5 and 1 stars. My gut tells me that each of those groups should be something like twice as large as the one before it, just as 4.5 was from 5. Each of the ranks has about 450 items, except for 10, 1 and 0. What I think that will lead to is a high number of rank 0 items in WaniKani, and basically no 3, 2.5, 2, 1.5 or 1 star words.

To clarify, when I say rank, I mean position in the original frequencies document. There is only one rank 1, and it’s almost always の. I’ve been using the word “score” to represent something abstract, and in particular, a star rating, but now with this new algorithm it’s reasonable for me to also talk about scores more organically (as decimals, rather than integers).

I think primarily our disagreement comes down to whether we base our divisions on “rank” (as in, position within the document) or on volume/coverage/percentile. For me, rank is just kind of an artifact of the way the data is presented, and not valuable information in and of itself, whereas a volume-based percentile actually represents (regardless of the other items in the document) the prevalence of the item in the language. Do you agree with my characterization? Do you feel like rank is a more appropriate way of determining usefulness than volume?

OK, the data from that file definitely looks different from the wiki data; it’s closer to the NHK Easy data. To me this indicates that the vocabulary is fairly specialized, and you’ll need all or most of it in order to be fluent within that particular arena. Here are all three of them, again with raw percentile-based ranking. I calculate it as “round((1 - percentile) * 10)”, which means the top 5% end up at rank 10, then it’s 10% chunks on down to the bottom. This is probably what you were asking about before when it wasn’t clicking with me: the rounding at either “end” only gets half a band, since only 9.5 and up round to 10, but 8.5 to 9.4 all round to 9. I can change this to a floor() and see how it looks.
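In code, the round() version would be something like this (a minimal sketch; it assumes each item already has a top-down percentile in [0, 1], since we haven’t fully settled how that percentile itself gets computed):

```python
import math

def rank_round(percentile: float) -> int:
    """round((1 - percentile) * 10) with conventional half-up rounding.

    Top ~5% -> rank 10, then 10% bands on down to rank 0; the two end
    bands are half-width, since only 9.5 and up rounds to 10. Using
    floor(x + 0.5) sidesteps Python's round-half-to-even behavior,
    which would otherwise send the 8.5 boundary to 8 instead of 9.
    """
    return math.floor((1 - percentile) * 10 + 0.5)
```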

Here it is with the round (columns: rank; items at that rank; cumulative items; that rank’s share of corpus items; cumulative share; that rank’s coverage by volume; cumulative coverage):

======= wikipedia
rank    items   cumui   corpus  cumucor coverage        cumucov
10      211     211      1.01%   1.01%  56.54846%       56.55%
9       424     635      2.03%   3.03%  27.26210%       83.81%
8       425     1060     2.03%   5.06%   9.52575%       93.34%
7       425     1485     2.03%   7.09%   3.75061%       97.09%
6       429     1914     2.05%   9.14%   1.57918%       98.67%
5       429     2343     2.05%  11.19%   0.68835%       99.35%
4       444     2787     2.12%  13.31%   0.32421%       99.68%
3       480     3267     2.29%  15.61%   0.15725%       99.84%
2       554     3821     2.65%  18.25%   0.07999%       99.92%
1       1027    4848     4.91%  23.16%   0.05313%       99.97%
0       16084   20932   76.84%  100.00%  0.03096%       100.00%
======= kanji_easy
rank    items   cumui   corpus  cumucor coverage        cumucov
10      16      16       0.91%   0.91%  19.63407%       19.63%
9       35      51       2.00%   2.91%  17.69248%       37.33%
8       37      88       2.11%   5.03%  11.20935%       48.54%
7       38      126      2.17%   7.20%   8.03893%       56.57%
6       49      175      2.80%  10.00%   7.55132%       64.13%
5       66      241      3.77%  13.77%   7.71259%       71.84%
4       65      306      3.71%  17.49%   5.86145%       77.70%
3       90      396      5.14%  22.63%   5.89180%       83.59%
2       157     553      8.97%  31.60%   6.65516%       90.25%
1       337     890     19.26%  50.86%   6.63429%       96.88%
0       860     1750    49.14%  100.00%  3.11856%       100.00%
======= stella
rank    items   cumui   corpus  cumucor coverage        cumucov
10      201     201      0.23%   0.23%  48.29940%       48.30%
9       410     611      0.46%   0.68%  11.56216%       59.86%
8       421     1032     0.47%   1.16%   6.18457%       66.05%
7       477     1509     0.53%   1.69%   4.56944%       70.62%
6       520     2029     0.58%   2.27%   3.57885%       74.19%
5       608     2637     0.68%   2.95%   3.08846%       77.28%
4       859     3496     0.96%   3.91%   3.22108%       80.50%
3       1278    4774     1.43%   5.35%   3.44624%       83.95%
2       2462    7236     2.76%   8.10%   4.30774%       88.26%
1       8066    15302    9.03%  17.13%   6.36061%       94.62%
0       74013   89315   82.87%  100.00%  5.38145%       100.00%

Here it is with the floor() (9.x goes to 9, 8.x goes to 8, etc.). Note that this multiplies (1 - percentile) by 11 instead of 10, so I can still get 11 rankings, but that doesn’t really change things much by itself.
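(The floor variant as a sketch, with the same caveats about the percentile input as before:)

```python
import math

def rank_floor(percentile: float) -> int:
    """floor((1 - percentile) * 11): eleven equal ~9.09% bands, 0 to 10.

    Clamped so a percentile of exactly 0.0 still lands on rank 10
    instead of spilling over to 11.
    """
    return min(10, math.floor((1 - percentile) * 11))
```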

======= wikipedia
rank    items   cumui   corpus  cumucor coverage        cumucov
10      384     384      1.83%   1.83%  71.80601%       71.81%
9       385     769      1.84%   3.67%  16.04815%       87.85%
8       387     1156     1.85%   5.52%   6.63060%       94.48%
7       388     1544     1.85%   7.38%   2.91512%       97.40%
6       389     1933     1.86%   9.23%   1.31031%       98.71%
5       390     2323     1.86%  11.10%   0.62271%       99.33%
4       403     2726     1.93%  13.02%   0.31526%       99.65%
3       427     3153     2.04%  15.06%   0.16066%       99.81%
2       472     3625     2.25%  17.32%   0.08600%       99.89%
1       669     4294     3.20%  20.51%   0.05370%       99.95%
0       16638   20932   79.49%  100.00%  0.05148%       100.00%
======= kanji_easy
rank    items   cumui   corpus  cumucor coverage        cumucov
10      30      30       1.71%   1.71%  27.96203%       27.96%
9       32      62       1.83%   3.54%  13.32865%       41.29%
8       34      96       1.94%   5.49%   9.13495%       50.43%
7       35      131      2.00%   7.49%   7.04664%       57.47%
6       45      176      2.57%  10.06%   6.78670%       64.26%
5       62      238      3.54%  13.60%   7.26672%       71.53%
4       57      295      3.26%  16.86%   5.28403%       76.81%
3       76      371      4.34%  21.20%   5.34980%       82.16%
2       127     498      7.26%  28.46%   6.10557%       88.27%
1       201     699     11.49%  39.94%   5.73432%       94.00%
0       1051    1750    60.06%  100.00%  6.00058%       100.00%
======= stella
rank    items   cumui   corpus  cumucor coverage        cumucov
10      367     367      0.41%   0.41%  54.25751%       54.26%
9       375     742      0.42%   0.83%   7.86507%       62.12%
8       387     1129     0.43%   1.26%   5.00073%       67.12%
7       449     1578     0.50%   1.77%   4.03590%       71.16%
6       474     2052     0.53%   2.30%   3.16995%       74.33%
5       559     2611     0.63%   2.92%   2.83934%       77.17%
4       753     3364     0.84%   3.77%   2.89696%       80.07%
3       1048    4412     1.17%   4.94%   3.02697%       83.09%
2       1822    6234     2.04%   6.98%   3.67054%       86.76%
1       4155    10389    4.65%  11.63%   4.83944%       91.60%
0       78926   89315   88.37%  100.00%  8.39759%       100.00%

Why does this concern you? I fully expect this to happen with some of the corpora, especially NHK Easy. WK is going to teach you way more kanji and vocab than you need in order to read those articles: by level 48, WK (supposedly) has you at 95% of N2, while NHK Easy is geared towards N4, meaning all the N3 and N2 vocab that WK teaches you is indeed “useless” for reading NHK Easy.

Crap, I think I misunderstood. When you say “no rank 0”, do you mean no 5 star, or no 0 star? What we have here is a failure to communicate! :smiley:

It’s just that it’s less descriptive and less helpful. I mean, like anything with diminishing returns, the further in you go, the progressively less you get out. The point of a rating system is to give you an idea of what sort of… quantile you’re in. It seems (gut feel) like the groups should just keep getting larger as they go, so that you have a greater chance of encountering a variety of designations. If there are any 0-star items in WaniKani, it should pretty much be damning to Koichi’s efforts, but it would be realistic for there to be a few 1-star items and a few dozen 2-star items. Basically, a 0-star item says “in this field there is literally no reason to learn that word”, so I definitely don’t think 9% of the text should be made of 0-star items…

Gaah… maybe we should start by writing a glossary?

It’s 9% (or whatever) of that particular corpus that’s 0-star… we don’t know how many WK items are 0-star, unless you’ve already tested your corpus against all the WK vocab? The wikipedia corpus is far larger than the WK corpus, so the chances of anything on WK being looked up and coming back 0-star are exceedingly small.

Am I explaining this clearly? The number of 0-star (or 1- or even 2-star) items in a particular corpus doesn’t tell us anything about how the items in WK are going to rank against that corpus until we actually check it.

Your always-doubling feeling sounds like Zipf’s law, which I linked earlier, but that law doesn’t apply to corpora that are intentionally made up of a small subset of the language with a much different distribution.
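(For what it’s worth, here’s where the doubling intuition comes from under an idealized Zipf distribution; this is synthetic data, not any of our corpora. Cutting an ideal 1/rank corpus into equal slices of coverage makes each slice need a near-constant multiple of the items the previous one did:)

```python
# Synthetic illustration only: an ideal Zipf corpus, f(r) ~ 1/r.
N = 20_000                       # number of distinct items
freqs = [1 / r for r in range(1, N + 1)]
total = sum(freqs)

# How many items does each successive 10% slice of coverage need?
buckets, cum, start = [], 0.0, 0
for i, f in enumerate(freqs, 1):
    cum += f / total
    if len(buckets) < 9 and cum >= (len(buckets) + 1) / 10:
        buckets.append(i - start)
        start = i
buckets.append(N - start)        # the last slice is the long tail

# Later slices each need roughly 2.9x the items of the one before (for this N).
print(buckets)
```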

In any case, my point is this: you can’t draw any conclusions about how WK’s items are going to rank against any corpus by only looking at the ranks within that corpus. We have to actually look up the WK items and see how they rank.
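(That check could look something like this sketch; `wk_vocab` and `corpus_ranks` are stand-in names for data we haven’t finalized:)

```python
from collections import Counter

def rank_distribution(wk_vocab: list[str], corpus_ranks: dict[str, int]) -> Counter:
    """Tally how WK items would actually rate against one corpus.

    `corpus_ranks` maps each corpus token to its 0-10 rank; anything
    WK teaches that the corpus never saw falls through to rank 0.
    """
    return Counter(corpus_ranks.get(word, 0) for word in wk_vocab)

# e.g. rank_distribution(["上", "下", "娘婿"], wiki_ranks)
#   -> Counter({10: 2, 0: 1})   (illustrative output only)
```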

I mean, the items in WK don’t actually have any bearing on the ratings, right? A good word to know in a particular corpus is a good word to know. I guess more than anything I don’t think it’s a good approach if it results in any non-trivial corpus having 9% of its content by volume rated at 0 stars. Zero stars should be less than a percent of any corpus (on a gut check), so as a heuristic reality check: count how many items off the bottom it takes to get to 1% of coverage, and that’s the number it would be inappropriate for the 0-star bucket to exceed.
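(That reality check is easy to mechanize; a sketch, assuming `counts` holds the per-item occurrence counts for one corpus:)

```python
def max_zero_star_items(counts: list[int], budget: float = 0.01) -> int:
    """How many of the least frequent items fit inside `budget` coverage.

    Walks up from the rarest items, accumulating volume until the
    coverage budget (1% by default) is spent; per the gut check above,
    the 0-star bucket shouldn't hold more items than this.
    """
    total = sum(counts)
    spent, n = 0, 0
    for c in sorted(counts):             # rarest items first
        spent += c
        if spent / total > budget:
            break
        n += 1
    return n
```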

They aren’t used in creating the corpora, nor in anything derived from them, of course. Is that what you mean? The corpora are there to look up WK tokens against, so the WK data itself cannot be used in constructing them, or else there would be bias built into the system.

A good word to know in a field is a high-ranking word in the corpus for that field. So, for example, 5-star words on wikipedia are good to know if you want to read Wikipedia, and 5-star words from a corpus made of medical terms are essential to know if you want to read medical journals. In both cases there are going to be words that are 5-star in one corpus and 1-star in the other.

In both cases, there are possibly going to be some low-ranking words from WK. I’ll pick some of the later ones and see how they rank in the sets I have now.

Here are four examples. I went to the WK kanji list, jumped to the last set (Level 60), and took the first two. I then looked up both of them in all three of the post-processed corpora I have here.

The first one, 煩 (annoy), appears in all three. In wiki it’s rank 4 (out of 11), making it a 1.5 star; it appears only twice in all of NHK EZ, giving it 0 stars there; and it occurs only 369 times in the file you sent, making it a 0.5 star.

The next one, 蛮 (barbarian), does not appear at all in your corpus or the NHK EZ corpus, making it rank 0 in both of those as well. It shows up as rank 4 in wiki though, just like the other one.

Both of those are JLPT N1 kanji.

I did the same thing with the first two level 60 WK vocab words, 娘婿 (son-in-law) and 解剖学 (anatomy). Your corpus is the only one I have with vocab right now, and neither of these words appears in it at all; so, right now, they would be 0-star vocab words.

Four more, these are the first two from level 1 for kanji and vocab.

上. above. JLPT N5. wiki: 5 star, nhk: 5 star, stella: 5 star
下. below. JLPT N5. wiki: 5 star, nhk: 3.5 star, stella: 5 star
大人. adult. JLPT N5. wiki: never, nhk: never, stella: 3.5 star
一人. alone. JLPT N5. wiki: never, nhk: never, stella: 0 star (my rank… it is rating 1 in your file. It appears only 244 times.)

Hey, cool, thanks for running this down. I’m assuming when you say “stella” you mean the novel corpus? And those stars are based on your algorithm, right? My ratings were included in that JSON; I’d be curious how they compare…

  • 上 - 5 stars
  • 下 - 5 stars
  • 大人 - 4 stars
  • 一人 - 1 star

Yeah, well, I mean, it’s trivial for me to include these in the files, so I will. You’re free to ignore them, but I think I’m pretty convinced. At least for demonstration purposes, try including both and see how it looks.

In the meantime I’ll work towards paring down that list to only the ones included in WK, and putting more than one corpus into the blob. Thanks so much for digging into this with me!

Right, the stars are my ranking; ‘stella’ is the file you sent me (the full one, not the small one).

For the files, you can put what you want in there; just don’t change the structure, please (adding more fields is fine, of course), and keep it to one corpus per file for now if you would. If there are existing vocab lists you want to use, you can point me to them or run them through your own processing first and then send them my way.

I’m interested in solving the puzzle, no thanks required!

Alright! Trucking right along. Here’s a link to a Google Drive directory I’ll be dumping these per-corpus JSON blobs into. I trimmed down the novels one considerably, so I recommend re-downloading. Additionally, I put up the full wikipedia one as well. I expect I’ll need to keep toying with these, but they look pretty solid right now. I’m going to keep going with the words, and then I’ll double back and retool for kanji, probably later.