Extension Suggestion: Corpora Rankings

Small bug fixed; I’ve uploaded the new results. Here’s a quick tally of the WK items at each of my star ratings:

Stars	Novels	Wiki
5.0	  45	 33
4.5	 227	272
4.0	 340	368
3.5	 400	448
3.0	 438	529
2.5	 549	648
2.0	 551	728
1.5	 604	829
1.0	 729	898
0.5	 913	732
0.0	 764	102

For the most part it looks like Koichi did a reasonable job, and that the novels corpus is probably a little silly. The 0 stars score for Wikipedia is impressive.

I literally just downloaded both files and ran my processing on them, so I’m not sure whether I missed the new version. Grabbing 'em again.

When I run it with the same few tests as before I get

上         5    5
下         5    5
大人       4    4
一人       0    0.5
煩         none none
蛮         none none
娘婿       none none
解剖学      none none

The first value is from your wiki file; the second is from the novels file.

Are you able to run your algorithm and then tally all of the results based on their star rating?

Sure, gimme five.

===================
wikipedia
5.0 stars: 384
4.5 stars: 385
4.0 stars: 387
3.5 stars: 388
3.0 stars: 389
2.5 stars: 390
2.0 stars: 403
1.5 stars: 427
1.0 stars: 472
0.5 stars: 669
0.0 stars: 16638
===================
nhk1
5.0 stars: 30
4.5 stars: 32
4.0 stars: 34
3.5 stars: 35
3.0 stars: 45
2.5 stars: 62
2.0 stars: 57
1.5 stars: 76
1.0 stars: 127
0.5 stars: 201
0.0 stars: 1051
===================
wikipedia-words
5.0 stars: 431
4.5 stars: 432
4.0 stars: 434
3.5 stars: 447
3.0 stars: 442
2.5 stars: 465
2.0 stars: 483
1.5 stars: 508
1.0 stars: 542
0.5 stars: 619
0.0 stars: 784
===================
novels-words
5.0 stars: 231
4.5 stars: 238
4.0 stars: 238
3.5 stars: 260
3.0 stars: 264
2.5 stars: 309
2.0 stars: 346
1.5 stars: 416
1.0 stars: 504
0.5 stars: 720
0.0 stars: 2034

Very interesting… How do you feel about those results?

I’m ambivalent. I’d like to see the growth look a bit more exponential, but overall I think it’s probably fairly accurate. Do you know of any preexisting list of all the WK kanji & vocab? I’d like to see how many WK items there are at each rank for each corpus. I’ve held off so far just because I don’t feel like manually copy/pasting every kanji & vocab page into a document and then manually turning it into a list I can process. :wink:

Here’s the same thing showing total occurrences (‘volume’) rather than simply # of items within the corpus. It looks maybe closer to what you were expecting?

===================
wikipedia
5.0 stars: 563364442
4.5 stars: 125908111
4.0 stars: 52021331
3.5 stars: 22871033
3.0 stars: 10280231
2.5 stars: 4885554
2.0 stars: 2473432
1.5 stars: 1260456
1.0 stars: 674703
0.5 stars: 421330
0.0 stars: 403900
===================
nhk1
5.0 stars: 44213
4.5 stars: 21075
4.0 stars: 14444
3.5 stars: 11142
3.0 stars: 10731
2.5 stars: 11490
2.0 stars: 8355
1.5 stars: 8459
1.0 stars: 9654
0.5 stars: 9067
0.0 stars: 9488
===================
wikipedia-words
5.0 stars: 118769153
4.5 stars: 25054249
4.0 stars: 13279170
3.5 stars: 8263976
3.0 stars: 5351982
2.5 stars: 3762371
2.0 stars: 2554782
1.5 stars: 1708938
1.0 stars: 1114495
0.5 stars: 653753
0.0 stars: 230741
===================
novels-words
5.0 stars: 6454972
4.5 stars: 1873325
4.0 stars: 1160410
3.5 stars: 868633
3.0 stars: 622583
2.5 stars: 520851
2.0 stars: 425683
1.5 stars: 371032
1.0 stars: 299420
0.5 stars: 243498
0.0 stars: 174773
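
For anyone who wants to reproduce the two kinds of tally above (number of items per star rating, and total occurrences per star rating), here is a minimal sketch. It assumes each corpus entry has already been scored and is available as an (item, stars, occurrences) tuple; the function name and the sample numbers are made up for illustration and are not taken from either script in this thread.

```python
from collections import defaultdict


def tally_by_stars(entries):
    """Group scored entries by star rating.

    entries: iterable of (item, stars, occurrences) tuples (assumed layout).
    Returns {stars: (item_count, total_occurrences)}, highest rating first.
    """
    counts = defaultdict(int)
    volume = defaultdict(int)
    for _item, stars, occurrences in entries:
        counts[stars] += 1            # one more item at this rating
        volume[stars] += occurrences  # add its raw frequency ("volume")
    return {s: (counts[s], volume[s]) for s in sorted(counts, reverse=True)}


# Made-up sample data, only to show the shape of the output.
sample = [("上", 5.0, 1200), ("下", 5.0, 900), ("大人", 4.0, 150), ("一人", 0.5, 12)]

for stars, (n_items, total) in tally_by_stars(sample).items():
    print(f"{stars:.1f} stars: {n_items} items, {total} occurrences")
```

The first number per rating corresponds to the item-count listing and the second to the 'volume' listing.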

I wish I was in academics so I could get the BCCWJ corpus on DVD for relatively cheap. The DVD has all the raw data, which would be fantastic for research. The price for non-academic is way too high for me to justify. :crying_cat_face:

Stella, what’s the source of the wiki-words data? It really does not seem right to me… the number of words at each percentile should be falling much faster than it does. It seems like the data is incomplete still, or like there is maybe a rounding or overflow error in the volume field. The total I arrive at if I sum wiki-words volumes is 180743610. Does that match what you have?
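
The total quoted above can be checked directly against the wikipedia-words volume listing. This is just the arithmetic, with the per-rating volumes copied from that listing; the variable names are my own.

```python
# Per-star volumes copied from the wikipedia-words listing above.
wiki_words_volume = {
    5.0: 118769153,
    4.5: 25054249,
    4.0: 13279170,
    3.5: 8263976,
    3.0: 5351982,
    2.5: 3762371,
    2.0: 2554782,
    1.5: 1708938,
    1.0: 1114495,
    0.5: 653753,
    0.0: 230741,
}

total = sum(wiki_words_volume.values())
print(total)  # 180743610

# Share of the summed volume that falls in each star bucket.
for stars, volume in wiki_words_volume.items():
    print(f"{stars:.1f} stars: {volume / total:.2%}")
```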

I hadn’t heard of this, but it appears the data is free to the public. Going to poke around!

I haven’t looked at the volumes total… There are only ~5.5k words in the list, since I removed all words not in WK.

You should put them back in, that’s seriously screwing up the ranks!

I mean… the original data is available here as a TSV. Alternatively, if all you need is the total…

In total, 669,419,716 occurrences of 2,610,776 lemmas were counted.

No, I need to calculate ranks with all the data, not just the subset of data that also exists in WK. Excluding all the other words causes the ranking to only rank WK content, which means that by definition there will be WK content at 0 rank. It’s possible (and perhaps likely) that by counting all the data, all the items from WK will be bumped up to 3 or 4 from the 0…2 slots.

I should say excluding all those words AFTER the rank calculation is done, to shrink the file size, is fine. Doing it BEFORE the rank is calculated is disastrous to the ranking algorithm. If your scores were calculated before you excluded the data, then they’re fine.
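
To make the ordering issue concrete, here is a small sketch comparing "rank, then filter" with "filter, then rank". The percentile function is a simplified stand-in (fraction of items that occur less often), not the actual ranking algorithm used in this thread, and the frequencies and the non-WK placeholder words are invented.

```python
def percentile_ranks(freqs):
    """Map each item to the fraction of items in `freqs` that occur less often."""
    ordered = sorted(freqs, key=freqs.get)  # least frequent first
    n = len(ordered)
    return {item: i / n for i, item in enumerate(ordered)}


# Invented frequencies; "non_wk_1" etc. stand in for words that are not in WK.
corpus = {
    "の": 1000, "上": 500, "下": 400, "大人": 50, "一人": 40,
    "蛮": 8, "煩": 5, "non_wk_1": 3, "non_wk_2": 2, "non_wk_3": 1,
}
wk_items = {"上", "下", "大人", "一人", "蛮", "煩"}

# Fine: rank against the whole corpus, then drop non-WK items to shrink the file.
full_ranks = percentile_ranks(corpus)
rank_then_filter = {k: v for k, v in full_ranks.items() if k in wk_items}

# Problematic: drop non-WK items first, then rank only the survivors.
filter_then_rank = percentile_ranks(
    {k: v for k, v in corpus.items() if k in wk_items}
)

for item in ("上", "煩"):
    print(item, rank_then_filter[item], filter_then_rank[item])
```

In the filtered-first case the least frequent WK item always lands at percentile 0, no matter how common it is relative to the rest of the corpus.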

If I pull in the TSV but limit it to the first 100,000 records (in an effort to eliminate the crazy edge cases, punctuation, etc), the scoring I get for the WK test data is:

上: 5 stars
下: 5 stars
大人: 5 stars
一人: 2 stars
煩: 1 star
蛮: 3 stars

A bump to 200,000 rows takes quite a while for my script to chew but significantly changes the results. Specifically the 1-3 star guys.

一人: 2 stars → 3.5 stars
煩: 1 star → 3 stars
蛮: 3 stars → 4 stars
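
A rough sketch of reading only the first N rows of a frequency TSV, as in the 100,000 and 200,000 row runs above. The column layout is an assumption (lemma in the first column, count in the second, rows already sorted by descending frequency); check the real file before relying on it. The function name and the demo data are hypothetical.

```python
import csv
import io
from itertools import islice


def load_top_rows(handle, limit, lemma_col=0, count_col=1):
    """Read at most `limit` rows from a tab-separated file object.

    Assumed layout: lemma in `lemma_col`, integer count in `count_col`.
    Rows that are too short or have a non-numeric count are skipped.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    freqs = {}
    for row in islice(reader, limit):
        if len(row) <= max(lemma_col, count_col):
            continue  # malformed or short row
        try:
            count = int(row[count_col])
        except ValueError:
            continue  # header line or junk in the count column
        freqs[row[lemma_col]] = freqs.get(row[lemma_col], 0) + count
    return freqs


# Tiny in-memory stand-in for the real TSV (invented counts).
demo = io.StringIO("の\t1000\n上\t500\n下\t400\n煩\t5\n")
print(load_top_rows(demo, limit=3))
```

With a real file you would pass something like `open(path, encoding="utf-8", newline="")` as the handle. Because `islice` stops after `limit` rows, the rest of the file is never read.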

As for 娘婿 and 解剖学, those two level 60 WK vocab words do not appear in the TSV at all, so either WK is feeding us some pretty random vocab at level 60, or the wiki data isn’t representative.

I’ll continue to poke and prod at this tomorrow, as I do agree that the conversion of a percentile to a rank isn’t quite where I feel it should be. I think it’s the star ranking that should be logarithmic, so that a 4-star is twice as important as a 3-star, and a 3-star twice as important as a 2-star. Does that sound like it might be reasonable?

I don’t think the raw tagged data is available. You can access certain kinds of info through Kotonoha.

Different ranking. This time I start with the top 0.1% of items and keep doubling the width of each band until I can’t anymore. Rank 10 starts at percentile 0.999 (1.0 - 0.001), rank 9 at 0.997 (0.999 - (2 * 0.001)), rank 8 at 0.993 (0.997 - (2 * 0.002)), and so on down to rank 1, which is anything below 0.489. Here are the statistics for my two kanji lists and the first 200,000 items of the wikipedia TSV.

======= wikipedia-kanji
rank    items   cumui   corpus  cumucor coverage        cumucov
10      21      21       0.10%   0.10%  16.23999%       16.24%
9       42      63       0.20%   0.30%  14.06339%       30.30%
8       84      147      0.40%   0.70%  17.64539%       47.95%
7       167     314      0.80%   1.50%  18.78062%       66.73%
6       335     649      1.60%   3.10%  17.57417%       84.30%
5       670     1319     3.20%   6.30%  11.69022%       95.99%
4       1339    2658     6.40%  12.70%   3.61576%       99.61%
3       2676    5334    12.78%  25.48%   0.36982%       99.98%
2       5267    10601   25.16%  50.64%   0.01878%       100.00%
1       10331   20932   49.36%  100.00%  0.00186%       100.00%
======= kanji_easy
rank    items   cumui   corpus  cumucor coverage        cumucov
10      2       2        0.11%   0.11%   5.76342%        5.76%
9       3       5        0.17%   0.29%   4.23861%       10.00%
8       7       12       0.40%   0.69%   6.77026%       16.77%
7       14      26       0.80%   1.49%   9.05716%       25.83%
6       28      54       1.60%   3.09%  12.64119%       38.47%
5       56      110      3.20%   6.29%  15.01727%       53.49%
4       112     222      6.40%  12.69%  16.31883%       69.81%
3       226     448     12.91%  25.60%  16.35867%       86.17%
2       442     890     25.26%  50.86%  10.71605%       96.88%
1       860     1750    49.14%  100.00%  3.11856%       100.00%
======= wikipedia-20150422-lemmas
rank    items   cumui   corpus  cumucor coverage        cumucov
10      200     200      0.10%   0.10%  49.20250%       49.20%
9       400     600      0.20%   0.30%  10.05814%       59.26%
8       800     1400     0.40%   0.70%   9.06969%       68.33%
7       1600    3000     0.80%   1.50%   8.38129%       76.71%
6       3200    6200     1.60%   3.10%   7.50378%       84.22%
5       6403    12603    3.20%   6.30%   6.08406%       90.30%
4       12795   25398    6.40%  12.70%   4.47309%       94.77%
3       25625   51023   12.81%  25.51%   2.84684%       97.62%
2       51167   102190  25.58%  51.09%   1.58420%       99.20%
1       97810   200000  48.91%  100.00%  0.79641%       100.00%

Here are the sample lookup scores, in the same order as the above corpora. -1 means the item does not appear in that corpus. The star rank is just this number divided by two.

上:   9/7/10
下:   8/4/9
大人: -1/-1/7
一人: -1/-1/2
煩:   4/1/2
蛮:   4/-1/3
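
The doubling scheme described above can be sketched as follows. This is a reconstruction from the description, not the script that produced the tables: the function names are invented, a percentile of 1.0 is taken to mean "most frequent", and treating each bound as inclusive (>=) is an assumption.

```python
def doubling_thresholds(top=0.001, ranks=10):
    """Lower percentile bound for each rank from `ranks` down to 2.

    The top band is `top` wide and each following band is twice as wide.
    Rank 1 has no bound of its own: it is everything below rank 2.
    """
    thresholds = {}
    bound, width = 1.0, top
    for rank in range(ranks, 1, -1):
        bound -= width
        thresholds[rank] = round(bound, 6)  # round away float noise
        width *= 2
    return thresholds


def rank_for(percentile, thresholds):
    """Return the highest rank whose lower bound the percentile reaches."""
    for rank in sorted(thresholds, reverse=True):
        if percentile >= thresholds[rank]:  # inclusive bound is an assumption
            return rank
    return 1


thresholds = doubling_thresholds()
for rank, bound in thresholds.items():
    print(f"rank {rank:2d}: percentile >= {bound:.3f}")

# Star rating is the rank divided by two, as stated above.
rank = rank_for(0.9995, thresholds)
print(rank, rank / 2)
```

The bounds come out as 0.999, 0.997, 0.993, 0.985, 0.969, 0.937, 0.873, 0.745 and 0.489, which lines up with the cumulative item shares in the tables (0.10%, 0.30%, 0.70% and so on).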