Small bug fixed; uploaded the new results. Here's a quick tally of my star ratings for what's in WK:
Stars Novels Wiki
5.0 45 33
4.5 227 272
4.0 340 368
3.5 400 448
3.0 438 529
2.5 549 648
2.0 551 728
1.5 604 829
1.0 729 898
0.5 913 732
0.0 764 102
For the most part it looks like Koichi did a reasonable job, and the novels corpus is probably a little silly. The 0-star score for Wikipedia is impressive.
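A tally like the one above only takes a few lines. A minimal sketch, assuming the scored output is a TSV of "item&lt;TAB&gt;stars" per line (the actual file layout here is a guess):

```python
from collections import Counter

def tally_stars(path):
    """Count how many items fall at each star rating in a TSV of item<TAB>stars."""
    counts = Counter()
    with open(path, encoding='utf-8') as f:
        for line in f:
            item, stars = line.rstrip('\n').split('\t')
            counts[float(stars)] += 1
    return counts
```

Printing `tally_stars(path)` sorted descending (`for s in sorted(tally, reverse=True): print(f"{s:.1f} stars: {tally[s]}")`) reproduces the layout used in this thread.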
I literally just downloaded both files and ran my processing on them; not sure if I missed the new version or not. Grabbing 'em again.
When I run it with the same few tests as before I get
上 5 5
下 5 5
大人 4 4
一人 0 0.5
煩 none none
蛮 none none
娘婿 none none
解剖学 none none
The first value is from your wiki file, the second from the novels file.
Are you able to run your algorithm and then tally all of the results based on their star rating?
===================
wikipedia
5.0 stars: 384
4.5 stars: 385
4.0 stars: 387
3.5 stars: 388
3.0 stars: 389
2.5 stars: 390
2.0 stars: 403
1.5 stars: 427
1.0 stars: 472
0.5 stars: 669
0.0 stars: 16638
===================
nhk1
5.0 stars: 30
4.5 stars: 32
4.0 stars: 34
3.5 stars: 35
3.0 stars: 45
2.5 stars: 62
2.0 stars: 57
1.5 stars: 76
1.0 stars: 127
0.5 stars: 201
0.0 stars: 1051
===================
wikipedia-words
5.0 stars: 431
4.5 stars: 432
4.0 stars: 434
3.5 stars: 447
3.0 stars: 442
2.5 stars: 465
2.0 stars: 483
1.5 stars: 508
1.0 stars: 542
0.5 stars: 619
0.0 stars: 784
===================
novels-words
5.0 stars: 231
4.5 stars: 238
4.0 stars: 238
3.5 stars: 260
3.0 stars: 264
2.5 stars: 309
2.0 stars: 346
1.5 stars: 416
1.0 stars: 504
0.5 stars: 720
0.0 stars: 2034
Very interesting… How do you feel about those results?
I’m ambivalent. I’d like to see the growth look a bit more exponential, but overall I think it’s probably fairly accurate. Do you know of any preexisting list of all the WK kanji & vocab? I’d like to see how many WK items there are at each rank for each corpus. I’ve held off so far just because I don’t feel like manually copy/pasting every kanji & vocab page into a document and then manually turning it into a list I can process.
Here’s the same thing showing total occurrences (‘volume’) rather than simply the number of items within the corpus. Maybe this looks closer to what you were expecting?
===================
wikipedia
5.0 stars: 563364442
4.5 stars: 125908111
4.0 stars: 52021331
3.5 stars: 22871033
3.0 stars: 10280231
2.5 stars: 4885554
2.0 stars: 2473432
1.5 stars: 1260456
1.0 stars: 674703
0.5 stars: 421330
0.0 stars: 403900
===================
nhk1
5.0 stars: 44213
4.5 stars: 21075
4.0 stars: 14444
3.5 stars: 11142
3.0 stars: 10731
2.5 stars: 11490
2.0 stars: 8355
1.5 stars: 8459
1.0 stars: 9654
0.5 stars: 9067
0.0 stars: 9488
===================
wikipedia-words
5.0 stars: 118769153
4.5 stars: 25054249
4.0 stars: 13279170
3.5 stars: 8263976
3.0 stars: 5351982
2.5 stars: 3762371
2.0 stars: 2554782
1.5 stars: 1708938
1.0 stars: 1114495
0.5 stars: 653753
0.0 stars: 230741
===================
novels-words
5.0 stars: 6454972
4.5 stars: 1873325
4.0 stars: 1160410
3.5 stars: 868633
3.0 stars: 622583
2.5 stars: 520851
2.0 stars: 425683
1.5 stars: 371032
1.0 stars: 299420
0.5 stars: 243498
0.0 stars: 174773
I wish I was in academics so I could get the BCCWJ corpus on DVD for relatively cheap. The DVD has all the raw data, which would be fantastic for research. The price for non-academic is way too high for me to justify.
Stella, what’s the source of the wiki-words data? It really does not seem right to me… the number of words at each percentile should be falling much faster than it does. It seems like the data is still incomplete, or there’s maybe a rounding or overflow error in the volume field. The total I arrive at if I sum the wiki-words volumes is 180743610. Does that match what you have?
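A cross-check like that is just a column sum. A sketch, assuming volume is the second tab-separated column of the TSV (the column index is a guess):

```python
def total_volume(path, col=1):
    """Sum one integer column of a TSV; col is zero-based."""
    with open(path, encoding='utf-8') as f:
        return sum(int(line.rstrip('\n').split('\t')[col]) for line in f)
```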
I hadn’t heard of this, but it appears the data is free to the public. Going to poke around!
I haven’t looked at the volumes total… There are only ~5.5k words in the list; I removed all words not in WK.
You should put them back in, that’s seriously screwing up the ranks!
I mean… the original data is available here as a TSV. Alternatively, if all you need is the total…
In total, 669,419,716 occurrences of 2,610,776 lemmas were counted.
No, I need to calculate ranks with all the data, not just the subset of data that also exists in WK. Excluding all the other words causes the ranking to only rank WK content, which means that by definition there will be WK content at 0 rank. It’s possible (and perhaps likely) that by counting all the data, all the items from WK will be bumped up to 3 or 4 from the 0…2 slots.
I should say excluding all those words AFTER the rank calculation is done, to shrink the file size, is fine. Doing it BEFORE the rank is calculated is disastrous to the ranking algorithm. If your scores were calculated before you excluded the data, then they’re fine.
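A toy demonstration of the point above, with made-up counts. Percentile here is just rank position within a frequency-sorted list (1.0 = most frequent); against the full corpus a rare WK word still outranks the long tail of non-WK words, but against the WK-only subset it falls to the bottom slot by definition:

```python
def percentile(word, counts):
    """Percentile by rank position in a frequency-sorted list; 1.0 = most frequent."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return 1.0 - ordered.index(word) / len(ordered)

full = {'の': 900, 'は': 800, 'が': 700, '上': 300, '下': 250, '煩': 5}
full.update({f'rare{i}': 1 for i in range(20)})   # long tail of non-WK words

wk = {w: full[w] for w in ('上', '下', '煩')}      # pretend only these are in WK

print(round(percentile('煩', full), 2))   # ~0.81, ranked against everything
print(round(percentile('煩', wk), 2))     # ~0.33, ranked against WK items only
```

Filtering after the percentile calculation leaves both numbers for 煩 identical; filtering before it is what produces the 0-rank WK items.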
If I pull in the TSV but limit it to the first 100,000 records (in an effort to eliminate the crazy edge cases, punctuation, etc), the scoring I get for the WK test data is:
上: 5 stars
下: 5 stars
大人: 5 stars
一人: 2 stars
煩: 1 star
蛮: 3 stars
A bump to 200,000 rows takes quite a while for my script to chew through, but it significantly changes the results. Specifically the 1-3 star items:
一人: 2 stars → 3.5 stars
煩: 1 star → 3 stars
蛮: 3 stars → 4 stars
As for 娘婿 and 解剖学, those two level 60 WK vocab words do not appear in the TSV at all, so either WK is feeding us some pretty random vocab at level 60, or the wiki data isn’t representative.
I’ll continue to poke and prod at this tomorrow, as I do agree that the conversion of a percentile to a rank isn’t quite where I feel it should be; it’s the star ranking that should be logarithmic, I think, so that a 4-star is twice as important as a 3-star, and a 3-star twice as important as a 2-star. Does that sound reasonable?
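One possible reading of "each star doubles importance": anchor 5 stars at the corpus's most frequent item and take log2 of the frequency ratio, so every halving of frequency costs one star. A sketch under those assumptions (the anchor choice and function name are mine, not anything from the thread):

```python
import math

def stars(count, top_count):
    """Map an occurrence count to a 0-5 star score where each extra star
    represents a doubling of frequency, anchored at the most frequent item."""
    if count <= 0:
        return 0.0
    raw = 5.0 + math.log2(count / top_count)
    return max(0.0, min(5.0, round(raw * 2) / 2))  # clamp, round to half stars
```

With this mapping, an item at half the top frequency gets 4 stars, a quarter gets 3 stars, and so on, which matches the "twice as important per star" intuition.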
I don’t think the raw tagged data is available. You can access certain kinds of info through Kotonoha.
Different ranking. This time I start with the top 0.1 percentile and keep doubling it until I can’t anymore. Rank 10 is 0.999 (1.0 - 0.001), rank 9 is 0.997 (0.999 - (2 * 0.001)), rank 8 is 0.993 (0.997 - (2 * 0.002)), and so on until rank 1, which is anything below 0.489. Here are the statistics for my two kanji lists and the first 200,000 items of the Wikipedia TSV.
======= wikipedia-kanji
rank  items  cum.items  corpus  cum.corpus  coverage  cum.coverage
10 21 21 0.10% 0.10% 16.23999% 16.24%
9 42 63 0.20% 0.30% 14.06339% 30.30%
8 84 147 0.40% 0.70% 17.64539% 47.95%
7 167 314 0.80% 1.50% 18.78062% 66.73%
6 335 649 1.60% 3.10% 17.57417% 84.30%
5 670 1319 3.20% 6.30% 11.69022% 95.99%
4 1339 2658 6.40% 12.70% 3.61576% 99.61%
3 2676 5334 12.78% 25.48% 0.36982% 99.98%
2 5267 10601 25.16% 50.64% 0.01878% 100.00%
1 10331 20932 49.36% 100.00% 0.00186% 100.00%
======= kanji_easy
rank  items  cum.items  corpus  cum.corpus  coverage  cum.coverage
10 2 2 0.11% 0.11% 5.76342% 5.76%
9 3 5 0.17% 0.29% 4.23861% 10.00%
8 7 12 0.40% 0.69% 6.77026% 16.77%
7 14 26 0.80% 1.49% 9.05716% 25.83%
6 28 54 1.60% 3.09% 12.64119% 38.47%
5 56 110 3.20% 6.29% 15.01727% 53.49%
4 112 222 6.40% 12.69% 16.31883% 69.81%
3 226 448 12.91% 25.60% 16.35867% 86.17%
2 442 890 25.26% 50.86% 10.71605% 96.88%
1 860 1750 49.14% 100.00% 3.11856% 100.00%
======= wikipedia-20150422-lemmas
rank  items  cum.items  corpus  cum.corpus  coverage  cum.coverage
10 200 200 0.10% 0.10% 49.20250% 49.20%
9 400 600 0.20% 0.30% 10.05814% 59.26%
8 800 1400 0.40% 0.70% 9.06969% 68.33%
7 1600 3000 0.80% 1.50% 8.38129% 76.71%
6 3200 6200 1.60% 3.10% 7.50378% 84.22%
5 6403 12603 3.20% 6.30% 6.08406% 90.30%
4 12795 25398 6.40% 12.70% 4.47309% 94.77%
3 25625 51023 12.81% 25.51% 2.84684% 97.62%
2 51167 102190 25.58% 51.09% 1.58420% 99.20%
1 97810 200000 48.91% 100.00% 0.79641% 100.00%
Here are the sample lookup scores, in the same order as the above corpora. -1 means the item does not appear in that corpus. The star rank is just this number divided by two.
上: 9/7/10
下: 8/4/9
大人: -1/-1/7
一人: -1/-1/2
煩: 4/1/2
蛮: 4/-1/3
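The doubling-percentile scheme described above can be sketched directly from its definition (the function name is mine; the thresholds follow the stated 0.999 / 0.997 / 0.993 / … / 0.489 breakpoints):

```python
def rank_from_percentile(p):
    """Assign a 1-10 rank from a frequency percentile: rank 10 covers the
    top 0.1% (p >= 0.999), and each lower rank covers twice the slice of
    the one above it, until rank 1 catches everything below 0.489."""
    width = 0.001
    threshold = 1.0 - width
    for rank in range(10, 1, -1):
        if p >= threshold:
            return rank
        width *= 2
        threshold -= width
    return 1
```

Dividing the result by two gives the half-star score used in the lookup table above.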