Extension Suggestion: Corpora Rankings

I have downloaded the two new files and am incorporating them into the combined file now. I do have one question, though: did you eliminate the non-WK words from these as well? If so, what is the source for these two? Removing the non-WK words before calculating the ranking is like eliminating numbers outside of a certain range before calculating an average – it really invalidates the calculation. In this case, it will (most likely) hurt the ranking of every item in the list.
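
To make the concern concrete, here's a toy illustration (made-up counts, not either of our actual scripts). When the removed non-WK entries sit at the low end of the frequency range, every remaining word's percentile drops:

    # Toy illustration with made-up counts (not either of our actual scripts):
    # a generic "fraction of entries at or below this one" percentile.
    freqs = {"猫": 1000, "走る": 500, "煩": 200, "junk1": 10, "junk2": 5}
    wk_words = {"猫", "走る", "煩"}

    def percentile(word, corpus):
        counts = list(corpus.values())
        return sum(1 for c in counts if c <= corpus[word]) / len(counts)

    # Rank against the full corpus, then keep only the WK words:
    full = {w: percentile(w, freqs) for w in wk_words}        # 煩 -> 0.60
    # Remove the non-WK words first, then rank; the percentiles drop:
    wk_only = {k: v for k, v in freqs.items() if k in wk_words}
    filtered = {w: percentile(w, wk_only) for w in wk_words}  # 煩 -> 0.33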

The combined file I’ve generated has two scores in it, and can easily be extended to more or reduced to just one. The two scores in it now are my 11-step percentile based calculation, and the formula you gave me approximating your manual division.
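
For clarity, here's roughly the shape of the first of those two scores. This is a sketch only: my script's actual breakpoints aren't exactly a straight rounding, and I'm not trying to reproduce your formula here.

    # Sketch of the 11-step percentile score: bucket a 0.0-1.0 percentile into
    # the integers 0..10. My script's actual breakpoints aren't exactly this.
    def eleven_step_score(percentile):
        if percentile is None:      # item not present in the corpus
            return -1
        return round(percentile * 10)

    print(eleven_step_score(0.988295))   # 10
    print(eleven_step_score(0.43))       # 4
    print(eleven_step_score(None))       # -1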

When I run my test lookup against all of the files I have now, this is the result. Scores range from -1 (not present) to 10 (5 stars). The first value is “my” score and the second is “your” score – computed from your formula, not read from your files, since the precomputed rating only exists in the files you provided and not in the ones I generated myself.

token   file 1  file 2  file 3  file 4  file 5  file 6  file 7  file 8
上      10 10   10 10   10 10   10 10   10 10   10 10   10 10   10 10
下      10 10   10 7    10 9    10 10   10 8    10 9    10 10   10 9
大人    -1      -1      -1      10 10   9 4     10 7    9 5     8 3
一人    -1      -1      -1      9 5     -1      10 6    4 0     0 0
煩      9 5     1 0     2 0     9 4     -1      -1      -1      -1
蛮      9 5     -1      3 0     10 7    -1      -1      -1      -1
file 1 == wikipedia-kanji
file 2 == kanji_easy
file 3 == nozaki
file 4 == wikipedia-20150422-lemmas
file 5 == stella-internet-words
file 6 == stella-news-words
file 7 == stella-novels-words
file 8 == stella-wikipedia-words

The score I’m calculating from your formula looks a bit off here. I haven’t compared it with what’s in your file, but thinking back to the confusion we had, it’s possible you meant something other than percentile in the formula? If so I’ll re-run this. I mention it because in our discussion your score typically turned out higher than mine (as you were concerned about too many 0-2 star items), but the reverse seems to be true here – my scores are consistently higher whenever the item exists in the corpus.

Of course, if all of your files are sent to me with the non-WK words already removed, that would skew the results in exactly this way.

I absolutely do not care about this. The script will be released under the GPL. You and anyone else are free to do anything you want with it (except release it under a different license, of course).

I can accept that, I just don’t like the unscientific nature of the arbitrary divisions, and I also don’t really understand what you’re after (your “requirements”) with the scores. It sounded to me like you wanted to keep massaging the data to reduce the number of low scoring items, which seems a bit dishonest and counterproductive. If a word in WK really is unimportant to understanding a corpus, then the score should reflect that, otherwise what’s the point?

Still WIP eyecandy. Score A is mine, score B is your formula as calculated by me:


Oooh! Pretty!

Yeah, there’s a bunch of confusion here…

Here’s the data for 有る:

  • Internet: {"volume":8698,"rank":1953,"percentage":0.000034369695640634745,"percentile":0.391009658575058,"rating":"5.0"}
  • News: {"volume":912,"rank":6840,"percentage":0.000015193455510598142,"percentile":0.8826722502708435,"rating":"2.0"}
  • Novels: {"volume":3064,"rank":1746,"percentage":0.00006929067603778094,"percentile":0.7238367795944214,"rating":"3.5"}
  • Wikipedia: {"volume":11121,"rank":5838,"percentage":0.000020771049094037153,"percentile":0.7841780781745911,"rating":"3.0"}

As such, the ratings should be 5.0, 2.0, 3.5 and 3.0, rather than the 1.0, 0.5, 2.0 and 0.5 you have displayed. There’s no calculation necessary; the numbers are right there in plaintext. If you want to reproduce the scores, you’d have to start my work over from scratch, and I’m not sure why you’d want to do that (just as an intellectual exercise?). The sources are all listed on this page (I didn’t have to go searching for anything more), so just poke around in the links above if you want to redo the calculations.

And yes, all of those lists have non-WK words removed. I do so after calculating the scores, to limit the amount of storage and bandwidth required, to simplify lookup, and to minimize the performance impact of lookup for display.

Oh, speaking of limiting bandwidth, it also occurred to me that it would probably be best if we considered going with TSV rather than JSON to reduce the size of the files as much as possible. The conversion itself should have minimal performance impact, and it might genuinely cut the size of the files by as much as half. Something to consider…
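
Something like this is what I'm picturing, assuming entries shaped like the 有る data above (the filenames here are just placeholders):

    # Sketch of a JSON -> TSV conversion, assuming entries shaped like the
    # 有る data above. Filenames are placeholders.
    import json

    with open("words.json", encoding="utf-8") as f:
        data = json.load(f)

    with open("words.tsv", "w", encoding="utf-8") as out:
        out.write("word\tvolume\trank\tpercentage\tpercentile\trating\n")
        for word, e in data.items():
            out.write("\t".join(str(x) for x in (
                word, e["volume"], e["rank"], e["percentage"],
                e["percentile"], e["rating"])) + "\n")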

That explains things.

It’s not that I “want” to; it’s that the script I’ve written to process the lists treats all of the files the same way. The only difference is in the first step, when it reads a file in, since they’re all in different formats. After that I treat them identically, indexing on the character/word and calculating scores. Since not all of the files (obviously) have your score in them, I have to calculate it myself, so I just do it for all of them.

Of course I can work around the issue by adding special handling for your files, but just running the original files would be easier. I’ll poke around the thread.

I’m not (yet) concerned about the file size. If I don’t pretty-print the file and I remove everything except the scores, it’s under 1MB, and as I mentioned before the file is only going to be downloaded once and then cached in the browser indefinitely. Changing it to a CSV (or TSV or *SV) would certainly shrink it further, but it doesn’t seem worth the cost of the additional script processing required on the file.

Wow, this is a great idea. Super useful.

I don’t know if it’s possible, but it would be cool if it could somehow check the words in Google and rank them by how many pages they appear on.

I’m always impressed by how great the wk community is.

I’ve done this and I want to double-check. Do these ‘internal’ names of yours match up with the sources you used?

internet-words: http://corpus.leeds.ac.uk/frqc/internet-jp.num
news-words: ???
novels-words: “base_aggregates.zip” from http://ftp.monash.edu.au/pub/nihongo/00INDEX.html
wikipedia-words: Is this the wiktionary (not wikipedia) list? https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/Japanese2015_10000

Looks about right. news-words is from the same place as the novels, and it’s titled “edict_dupefree_freq_distribution”. Be careful with the wiktionary one: get the full 95 MB version, not the 20k abridged one.

Will do. It’ll probably be some time tomorrow before I report in, as I’ll have to duplicate some of your work – preprocessing to remove punctuation and such. Also, full disclosure, I started the early access Mass Effect demo/preview yesterday and really like it, so it’s going to be eating into my free time for a bit. Apologies in advance! :smiley:
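
(For reference, here's roughly the kind of preprocessing I mean – a sketch only, not your exact rules or mine: strip common ASCII and Japanese punctuation from each token before counting.)

    # Sketch of the preprocessing step: strip common ASCII and Japanese
    # punctuation and whitespace before counting. Not the exact rule set
    # either of us used.
    import re

    PUNCT = re.compile(r"[、。，．・「」『』（）()!?！？：；:;\s]+")

    def clean(token):
        return PUNCT.sub("", token)

    print(clean("「有る」。"))   # 有る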

Sorry if this has been discussed, but is there any such thing as a “conversation corpus”?

I mean… does the Twitter corpus count?

I guess it’s the closest thing, but not really. People don’t talk in conversation exactly the same way they do on Twitter. I just wasn’t sure if any research had been done on conversational word frequency.

Maybe. Although, to be clear, the word “corpus” implies a body of written work, so if you’re googling around you might try something more like “conversation word frequency list”, because corpora are always written.

OK, I’ve got the Leeds file in and processed, and the kamerman novels file is running now. I’ll compare the data I get from calculating the scores against the values you provided when that’s finished. The wiktionary file I already have; it’s the one listed as “wikipedia-20150422-lemmas”, which is the filename.

I need to finish cleaning up the js and get the display in place on the lesson and review pages, as it’s still only showing on the vocab & kanji pages you see above; then it’ll be ready for a first release. I want to figure out if I can make the stars look better (the outline is a partially functional hack) and also more cleanly eliminate (but still list) corpora that don’t have the item in them, but those things probably shouldn’t delay a release.

That’s all assuming my score lines up with yours when I’m done processing this file. :wink:

I still have some discrepancies here that I can’t explain, but it looks like they’re related to differences in processing the raw file. For example, this is the entry from the wikipedia words file you sent me:

    "有る": {
        "volume": 11121,
        "rank": 5853,
        "percentage": 1.6612895706203e-5,
        "percentile": 0.82738345861435,
        "rating": "2.5"
    },

When I process the raw file myself (wikipedia-20150422-lemmas.tsv) I end up with:

        "有る": {
            "rank": 5853,
            "occur": 11121,
            "freq": 1.6699924157927e-5,
            "perc": 0.988295,
            "irank": 10,
            "srank": 9
        },

It’s obvious that we’re calculating the frequency a little differently, and the percentile very differently. This might be due to how many entries we each used from the main file. I used the first 500,000; did you use more or fewer? If you used significantly fewer, I could see your percentile, and thus your rating for this word, falling by quite a bit.
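
For reference, here's the arithmetic I'm assuming behind my "perc" value: a percentile of the form 1 - rank/N, which roughly reproduces my 0.988 at N around 500,000. I don't know what N or formula your script used; the second N below is just the one that would land near your 0.827 for the same rank.

    # Sketch of why the entry count matters, assuming a percentile of the
    # form 1 - rank/N. Neither N here is necessarily what your script used.
    def percentile(rank, total_entries):
        return 1 - rank / total_entries

    print(percentile(5853, 500_000))   # ~0.9883  (close to my 0.988295)
    print(percentile(5853, 33_900))    # ~0.8273  (close to your 0.827383)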

It took a little while for some new lessons to come up for me to test with (full disclosure: I completely forgot the last time or two). Ignoring the drop shadow white/grey boxing for a minute, as well as the scores, there’s a bit of an issue with showing the data on the lesson page. A picture is worth 1000 words, so…

This may be a dumb question, but some of the words in WaniKani are shown in kanji even though they are often not actually written with kanji. To name a few: 「大した」, 「有る」, 「平仮名」, 「片仮名」, 「大体」, etc.

If the search you are doing in these corpora looks for the kanji string, some very common words will have artificially low values (such as 「有る」 not being found in Wikipedia).

You might have already considered this (in which case feel free to ignore this), but it’s worth pointing out if not.

Not a stupid question at all. It really is a problem. Unfortunately, each corpus builder will have addressed that as they saw fit, and most of them don’t say how they handled it. For the time being, I’ve just acted naively, and treated them as separate words.
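
If it ever becomes worth handling properly, one option (not what the script does today) would be to merge the counts of a WK item and its kana spelling before scoring. Something like this sketch, where the variant mapping is entirely made up and would have to come from a real source:

    # Not what the script currently does (spellings are treated as separate
    # words). A sketch of merging counts for alternate writings; the variant
    # mapping here is made up.
    variants = {"有る": ["ある"], "大した": ["たいした"]}

    def merged_volume(word, corpus):
        total = corpus.get(word, {}).get("volume", 0)
        for alt in variants.get(word, []):
            total += corpus.get(alt, {}).get("volume", 0)
        return total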

Not dead!

Life is going to be extremely complicated through April, so I wouldn’t expect any more progress for a bit after this, but I have done some work this morning that I wanted to share with y’all.

I combined all the vocab data that I want to use (in what I guess will be my version of this extension) and got it onto a permanent public host. In case anyone wants at it, it’s here. I’ll be updating it in that location as I move forward, and the bandwidth costs shouldn’t be any problem, as it’s on S3, which is very cheap!

Also, I’ve been playing with the data. I was curious what it would show if I took one of the corpora, built a graph of the distribution of ratings for all of the words in a particular WK level, and then strung the levels together. So here’s what it looks like:

It’s a bit hard to read, but you can generally see that the higher-rated words become fewer and fewer as you go up in levels. There seems to be a real turning point around levels 20-25, where the distribution shifts from mostly 3 stars and above to mostly below.
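
Roughly how I built those per-level distributions, in case anyone wants to reproduce it (the field names and sample pairs are stand-ins for whatever ends up in the combined file):

    # Sketch of building the per-level rating distributions. The (level,
    # rating) pairs would come from the combined file; these are stand-ins.
    from collections import Counter, defaultdict

    def rating_distribution(items):
        # items: iterable of (wk_level, rating) pairs for one corpus
        by_level = defaultdict(Counter)
        for level, rating in items:
            by_level[level][rating] += 1
        return by_level

    dist = rating_distribution([(1, 5.0), (1, 4.5), (25, 1.0), (25, 0.5)])
    print(dist[25])   # Counter({1.0: 1, 0.5: 1})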

Let me know if you find this stuff interesting and I can post other corpora’s data.

:heart:


Good work guys! がんばって (keep it up)!

I had some inspiration on how to approach the issue of the Anime corpus. You guys may have already come up with this, but I offer it here for your consideration:

Subtitle files

Such files tend to be available for free from various, if admittedly sketchy, 3rd parties, and could be used to build a solid frequency list with minimal parsing.
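
Something like this is what I'm picturing – a sketch only, assuming .srt files. The tokenizer below is a crude stand-in (a real one like MeCab would be needed for a usable list), and the directory name is made up:

    # Sketch: build a word-frequency list from .srt subtitle files.
    import glob, re
    from collections import Counter

    # SRT index lines ("17") and timestamp lines ("00:01:02,345 --> ...").
    SRT_NOISE = re.compile(r"^\d+$|^\d{2}:\d{2}:\d{2},\d{3} -->")

    def tokenize(text):
        # Crude stand-in for a real Japanese tokenizer (e.g. MeCab): just
        # pulls out runs of kana/kanji.
        return re.findall(r"[\u3040-\u30ff\u4e00-\u9fff]+", text)

    counts = Counter()
    for path in glob.glob("subs/*.srt"):        # made-up directory
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line and not SRT_NOISE.match(line):
                    counts.update(tokenize(line))

    for word, n in counts.most_common(20):
        print(word, n)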

Really looking forward to the finished product!

Yup. That was the plan. If you look at the original post, I wrote “Anime/Drama subs”.

I only know of the one site… kitunekko.net, but if you have more suggestions about where to get reliably good subs I would be very grateful!
