Extension Suggestion: Corpora Rankings

Added two more lists to the pile.

One is from a corpus of the entire Internet, and it is probably of the lowest quality that I would want to use at all. It’s bloated with particles, punctuation and roman characters, and there isn’t much I can do to clean it up without really throwing off the quality calculations. It was also collected in 2004. ;_; It’s also badly truncated, so it’s liable to be the shortest list I put up. That said, it still covers almost 2/3 of the words on WK, and I figure it’s relevant, even if the ratings are a bit depressed.

The second one is based on years of 読売 and 毎日 newspaper data, and is very high quality. Probably the best I’ve seen so far. It comes from 2010, and after processing it covers almost every single word in WK. I stripped out just the particles, and the distributions look great.

Looking at this data, I’m realizing I want to spend some more time processing it and trying to filter out less useful entries. I don’t think I’m going to be able to do much with that internet data short of going to the source (which is possible, since the tool that generated it is available and open source), but that will require a deeper dive.

So what’s left for words: we talked about NHK Easy, but given that we have such a superior data set for two other news sources (albeit fully native/non-children’s), I’m a little unsure how much value there is given the jankiness of its provenance. The other one is the vaporware that is my anime corpus. Yeah… I definitely still intend on getting to it, but it’s really going to take some time…

I’m going to spend some more time on data cleanup today and then head off to do other things. Hopefully this is enough to go on!

Did the cleanup I wanted on the newspaper and wikipedia data. Tried to do some cleanup on the internet data and it… fell apart a little. I’m going to stop doing that for now. <_<

For some reason I didn’t get notified about these most recent replies, sorry for not responding! I have been busy cleaning up my processing script and adding some more functionality to it. I have a ‘combined’ final format for the JSON that keeps the data separated by corpus and supports multiple different rankings. When all is said and done, the whole file is only about 2 MB and still has more than it really needs. It has rankings for every WK item in all the corpora that I have: the topokanji wikipedia file, the nozaki kanji file, the full wikipedia file you linked to (I used the first 500,000), the NHK easy kanji file, and both of the files I got from you.
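As a rough sketch of what I mean by "separated by corpus with multiple rankings", stripped down to just the scores: the key names `score_a` and `score_b` and the `lookup` helper are made up for illustration and are not the real file layout. The 大人 values are copied from the lookup table later in this thread.

```python
import json

# Hypothetical layout: corpus name -> token -> scores.
# "score_a"/"score_b" are illustrative key names, not the real ones.
combined = {
    "wikipedia-kanji": {},  # kanji-only corpus, entries omitted here
    "wikipedia-20150422-lemmas": {"大人": {"score_a": 10, "score_b": 10}},
    "stella-news-words": {"大人": {"score_a": 10, "score_b": 7}},
}

NOT_PRESENT = -1


def lookup(token, data=combined):
    """Return {corpus: (score_a, score_b)}; -1 marks a token the corpus lacks."""
    result = {}
    for corpus, entries in data.items():
        entry = entries.get(token)
        if entry is None:
            result[corpus] = NOT_PRESENT
        else:
            result[corpus] = (entry["score_a"], entry["score_b"])
    return result


print(json.dumps(lookup("大人"), ensure_ascii=False))
```

Adding a third ranking would just mean one more key per entry, which is why the format is easy to extend or cut down.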

The two lists you mentioned in your last post but one, where are they?

Hey. All four of the finalized word lists are available from the same google drive link above.

Some stuff has come up in my personal life, and I’m going to need to put off getting back to this for something like a month. That said, as far as I’m concerned, for my goals in this project, with those four files, all that’s needed is the tampermonkey script for grabbing the data and displaying it. If you want to go into kanji frequency, or you prefer a different way of calculating the rating, then of course have at it. Sorry I didn’t get around to the kanji, nor the subtitles corpus, I will double back for both of those.

As for the ratings… I dunno man, I’m really satisfied with my frequency-based approach, and I don’t think I’m going to be persuaded by a ranking-based approach no matter how you massage the data. If you want to go that direction, I don’t know, would you mind horribly if I swapped out your data for mine and put out a different tampermonkey script? All the data is there and exactly how I had hoped for it for my purposes… I really don’t mean to be a jerk or anything, I might be totally wrong from a statistician’s point of view, but it all looks so right to me!

Thanks so much for digging into this with me!

I have downloaded the two new files and am incorporating them into the combined file now. I do have one question though, did you eliminate the non-WK words from these as well? If so, what is the source for these two? Removing the non-WK words before calculating the ranking is like eliminating numbers outside of a certain range before calculating an average – it really invalidates the calculation. In this case, it will (most likely) hurt the ranking of every item in the list.
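Here is a toy example of what I mean. The counts and the "WK" subset are invented, and the rank-based percentile (1 - rank/N) is only an assumption for the demo, not necessarily what either of us computes.

```python
def rank_percentiles(counts):
    """Map each word to 1 - rank/N, where rank 1 is the most frequent word."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    n = len(ordered)
    return {word: 1 - (i + 1) / n for i, word in enumerate(ordered)}


# Made-up counts, for illustration only.
counts = {"a": 900, "b": 500, "c": 300, "d": 120, "e": 60,
          "f": 30, "g": 12, "h": 6, "i": 3, "j": 1}
wk_words = {"a", "b", "d"}  # hypothetical "WK" subset

# Score first, filter afterwards: percentiles reflect the whole corpus.
score_then_filter = {
    w: p for w, p in rank_percentiles(counts).items() if w in wk_words
}

# Filter first, score afterwards: the words are only ranked against each other.
filter_then_score = rank_percentiles(
    {w: c for w, c in counts.items() if w in wk_words}
)

for w in sorted(wk_words):
    print(w, round(score_then_filter[w], 2), round(filter_then_score[w], 2))
```

In this toy data every word scores lower when the filtering happens first, which is the effect I am worried about.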

The combined file I’ve generated has two scores in it, and can easily be extended to more or reduced to just one. The two scores in it now are my 11-step percentile based calculation, and the formula you gave me approximating your manual division.

When I run my test lookup against all six of the corpora I have now, this is the result. Scores are from -1 (not present) to 10 (5 stars). The first value is “my” score, the second is “your” score – from your formula, not the one you provided, since that only exists in the files you provided and not the ones I generated myself.

token   file 1  file 2  file 3  file 4  file 5  file 6  file 7  file 8
上      10 10   10 10   10 10   10 10   10 10   10 10   10 10   10 10
下      10 10   10 7    10 9    10 10   10 8    10 9    10 10   10 9
大人    -1      -1      -1      10 10   9 4     10 7    9 5     8 3
一人    -1      -1      -1      9 5     -1      10 6    4 0     0 0
煩      9 5     1 0     2 0     9 4     -1      -1      -1      -1
蛮      9 5     -1      3 0     10 7    -1      -1      -1      -1
file 1 == wikipedia-kanji
file 2 == kanji_easy
file 3 == nozaki
file 4 == wikipedia-20150422-lemmas
file 5 == stella-internet-words
file 6 == stella-news-words
file 7 == stella-novels-words
file 8 == stella-wikipedia-words
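
For reference, the -1 to 10 scale above converts to stars roughly like this. It is a minimal sketch that assumes each step is worth half a star; the function names are hypothetical and this is not the display code from the script.

```python
def stars(score):
    """Convert a -1..10 score to a star value, or None if the item is absent."""
    if score == -1:
        return None
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return score / 2  # assumption: 10 corresponds to 5 stars, linearly


def render(score):
    """Render a score as text, using a half-star glyph for odd scores."""
    value = stars(score)
    if value is None:
        return "not present"
    full, half = divmod(score, 2)
    glyphs = "★" * full + ("½" if half else "")
    return f"{glyphs} ({value:.1f})".strip()


for s in (-1, 0, 7, 10):
    print(s, render(s))
```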

The score I’m calculating with your formula looks a bit off here. I haven’t compared it with what’s in your file, but thinking back to the confusion we had, it’s possible you meant something other than percentile in the formula? If so I’ll re-run this. I mention it because in our discussion your score was typically turning out higher values than mine (as you were concerned about too many 0-2 star items) but the reverse seems to be true here – my scores are consistently higher when the item exists in the corpus.

Of course, if all of your files are sent to me with the non-WK words already removed, that would skew the results in exactly this way.

I absolutely do not care about this. The script will be released under the GPL. You and anyone else are free to do anything you want with it (except release it under a different license, of course).

I can accept that, I just don’t like the unscientific nature of the arbitrary divisions, and I also don’t really understand what you’re after (your “requirements”) with the scores. It sounded to me like you wanted to keep massaging the data to reduce the number of low scoring items, which seems a bit dishonest and counterproductive. If a word in WK really is unimportant to understanding a corpus, then the score should reflect that, otherwise what’s the point?

Still WIP eyecandy. Score A is mine, score B is your formula as calculated by me:

Oooh! Pretty!

Yeah, there’s a bunch of confusion here…

Here’s the data for 有る:

  • Internet: {"volume":8698,"rank":1953,"percentage":0.000034369695640634745,"percentile":0.391009658575058,"rating":"5.0"}
  • News: {"volume":912,"rank":6840,"percentage":0.000015193455510598142,"percentile":0.8826722502708435,"rating":"2.0"}
  • Novels: {"volume":3064,"rank":1746,"percentage":0.00006929067603778094,"percentile":0.7238367795944214,"rating":"3.5"}
  • Wikipedia: {"volume":11121,"rank":5838,"percentage":0.000020771049094037153,"percentile":0.7841780781745911,"rating":"3.0"}

As such, the ratings should be: 5.0, 2.0, 3.5 and 3.0, rather than the 1.0, 0.5, 2.0 and 0.5 you have displayed. There’s no calculation necessary; the numbers are just there in plaintext. If you want to reproduce the scores, you’d have to start my work over again, and I’m not sure why you would want to do that (just as an intellectual exercise?). The sources are all listed on this page. I didn’t have to go searching for anything more, so just poke around on the links above if you want to redo the calculations.
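
To show what I mean by "just there in plaintext", here is a minimal sketch that reads the rating straight out of the four 有る entries above. The only wrinkle is that the rating is stored as a string, so it needs a `float()` if you want a number. The `records` and `ratings` names are just for this example.

```python
import json

# The four 有る entries, copied verbatim from the lists above.
records = {
    "Internet": '{"volume":8698,"rank":1953,"percentage":0.000034369695640634745,"percentile":0.391009658575058,"rating":"5.0"}',
    "News": '{"volume":912,"rank":6840,"percentage":0.000015193455510598142,"percentile":0.8826722502708435,"rating":"2.0"}',
    "Novels": '{"volume":3064,"rank":1746,"percentage":0.00006929067603778094,"percentile":0.7238367795944214,"rating":"3.5"}',
    "Wikipedia": '{"volume":11121,"rank":5838,"percentage":0.000020771049094037153,"percentile":0.7841780781745911,"rating":"3.0"}',
}

# No recalculation: parse each entry and read its stored rating.
ratings = {
    corpus: float(json.loads(raw)["rating"]) for corpus, raw in records.items()
}
print(ratings)
```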

And yes, all of those lists have non-WK words removed. I do so after calculating the scores, to limit the amount of storage and bandwidth required, to simplify lookup, and to minimize the performance impact of lookup for display.

Oh, speaking of bandwidth limiting, it also occurred to me that it would probably be best if we considered going with TSV rather than JSON to reduce the size of the files as much as possible. It’s probably of minimal performance impact to do the conversion, but it might genuinely reduce the size of the files by as much as half. Something to consider…
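
A quick way to check the size difference, as a sketch only: it serializes the four 有る entries from above both ways and compares byte counts. Keying the rows by corpus name is just for the demo (a real file would be keyed by word), and how much is saved on the full lists is something I have not measured.

```python
import json

fields = ("volume", "rank", "percentage", "percentile", "rating")

# The four 有る entries from above; the first column is a demo key.
rows = [
    ("Internet", 8698, 1953, 0.000034369695640634745, 0.391009658575058, "5.0"),
    ("News", 912, 6840, 0.000015193455510598142, 0.8826722502708435, "2.0"),
    ("Novels", 3064, 1746, 0.00006929067603778094, 0.7238367795944214, "3.5"),
    ("Wikipedia", 11121, 5838, 0.000020771049094037153, 0.7841780781745911, "3.0"),
]

# Compact JSON: field names are repeated in every record.
as_json = json.dumps(
    {key: dict(zip(fields, values)) for key, *values in rows},
    ensure_ascii=False,
    separators=(",", ":"),
)

# TSV: field names appear once, in the header row.
lines = ["\t".join(("key",) + fields)]
lines += ["\t".join(str(v) for v in row) for row in rows]
as_tsv = "\n".join(lines)

json_bytes = len(as_json.encode("utf-8"))
tsv_bytes = len(as_tsv.encode("utf-8"))
print(json_bytes, tsv_bytes, round(tsv_bytes / json_bytes, 2))
```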

That explains things.

It’s not that I “want” to, it’s that the script I’ve written to process the lists treats all the files in the same way. The only difference is in the first step, when it initially reads a file in, since they’re all in different formats. I treat all of them identically after that, indexing on the character/word and calculating scores. Since not all files (obviously) have your score in them, I must calculate it myself, so I just do it for all of them.
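
Roughly, the structure looks like the sketch below. The two input layouts, the function names and the sample lines are all invented for illustration; the real files have their own formats and the real script also calculates the scores.

```python
import io


def read_tsv(handle):
    """Yield (token, count) from 'token<TAB>count' lines (hypothetical layout)."""
    for line in handle:
        line = line.rstrip("\n")
        if line:
            token, count = line.split("\t")
            yield token, int(count)


def read_ranked(handle):
    """Yield (token, count) from 'rank count token' lines (hypothetical layout)."""
    for line in handle:
        parts = line.split()
        if len(parts) == 3:
            _rank, count, token = parts
            yield token, int(count)


def build_index(pairs):
    """Common step: index by token, then derive rank and relative frequency."""
    counts = dict(pairs)
    total = sum(counts.values())
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {
        token: {"rank": rank, "occur": counts[token], "freq": counts[token] / total}
        for rank, token in enumerate(ordered, start=1)
    }


# Made-up sample input in two different layouts.
tsv_file = io.StringIO("上\t120\n下\t80\n")
ranked_file = io.StringIO("1 120 上\n2 80 下\n")

# Only the reader differs; the indexing step is shared.
index_a = build_index(read_tsv(tsv_file))
index_b = build_index(read_ranked(ranked_file))
print(index_a == index_b, index_a["上"])
```

Special handling for a file that already carries a score would mean a branch in that shared step, which is what I would like to avoid.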

Of course I can work around the issue by adding special handling for your files, but just running the original files would be easier. I’ll poke around the thread.

I’m not (yet) concerned about the file size. If I don’t pretty-print the file and I remove everything except the scores, it’s under 1MB, and as I mentioned before the file is only going to be downloaded once and then cached in the browser indefinitely. Changing it to a CSV (or TSV or *SV) would certainly shrink it further, but it doesn’t seem worth the cost of the additional script processing required on the file.

Wow, it is a great idea. Super useful.

I don’t know if it’s possible, but it would be cool if it could somehow check the words in Google and rank them by how many pages they appear on.

I’m always impressed by how great the wk community is.

I’ve done this and I want to double check. Do these ‘internal’ names from you match up with the sources you used?

internet-words: http://corpus.leeds.ac.uk/frqc/internet-jp.num
news-words: ???
novels-words: “base_aggregates.zip” from http://ftp.monash.edu.au/pub/nihongo/00INDEX.html
wikipedia-words: Is this the wiktionary (not wikipedia) list? https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/Japanese2015_10000

Looks about right. news-words is from the same place as the novels, and it’s titled “edict_dupefree_freq_distribution”. Careful with the wiktionary one to get the full 95 MB version, not the 20k abridged one.

Will do. It’ll probably be some time tomorrow before I report in, as I’ll have to duplicate some of your work – preprocessing to remove punctuation and such. Also, full disclosure, I started the early access Mass Effect demo/preview yesterday and really like it, so it’s going to be eating into my free time for a bit. Apologies in advance! :smiley:

Sorry if this has been discussed, but is there any such thing as a “conversation corpus”?

I mean… does the Twitter corpus count?

I guess it’s the closest thing, but not really. People don’t write on Twitter exactly the same way they talk in conversation. I just wasn’t sure if any research had been done on conversation word frequency.

Maybe. Although, to be clear, the word “corpus” implies a body of written work, so if you’re googling around you might try something more like “conversation word frequency list”, because corpora are always written.

Ok, I’ve got the leeds file in and processed and the kamerman novels file is running now. I’ll compare the data I get from calculating the scores to the values you provided when that is finished. The wiktionary file I already have, it’s the one listed as “wikipedia-20150422-lemmas”, which is the filename.

I need to finish cleaning up the js and get the display in place on the lesson and review pages, as it’s still only showing on the vocab & kanji pages you see above, then it’ll be ready for a first release. I want to figure out if I can make the stars look better (the outline is a partially functional hack) and also more cleanly eliminate (but still list) corpora that don’t have the item in them, but those things probably shouldn’t delay a release.

That’s all assuming my score lines up with yours when I’m done processing this file. :wink:

I still have some discrepancies here I can’t explain, but it looks like they’re related to differences in processing the raw file. For example, this is the entry from the wikipedia words file you sent me:

    "有る": {
        "volume": 11121,
        "rank": 5853,
        "percentage": 1.6612895706203e-5,
        "percentile": 0.82738345861435,
        "rating": "2.5"
    }

When I process the raw file myself (wikipedia-20150422-lemmas.tsv) I end up with:

    "有る": {
        "rank": 5853,
        "occur": 11121,
        "freq": 1.6699924157927e-5,
        "perc": 0.988295,
        "irank": 10,
        "srank": 9
    }

It’s obvious that we’re calculating the frequency a little differently, and the percentile very differently. This might be due to how many entries we both used from the main file. I used the first 500,000, did you use more or less? If you used significantly less, I could see your percentile and thus rating for this word falling by quite a bit.
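
To show how much the cutoff can matter, here is a small sketch. It assumes a purely rank-based percentile of 1 - rank/N, which is an assumption for illustration (it lands close to my 0.988295 above for N = 500,000, but I am not claiming it is what your files use). The smaller cutoffs are hypothetical.

```python
def rank_percentile(rank, total_entries):
    """Fraction of entries ranked below this one: 1 - rank / N (assumed formula)."""
    return 1 - rank / total_entries


rank = 5853  # rank of 有る in both of our files

# 500,000 is the cutoff I used; the other values are hypothetical.
for n in (500_000, 100_000, 20_000, 10_000):
    print(f"N={n:>7,}: percentile={rank_percentile(rank, n):.6f}")
```

With the same rank, a shorter list pushes the percentile down quickly, so a different cutoff alone could move the rating by several steps.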

It took a little while for some new lessons to come up for me to test with (full disclosure: I completely forgot the last time or two). Ignoring the drop shadow white/grey boxing for a minute, as well as the scores, there’s a bit of an issue with showing the data on the lesson page. A picture is worth 1000 words, so…

This may be a dumb question, but some of the words in WaniKani are shown in kanji even though they are often not written using kanji. To name a few: 「大した」, 「有る」, 「平仮名」, 「片仮名」, 「大体」, etc.

If the search you are doing in these corpora looks for the kanji string, some very common words will have artificially low values (such as 「有る」 not being found in Wikipedia).

You might have already considered this (in which case feel free to ignore this), but it’s worth pointing out if not.

Not a stupid question at all. It really is a problem. Unfortunately, each corpus builder will have addressed that as they saw fit, and most of them don’t say how they handled it. For the time being, I’ve just acted naively, and treated them as separate words.
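
One possible way to at least surface the problem, as a sketch only: look up the kana spelling alongside the kanji one and keep the two counts separate. The mapping is hand-written for a few of the words mentioned above, the counts are invented, and `lookup_with_variants` is a hypothetical helper, not something in the script.

```python
# Hand-written kana spellings for a few of the WK items mentioned above.
KANA_FORMS = {"有る": "ある", "平仮名": "ひらがな", "片仮名": "カタカナ"}


def lookup_with_variants(token, counts, kana_forms=KANA_FORMS):
    """Return counts for the kanji spelling and its kana spelling, kept separate."""
    kana = kana_forms.get(token)
    return {
        "kanji": counts.get(token, 0),
        "kana": counts.get(kana, 0) if kana else 0,
    }


# Made-up counts, purely for illustration.
counts = {"有る": 12, "ある": 3400, "ひらがな": 50}

print(lookup_with_variants("有る", counts))
print(lookup_with_variants("平仮名", counts))
```

I would not simply add the two numbers together, since a kana string like ある can also stand for other words (或る, for example), so the sum would over-count.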