I have downloaded the two new files and am incorporating them into the combined file now. I do have one question, though: did you eliminate the non-WK words from these as well? If so, what is the source for these two? Removing the non-WK words before calculating the ranking is like discarding numbers outside of a certain range before calculating an average: it invalidates the calculation. In this case, it will (most likely) hurt the ranking of every item in the list.
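To illustrate what I mean, here's a toy example in Python (made-up words and frequencies, not the real data): a word's percentile rank is the fraction of the list below it, so throwing away the low-frequency non-WK words drags every surviving word's percentile down.

```python
# Toy corpus of (word, frequency) pairs; pretend only "up" and "person" are WK words.
corpus = [("up", 900), ("the", 800), ("person", 500), ("xyzzy", 100), ("qux", 50)]

def percentile(word, items):
    """Fraction of items with a strictly lower frequency than `word`."""
    freq = dict(items)[word]
    return sum(1 for _, f in items if f < freq) / len(items)

wk_only = [(w, f) for w, f in corpus if w in ("up", "person")]

print(percentile("person", corpus))   # 0.4 against the full corpus
print(percentile("person", wk_only))  # 0.0 once the non-WK words are removed
```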
The combined file I’ve generated has two scores in it, and can easily be extended to more or reduced to just one. The two scores are my 11-step percentile-based calculation and the formula you gave me approximating your manual division.
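For reference, the 11-step calculation is essentially this (a sketch only; the real script's exact bin edges and tie handling may differ):

```python
def score(word, freq_list):
    """Map a word's frequency percentile in a corpus to an 11-step score:
    -1 if the word is absent, otherwise 0 (rare) through 10 (5 stars)."""
    freqs = dict(freq_list)
    if word not in freqs:
        return -1
    below = sum(1 for f in freqs.values() if f < freqs[word])
    pct = below / len(freqs)           # percentile, in [0, 1)
    return min(int(pct * 11), 10)      # 11 equal-width steps mapped to 0..10
```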
When I run my test lookup against all eight of the corpora I have now, this is the result. Scores run from -1 (not present) to 10 (5 stars). Each cell shows two values as mine/yours; a lone -1 means the token is absent from that corpus. The “yours” value is calculated by me from your formula rather than copied from your files, since the pre-computed score only exists in the files you provided, not in the ones I generated myself. (A sketch of the lookup itself follows the file legend below.)
token   file 1   file 2   file 3   file 4   file 5   file 6   file 7   file 8
上      10/10    10/10    10/10    10/10    10/10    10/10    10/10    10/10
下      10/10    10/7     10/9     10/10    10/8     10/9     10/10    10/9
大人    -1       -1       -1       10/10    9/4      10/7     9/5      8/3
一人    -1       -1       -1       9/5      -1       10/6     4/0      0/0
煩      9/5      1/0      2/0      9/4      -1       -1       -1       -1
蛮      9/5      -1       3/0      10/7     -1       -1       -1       -1
file 1 == wikipedia-kanji
file 2 == kanji_easy
file 3 == nozaki
file 4 == wikipedia-20150422-lemmas
file 5 == stella-internet-words
file 6 == stella-news-words
file 7 == stella-novels-words
file 8 == stella-wikipedia-words
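The lookup itself is nothing fancy. Assuming the combined file is tab-separated, with the token followed by the two scores for each corpus in order (that column layout is my guess, not necessarily the final format), it's just:

```python
import csv

def load_combined(path):
    """Read the combined file into {token: [(my_score, your_score), ...]},
    one pair per corpus, in file 1..file 8 order."""
    table = {}
    with open(path, encoding="utf-8") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            token, scores = row[0], [int(s) for s in row[1:]]
            table[token] = list(zip(scores[0::2], scores[1::2]))
    return table

# scores = load_combined("combined.tsv")
# scores["大人"] -> [(-1, -1), (-1, -1), (-1, -1), (10, 10), (9, 4), (10, 7), (9, 5), (8, 3)]
```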
The score I’m calculating from your formula looks a bit off here. I haven’t compared it with what’s in your file, but thinking back to the confusion we had, it’s possible you meant something other than percentile in the formula; if so, I’ll re-run this. I mention it because in our discussion your score typically came out higher than mine (as you were concerned about too many 0-2 star items), but the reverse seems to be true here: my scores are consistently higher whenever the item exists in the corpus.
Of course, if all of your files were sent to me with the non-WK words already removed, that would skew the results in exactly this way.
I absolutely do not care about this. The script will be released under the GPL. You and anyone else are free to do anything you want with it (except release it under a different license, of course).
I can accept that; I just don’t like the unscientific nature of the arbitrary divisions, and I also don’t really understand what you’re after (your “requirements”) with the scores. It sounded to me like you wanted to keep massaging the data to reduce the number of low-scoring items, which seems a bit dishonest and counterproductive. If a word in WK really is unimportant for understanding a corpus, then its score should reflect that; otherwise, what’s the point?
Still-WIP eye candy. Score A is mine; score B is your formula as calculated by me: