Sorting WaniKani by usage frequency


#1

While working on KameSame, I’ve become really interested in finding a better sort order for a “re-study” pass through the WK curriculum of vocab & kanji. One thing I’ve learned at level ~53 is that the vocab WK introduces at higher levels comes across as a bit stilted/academic/unusual to my conversation partners. I think that’s because every WK vocab word is there because a kanji calls for it, and some joyo kanji aren’t very 常用 at all.

As a result, I’ve started looking into ways to establish a sort based on practical usage. Unfortunately, every social media & search API heavily restricts the kind of aggregate searching I’d need to find this out. While I look into academic corpora like ninjal’s, I decided to pay for the Azure/Bing search API and simply ask for the number of search results for each term in WK.
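
For anyone curious what that kind of ranking looks like, here is a minimal sketch in Python. It is not the actual KameSame code: `rank_by_hits` and `count_hits` are hypothetical names, and the hit counts are invented stand-ins for whatever a real search API would return.

```python
from typing import Callable, Iterable


def rank_by_hits(
    terms: Iterable[str], count_hits: Callable[[str], int]
) -> list[tuple[str, int]]:
    """Return (term, hits) pairs sorted from most to fewest hits.

    count_hits is a hypothetical callable that returns the number of
    search results for a term; a real version would call a search API.
    """
    counts = {term: count_hits(term) for term in terms}
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


# Invented numbers, used here in place of real API responses.
fake_counts = {"食べる": 1200, "憂鬱": 40, "学校": 900}
ranked = rank_by_hits(fake_counts, lambda term: fake_counts[term])

for term, hits in ranked:
    print(term, hits)
```

Passing the lookup in as a function keeps the sorting separate from the API call, so the same ranking could be reused with a corpus-based count later.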

Here are the results for you to take a look at:

https://gist.github.com/searls/b6aad683d33096bc8e24ca7b8c929236

What do you think of these? There are some very obvious false positives, especially near the top, and some false negatives (for instance, cases where WK uses kanji for words that Japanese speakers almost always write in hiragana).


#2

You could look at the free BCCWJ word frequency list for comparison [here]. It separates the occurrence count by source type, so you can exclude sources like legal documents and textbooks if you want.
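
As a rough sketch of that filtering idea in Python: the column names below (`word`, `books`, `blogs`, `law`, `textbooks`) and the counts are made up for illustration and are not the real BCCWJ headers, so they would need to be swapped for whatever the actual file uses.

```python
import csv
import io


def frequency_excluding(rows, word_column, excluded_sources):
    """Sum per-source counts for each word, skipping the excluded sources."""
    totals = {}
    for row in rows:
        total = sum(
            int(value)
            for column, value in row.items()
            if column != word_column and column not in excluded_sources
        )
        word = row[word_column]
        totals[word] = totals.get(word, 0) + total
    return totals


# Tiny invented sample in a tab-separated layout (hypothetical columns).
sample = (
    "word\tbooks\tblogs\tlaw\ttextbooks\n"
    "食べる\t50\t70\t0\t5\n"
    "条約\t10\t1\t90\t8\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
totals = frequency_excluding(rows, "word", {"law", "textbooks"})

for word, count in sorted(totals.items(), key=lambda pair: pair[1], reverse=True):
    print(word, count)
```

With the legal and textbook columns left out, a word that shows up mostly in legal text drops well below an everyday word in this toy sample.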

I assume you’re familiar with the BCCWJ since you know about ninjal.