Looking for a dataset of Japanese words by frequency

Hi,

I’m writing a program for myself that takes the most frequent words (either 100 or 1000, I haven’t decided yet), randomly samples a few each day, and tests me on them.
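For illustration, the daily sampling step could look roughly like the sketch below. It assumes the word list is already sorted with the most frequent word first; `daily_sample`, `top_n`, `k`, and the `word1`, `word2`, ... entries are made-up placeholders, not real data. Seeding the random generator with the date keeps the selection stable if the program is run several times on the same day.

```python
import datetime
import random


def daily_sample(words, day, k=5, top_n=1000):
    """Pick k words from the top_n most frequent ones for the given day."""
    pool = words[:top_n]
    # Seed with the date's ordinal so reruns on the same day give the same words.
    rng = random.Random(day.toordinal())
    return rng.sample(pool, min(k, len(pool)))


# Placeholder entries standing in for a real frequency-sorted word list.
words = [f"word{i}" for i in range(1, 21)]

todays_words = daily_sample(words, datetime.date.today(), k=3, top_n=10)
print(todays_words)
```

Switching between the top 100 and the top 1000 is then just a matter of changing `top_n`.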

I haven’t found a good dataset for this. For now I got by with scraping a Wiktionary article on the most frequent words, but it’s not ideal: it works, but it doesn’t show the frequencies or the source, so I can’t really trust it. It’s also a bit of a PITA, since getting any extra info on a word requires opening another page.

Does anyone have anything I could use?

Alternatively, I’ll finish scraping from Wiki and then I can share my own dataset here

Thanks

It may depend on what your most important reading sources will be (newspapers, fiction, or Wikipedia, for example), but if it is going to be based on fiction or books to some degree, there is a novel 5K list on Reddit that may help, since it has frequency distributions and top 6 patterns of all kinds.

The kanji frequency shown on wkstats.com and Jisho (both also free) is based on Japanese newspapers (or Joyo/JLPT if that is of any help), but it doesn’t help with vocabulary by itself. You could still work backwards from that and the vocab top 6 from the novel 5K if you want to focus on newspapers, but it won’t be a “most frequent 100/1000” list (it would be a decent approximation though). However, at that point, you may be better served with finding a standard 1K list - an Anki deck for example - since it would be at least pretty similar as well.
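As a rough sketch of what "working backwards" from kanji frequency could mean, the snippet below orders words by their rarest kanji. This is only one possible heuristic, not something taken from any of the sites mentioned; `rank_words_by_kanji` is a hypothetical helper and the ranks in `kanji_rank` are made-up example numbers, not real newspaper data.

```python
def kanji_in(word):
    """Return the characters of word that fall in the basic CJK ideograph block."""
    return [ch for ch in word if "\u4e00" <= ch <= "\u9fff"]


def rank_words_by_kanji(words, kanji_rank):
    """Sort words so that those whose rarest kanji is most common come first."""
    unknown = len(kanji_rank) + 1  # kanji missing from the table count as rarest

    def score(word):
        ranks = [kanji_rank.get(k, unknown) for k in kanji_in(word)]
        # Kana-only words carry no kanji information; this sketch puts them first.
        return max(ranks) if ranks else 0

    return sorted(words, key=score)


# Made-up ranks for illustration only (1 = most frequent).
kanji_rank = {"日": 1, "本": 2, "語": 3}
ordered = rank_words_by_kanji(["語", "日本", "日本語", "かな"], kanji_rank)
print(ordered)
```

This only approximates word frequency, since a word made of common kanji can still be rare.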

Hope this helps, or at least generates some food for thought!

Here are the sources used by the Wanikani Open Framework JLPT, Joyo, and Frequency filters. I am not sure whether this data is up to date.

JLPT Source: JLPT Kanji
Joyo Source: List of jōyō kanji - Wikipedia
Total frequency Source: Kanji orderd by frequency of use - Kanjicards.org
NHK frequency Source: kanji data for NHK News Web Easy · GitHub
Other frequency sources: topokanji/data/kanji-frequency at master · scriptin/topokanji · GitHub