Japanese Vocabulary Corpora

I’m trying to derive my own core vocabulary list from the various Japanese vocabulary corpora. For example, the Core 10K is an excellent, popular corpus, although its greatest weakness, in my opinion, is that it’s based on vocabulary frequency in newspaper articles. Different people have different theoretical “cores” of vocabulary from which they will get the maximum possible utility, so there is no way to make a universal perfect core deck. However, I think it is possible to roughly approximate a personal core by combining existing corpora in ratios that correspond to where the particular individual expects to derive value. For example, if someone expects to spend 60% of their time using natural speech, 15% studying for JLPT, 10% reading newspapers, and 15% watching anime, then a combination of existing corpora with weighted ranks could be a decent way to approximate this. Many people accomplish this through word and sentence mining, which is also a good way to go, although it can be time consuming to set up.
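
To make the weighted-rank idea concrete, here is a rough sketch of one way it might work. Everything in it is an assumption for illustration: the function name `combine_ranked_lists`, the choice of a weighted average of ranks, the penalty rank for words missing from a list, and the tiny word lists, which are made up and not taken from any real corpus.

```python
def combine_ranked_lists(corpora, weights):
    """Merge ranked word lists into one list ordered by weighted average rank.

    corpora: dict mapping a corpus name to a list of words, most frequent first.
    weights: dict mapping the same names to the share of time spent in that domain.
    A word missing from a list gets a penalty rank one past the longest list
    (an arbitrary choice made for this sketch).
    """
    total = sum(weights.values())
    missing_rank = max(len(words) for words in corpora.values()) + 1

    # Map each word to its 1-based rank per corpus, keeping the first occurrence.
    rank_maps = {}
    for name, words in corpora.items():
        ranks = {}
        for position, word in enumerate(words, start=1):
            ranks.setdefault(word, position)
        rank_maps[name] = ranks

    vocabulary = set()
    for ranks in rank_maps.values():
        vocabulary.update(ranks)

    scores = {}
    for word in vocabulary:
        scores[word] = sum(
            (weights[name] / total) * rank_maps[name].get(word, missing_rank)
            for name in corpora
        )

    # Lower weighted rank means higher priority; ties are broken alphabetically.
    return sorted(vocabulary, key=lambda w: (scores[w], w))


# Hypothetical toy lists, using the 60/15/10/15 split from the example above.
example_corpora = {
    "speech": ["うん", "食べる", "行く"],
    "jlpt": ["食べる", "行く", "経済"],
    "news": ["経済", "政府", "行く"],
    "anime": ["行く", "うん", "戦う"],
}
example_weights = {"speech": 0.60, "jlpt": 0.15, "news": 0.10, "anime": 0.15}

for word in combine_ranked_lists(example_corpora, example_weights):
    print(word)
```

Averaging raw ranks is only one option. Real lists differ a lot in size, so normalizing ranks per list, or weighting actual frequency counts where a corpus provides them, might work better.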

What corpora do you know about?
Do you know of any decent natural spoken language corpora?
Do you know of any decent JLPT corpora? (Side question: is it possible to see past JLPT exams from after 2010? Before 2010?)

No. They don’t ever release those. Any JLPT lists you find around the place are only best guesses.

Got it. Do you know what JLPT indexes are derived from?

Examples: Jisho.org includes JLPT level next to words; Torii orders vocabulary by JLPT level; Wanikani has JLPT kanji on its stats page; http://www.tanos.co.uk/jlpt/ has “JLPT lists” updated 2010; etc.

I expect there is probably some index or corpus floating around, and I’m curious to know what it was derived from.

From the JLPT FAQ:

So the only officially published list was from 2010 and earlier. Most corpora are based on that, and it’s still a good metric since much of it shouldn’t change radically.

As for corpora, I know of only one that I use regularly: the Tsukuba Web Corpus. It’s only a general corpus, but you can use it to find high-frequency collocations that help with exams like the JLPT.