Japanese Vocabulary Corpora

foodhype · April 24, 2020, 8:07pm

I’m trying to draw on the various Japanese vocabulary corpora to create my own core vocabulary list. For example, the Core 10K is an excellent, popular corpus, although its greatest weakness, in my opinion, is that it’s based on vocabulary frequency in newspaper articles. Different people have different theoretical “cores” of vocabulary from which they will get the maximum possible utility, so there is no way to make a universal perfect core deck. However, I think it is possible to roughly approximate this by combining existing corpora in ratios that correspond to where the particular individual expects to derive value. For example, if someone expects to spend 60% of their time using natural speech, 15% studying for the JLPT, 10% reading newspapers, and 15% watching anime, then a combination of existing corpora with weighted ranks could be a decent way to approximate this. Many people accomplish this through word and sentence mining, which is also a good way to go, although it can be time-consuming to set up.
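To make the weighted-rank idea concrete, here is a minimal Python sketch. It assumes each corpus has already been reduced to a list of words ordered from most to least frequent, and it scores words with a weighted reciprocal rank, which is only one of several reasonable ways to merge ranked lists. The function name `combine_ranked_lists`, the smoothing constant `k`, and the tiny word lists are all made up for illustration; only the 60/15/10/15 split comes from the example above.

```python
from collections import defaultdict

def combine_ranked_lists(ranked_lists, weights, k=60):
    """Merge ranked word lists into one list using weighted reciprocal ranks.

    ranked_lists: dict mapping corpus name -> list of words, most frequent first
    weights: dict mapping corpus name -> share of time (need not sum to 1)
    k: smoothing constant (an assumption); larger values flatten the gap
       between top ranks
    """
    total = sum(weights[name] for name in ranked_lists)
    scores = defaultdict(float)
    for name, words in ranked_lists.items():
        share = weights[name] / total  # normalise so the shares sum to 1
        for rank, word in enumerate(words, start=1):
            # A word missing from a corpus simply gets no contribution from it
            scores[word] += share / (k + rank)
    # Highest score first; ties broken by the word itself for a stable order
    return sorted(scores, key=lambda w: (-scores[w], w))

# Toy data, invented purely for illustration (not taken from any real corpus)
ranked_lists = {
    "speech": ["する", "いる", "やばい", "経済"],
    "jlpt": ["する", "経済", "いる"],
    "news": ["経済", "する", "政府"],
    "anime": ["やばい", "する", "いる"],
}
weights = {"speech": 0.60, "jlpt": 0.15, "news": 0.10, "anime": 0.15}

core = combine_ranked_lists(ranked_lists, weights, k=1)
print(core)
```

Words that rank highly in the heavily weighted corpora float to the top, while words that appear only in a lightly weighted corpus sink toward the bottom. A real version would also need to normalise word forms across corpora (dictionary form, kana versus kanji spellings) before merging, which this sketch ignores.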

What corpora do you know about?
Do you know of any decent natural spoken language corpora?
Do you know of any decent JLPT corpora? (Side question: is it possible to see past JLPT exams from after 2010? Before 2010?)

Belthazar · April 24, 2020, 10:30pm

No. They don’t ever release those. Any JLPT lists you find around the place are only best guesses.

foodhype · April 24, 2020, 10:35pm

Got it. Do you know what JLPT indexes are derived from?

Examples: Jisho.org includes the JLPT level next to words; Torii orders vocabulary by JLPT level; WaniKani has JLPT kanji on its stats page; the site “JLPT Resources - Free Japanese Vocabulary lists and MP3 sound files” has “JLPT lists” updated in 2010; etc.

I expect there is probably some index or corpus floating around, and I’m curious to know what it was derived from.

alo · April 24, 2020, 10:44pm

From the JLPT FAQ:

Why is “Test Content Specifications” no longer available after the 2010 revision of the JLPT?

We believe that the ultimate goal of studying Japanese is to use the language to communicate rather than simply memorizing vocabulary, kanji and grammar items. Based on this idea, the JLPT measures “language knowledge such as characters, vocabulary and grammar” as well as “competence to perform communicative tasks by using the language knowledge.” Therefore, we decided that publishing “Test Content Specifications” containing a list of vocabulary, kanji and grammar items was not necessarily appropriate. As information to replace “Summary of Linguistic Competence Required for Each Level” and “Composition of test items” are available. Please also refer to “Sample Questions.”

So the only officially published lists are from 2010 and earlier. Most JLPT lists are based on those, and they’re still a good metric since much of the content shouldn’t have changed radically.

LucasDesu · April 25, 2020, 12:53am

As for corpora, I know of only one, which I use regularly: the Tsukuba Web Corpus. It’s only a general corpus, but you can still use it to find high-frequency collocations for exams like the JLPT.
