I’m trying to derive my own core vocabulary list from the various Japanese vocabulary corpora. For example, the Core 10K is an excellent, popular corpus, although its greatest weakness, in my opinion, is that it’s based on vocabulary frequency in newspaper articles. Different people have different theoretical “cores” of vocabulary from which they will get the maximum possible utility, so there is no way to make a universal perfect core deck. However, I think it is possible to roughly approximate a personal core by combining existing corpora in ratios that correspond to where the particular individual expects to derive value. For example, if someone expects to spend 60% of their time using natural speech, 15% studying for JLPT, 10% reading newspapers, and 15% watching anime, then a combination of existing corpora with weighted ranks could be a decent way to approximate this. Many people accomplish this through word and sentence mining, which is also a good way to go, although it can be time-consuming to set up.
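To make the weighted-rank idea concrete, here is a rough Python sketch of one way it could work. Everything in it is an assumption for illustration: the function name `combine_ranked_lists`, the toy word lists, and the choice to score each word by its weighted average rank, with a word that is missing from a corpus treated as ranking just past the end of that corpus. Real corpora differ a lot in size, so the ranks would probably need normalizing before this is useful.

```python
from typing import Dict, List


def combine_ranked_lists(
    corpora: Dict[str, List[str]], weights: Dict[str, float]
) -> List[str]:
    """Merge frequency-ordered word lists by weighted average rank.

    corpora maps a corpus name to its words, most frequent first.
    weights maps the same names to the share of time spent on each.
    A lower combined score means the word should be studied earlier.
    """
    total_weight = sum(weights[name] for name in corpora)

    # Rank lookup per corpus: the first word in a list has rank 1.
    ranks = {
        name: {word: i for i, word in enumerate(words, start=1)}
        for name, words in corpora.items()
    }

    all_words = set()
    for words in corpora.values():
        all_words.update(words)

    scores = {}
    for word in all_words:
        score = 0.0
        for name, words in corpora.items():
            # Assumption: a word absent from a corpus ranks one past its end.
            rank = ranks[name].get(word, len(words) + 1)
            score += weights[name] * rank
        scores[word] = score / total_weight

    # Sort by score; break ties by the word itself so output is deterministic.
    return sorted(all_words, key=lambda w: (scores[w], w))


# Toy data only, not taken from any real corpus.
example_corpora = {
    "speech": ["食べる", "行く", "すごい"],
    "jlpt": ["行く", "経済", "食べる"],
    "news": ["経済", "政府", "行く"],
    "anime": ["すごい", "食べる", "行く"],
}
example_weights = {"speech": 0.60, "jlpt": 0.15, "news": 0.10, "anime": 0.15}

if __name__ == "__main__":
    for word in combine_ranked_lists(example_corpora, example_weights):
        print(word)
```

With the 60/15/10/15 split from my example, words that rank high in the speech list float to the top, and newspaper-only words like 政府 sink to the bottom, which is roughly the behavior I'm after.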
What corpora do you know about?
Do you know of any decent natural spoken language corpora?
Do you know of any decent JLPT corpora? (Side question: is it possible to see past JLPT exams from after 2010? Before 2010?)