Added two more lists to the pile.
One is from a corpus of the entire Internet, and it's probably the lowest quality I'd be willing to use at all. It's bloated with particles, punctuation, and roman characters, and there isn't much I can do to clean it up without really throwing off the quality calculations. It was also collected in 2004. ;_; It's also badly truncated, so it's liable to be the shortest list I put up. That said, it still covers almost 2/3 of the words on WK, and I figure it's relevant, even if the ratings are a bit depressed.
The second one is based on years of 読売 and 毎日 newspaper data, and is very high quality. Probably the best I've seen so far. It comes from 2010, and it covers almost every single word in WK. I stripped out just the particles, and the distributions look great.
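For the curious, the punctuation/roman-character side of the cleanup amounts to something like this. This is a rough Python sketch, not my actual pipeline: the entry format, function name, and sample data are all made up for illustration, and real particle filtering needs a POS tagger rather than a regex.

```python
import re

# Match at least one hiragana, katakana, or common kanji character.
# Entries with none of these (pure punctuation, roman text, etc.) get dropped.
JP_CHAR = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")

def clean_freq_list(entries):
    """entries: list of (word, count) pairs; keep only Japanese-looking words."""
    return [(word, count) for (word, count) in entries if JP_CHAR.search(word)]

# Hypothetical sample: punctuation and roman-character junk get filtered out.
sample = [("食べる", 120), ("、", 999), ("lol", 45), ("猫", 80)]
print(clean_freq_list(sample))  # → [('食べる', 120), ('猫', 80)]
```

This kind of pass only trims the obvious junk; it can't rescue the frequency counts themselves, which is why the quality calculations stay skewed on the Internet list.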
Looking at this data, I'm realizing I want to spend some more time processing it and trying to filter out less useful entries. I don't think I'm going to be able to do much with that Internet data short of going to the source (something that is possible, given that the tool that generated it is available and open source), but that will require a deeper dive.
So what's left for words: we talked about NHK Easy, but given that we have such a superior data set from two other news sources (albeit fully native, non-children's ones), I'm a little unsure how much value it adds, given the jankiness of its provenance. The other one is the vaporware that is my anime corpus. Yeah… I definitely still intend on getting to it, but it's really going to take some time…
I’m going to spend some more time on data cleanup today and then head off to do other things. Hopefully this is enough to go on!