Extension Suggestion: Corpora Rankings

Not dead!

Life is going to be extremely complicated through April, so I wouldn’t expect any more progress for a bit after this, but I have done some work this morning that I wanted to share with y’all.

I combined all the vocab data that I want to use (in what I guess will be my version of this extension) and put it on a permanent public host. In case anyone wants it, it’s here. I’ll keep updating it in that location as I move forward, and the bandwidth costs shouldn’t be a problem, since it’s on S3 and very cheap!

Also, I’ve been playing with the data. I was curious what it would show if I took one of the corpora, graphed the distribution of ratings for all of the words in each WK level, and then strung the levels together. So here’s what it looks like:

It’s a bit hard to read, but you can generally see that higher-rated words become fewer and fewer as you go up in levels. There seems to be a real turning point at around level 20-25, where the words go from being mostly 3 stars and above to mostly below 3 stars.
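For anyone who wants to poke at the same question, here is a rough sketch of the per-level calculation. It assumes the data can be boiled down to (level, stars) pairs; the function names and the sample numbers below are made up for illustration and are not the actual format of the hosted file.

```python
from collections import Counter, defaultdict


def rating_distribution_by_level(words):
    """Return {level: {stars: fraction of that level's words}}.

    `words` is an iterable of (level, stars) pairs. That shape is an
    assumption for this sketch, not the layout of the real data file.
    """
    counts = defaultdict(Counter)
    for level, stars in words:
        counts[level][stars] += 1

    dist = {}
    for level in sorted(counts):
        total = sum(counts[level].values())
        dist[level] = {s: counts[level][s] / total for s in sorted(counts[level])}
    return dist


def share_at_or_above(dist, threshold=3):
    """For each level, the fraction of words rated `threshold` stars or more."""
    return {
        level: sum(frac for stars, frac in shares.items() if stars >= threshold)
        for level, shares in dist.items()
    }


# Made-up sample data, only to show the shape of the result.
sample = [(1, 5), (1, 4), (1, 3), (1, 1), (30, 1), (30, 0), (30, 3), (30, 2)]
dist = rating_distribution_by_level(sample)
high_share = share_at_or_above(dist)
```

Plotting `high_share` against level would be one way to look for the point where a level drops below half of its words at 3 stars or more.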

Let me know if you find this stuff interesting and I can post other corpora’s data.



Good work guys! がんばって

I had some inspiration on how to approach the issue of the Anime corpus. You guys may have already come up with this, but I offer it here for your consideration:

Subtitle files

Such files tend to be available for free from various, if admittedly sketchy, third parties, and could be used to build a solid frequency list with minimal parsing.
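To give an idea of how little parsing is involved, here is a minimal sketch that assumes the subtitles are in SRT format (cue number, timestamp line, then dialogue). The function names are hypothetical. Japanese has no spaces between words, so a real run would need a morphological analyzer for the `tokenize` step; the sample below uses pre-spaced text and `str.split` purely as a placeholder.

```python
import re
from collections import Counter

# SRT timestamp line, e.g. "00:00:01,000 --> 00:00:02,500"
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}")
# Inline markup such as <i>...</i> or {\an8}
TAG = re.compile(r"<[^>]+>|\{[^}]*\}")


def srt_dialogue_lines(srt_text):
    """Yield only the dialogue lines of one SRT file's text."""
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blanks, cue numbers, and timestamps. This also drops a
        # dialogue line made up only of digits, which is fine for a sketch.
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        yield TAG.sub("", line)


def frequency_list(srt_texts, tokenize):
    """Count tokens across many subtitle files, most frequent first.

    `tokenize` is any callable mapping a line to a list of words.
    """
    counts = Counter()
    for text in srt_texts:
        for line in srt_dialogue_lines(text):
            counts.update(tokenize(line))
    return counts.most_common()


# Made-up sample, pre-segmented with spaces so str.split can stand in
# for a real tokenizer.
sample = """1
00:00:01,000 --> 00:00:02,500
<i>猫 が 好き</i>

2
00:00:03,000 --> 00:00:04,000
猫 は 猫
"""
freq = frequency_list([sample], tokenize=str.split)
```

Here `freq` starts with the most common token and its count, so the head of the list is what would feed a ranking.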

Really looking forward to the finished product!

Yup. That was the plan. If you look at the original post I wrote “Anime/Drama subs”.

I only know of the one site… kitunekko.net, but if you have more suggestions about where to get reliably good subs I would be very grateful!

