Extension Suggestion: Corpora Rankings


#1

During lessons it’s often very difficult to tell whether or not a word is actually all that commonly used.  I was thinking: there’s a handful of corpora out there that give huge lists of words and their frequency in various kinds of documents.  What if there was an extension that made use of those lists and provided a selection showing how common a word was in different kinds of documents?  It might look something like:

  • Corpus: Rank
  • NHK News: 115
  • Aozora Bunko: 203
  • Wikipedia: 37
  • Twitter: 293
  • Anime/Drama subs: 448
Then, at a glance, you could tell that this is a word that’s important to know, as opposed to a word that’s consistently ranked over 10k, or a word that’s super rare on Wikipedia but in every anime ever made, and you could tailor your studies a bit to focus on words that are common in general and/or common in a kind of content you care about.  Mostly, because you pretty much have to study everything eventually, it would just tell me how much to worry about screwing up a particular word, and whether I need to go back and figure out a new mnemonic for myself.

#2

I am… thinking about how to implement this. Sounds fun.


#3

You could do all of the processing of the corpuses (corpi?) ahead of time and just include the list of words actually in WaniKani. The main difficulty would just be managing tampermonkey api (or whatever) and obtaining and processing the corpuses. Seriously if someone else would actually write the thing, I would do the data processing and provide lists in whatever form was convenient. Probably just some huge json blobs or whatever…
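
A rough sketch of what that ahead-of-time processing could look like, assuming each corpus arrives as a list of word/count pairs (the `FrequencyEntry` shape, the function name, and the sample data are all made up for illustration; the real lists may look quite different):

```typescript
// Hypothetical shapes: neither the corpus format nor the WaniKani word list
// format has been settled in this thread.
type FrequencyEntry = { word: string; count: number };
type RankTable = Record<string, number>;

/** Rank every corpus word by descending count, then keep only the wanted words. */
function buildRankTable(corpus: FrequencyEntry[], wanted: Set<string>): RankTable {
  // Sort a copy so the caller's array is left untouched.
  const sorted = [...corpus].sort((a, b) => b.count - a.count);
  const rankTable: RankTable = {};
  sorted.forEach((entry, index) => {
    // The rank is the position in the full corpus, not in the filtered list.
    if (wanted.has(entry.word)) {
      rankTable[entry.word] = index + 1;
    }
  });
  return rankTable;
}

// Tiny made-up corpus; the counts are invented.
const sampleCorpus: FrequencyEntry[] = [
  { word: "する", count: 900 },
  { word: "大人", count: 120 },
  { word: "の", count: 5000 },
  { word: "入口", count: 40 },
];
const wanikaniWords = new Set(["大人", "入口"]);
const rankBlob = JSON.stringify(buildRankTable(sampleCorpus, wanikaniWords));
```

Keeping the rank from the full corpus means a word ranked 3rd overall still shows as 3 even after everything outside WaniKani is thrown away.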


#4

Actually that was my main question; where to obtain the data. Most likely I would just toss it up on an AWS host or something, as forcing it into the script itself seems really bloated, but maybe it wouldn’t be so bad; depends on how big it ends up being. The script could check online with a simple lookup, or just download the files itself the first time it’s run if it’s too large to pack into the script. The code itself seems like it would be pretty simple though, no problem there.
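
For the "download it the first time it's run" option, a minimal sketch under some assumptions: the `KeyValueStore` and `Downloader` types are hypothetical stand-ins for whatever storage and HTTP calls the script manager provides, and the URL is a placeholder, not a real host.

```typescript
// Hypothetical interfaces; a real userscript would back these with its
// script manager's storage and request functions.
interface KeyValueStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}
type Downloader = (url: string) => Promise<string>;
type RankTable = Record<string, number>;

/** Return the cached table if present, otherwise download it once and cache it. */
async function loadRankTable(
  url: string,
  store: KeyValueStore,
  download: Downloader,
): Promise<RankTable> {
  const cached = store.get(url);
  if (cached !== undefined) {
    return JSON.parse(cached) as RankTable;
  }
  const body = await download(url);
  store.set(url, body);
  return JSON.parse(body) as RankTable;
}

// In-memory stand-ins so the sketch runs without a browser or a network.
const memoryCache = new Map<string, string>();
const memoryStore: KeyValueStore = {
  get: (key) => memoryCache.get(key),
  set: (key, value) => { memoryCache.set(key, value); },
};
let downloadCount = 0;
const fakeDownload: Downloader = async () => {
  downloadCount += 1;
  return JSON.stringify({ "大人": 3 });
};
const placeholderUrl = "https://example.invalid/wikipedia-ranks.json";
```

Calling `loadRankTable(placeholderUrl, memoryStore, fakeDownload)` twice should hit `fakeDownload` only on the first call; the second call is served from the store.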


#5

Well, we could solve that problem when we get to it. Remember there are only something like 6k 言葉 in WK, so even with 5-10 corpuses, the entire thing will probably be below a MB. It depends on script size limits on GreasyFork… Anyway, we could figure it out if you like. If you’re down I could grab one of the corpuses and script up an example json blob (maybe a newspaper one, those are easy to find) over the weekend, if you feel like you’ve got the JS chops and are down to learn the tampermonkey/greasy fork APIs.
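
A back-of-the-envelope version of that size estimate, assuming one combined JSON object shaped like `{"言葉":[115,203,...],...}`. Every figure passed in is a guess for illustration (3-kanji words at 3 bytes each in UTF-8, 5-digit ranks), not a measurement of any real list:

```typescript
// Every number below is an assumption for illustration, not a measurement.
interface BlobAssumptions {
  words: number;      // vocabulary items covered
  corpora: number;    // ranks stored per word
  keyBytes: number;   // average UTF-8 bytes per word
  rankDigits: number; // average digits per rank
}

/** Estimate the byte size of a {"word":[rank,rank,...],...} JSON blob. */
function estimateBlobBytes(a: BlobAssumptions): number {
  const perWord =
    a.keyBytes + 2 +            // the word plus its two quotes
    1 +                         // colon
    2 +                         // square brackets around the ranks
    a.corpora * a.rankDigits +  // the ranks themselves
    (a.corpora - 1) +           // commas between ranks
    1;                          // comma after the entry
  return a.words * perWord;
}

const estimatedBytes = estimateBlobBytes({
  words: 6000,
  corpora: 10,
  keyBytes: 9,
  rankDigits: 5,
});
const estimatedKiB = Math.round(estimatedBytes / 1024); // 444000 bytes, about 434 KiB
```

Under those guesses even ten corpora come in at well under half a megabyte, before any compression.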

The hard one would be the subtitle corpuses. I’d basically have to generate one completely from scratch, but I have some reasonable ideas about how to go about that, and I’ve done scripting work with subs before.


#6

I love this idea - it’d be a great way to stop myself from careering between “80 year old man” to “normal” and back to “samurai/yakuza” every time I open my mouth.


#7

I have just learned a very important detail about this project that will require rethinking the entire premise: the plural of corpus… is corpora. That is all.


#8

Alright, so with just a little bit of digging around on the Internet during the daytime while I should be working, the stats site had a link to this guy, which contains the Wikipedia, Aozora Bunko (novels), Twitter and news frequency lists. By the way, if you are at all concerned about size, check out the size of those blobs, and note that they are each several times larger than the set we will be using. There is also a link there to the NHK Easy News frequency list and a second take on the news frequency list. I can start with any one of those lists or we could keep digging. Personally I’m inclined to start with Wikipedia, but I’m good with whatever. @alandsidel Does this look good? Are you still interested in doing this?

** EDIT **
Damn… just realized those are kanji lists and not word lists. The hunt continues…

** EDIT **
We can just start with this. Or there’s a couple of other ones here.


#9

I can’t look right now (busy busy day @work) but I started building the scaffolding in tampermonkey last night, so yes, interested. Kanji lists are good too, as the WK class indicates if it’s vocab/kanji/radical/whatever, so there’s no reason to not support as many as possible for this. If we have to only support one, vocab is obviously the most useful. I don’t need you to do any blob manipulation for me, but finding the resources as you’re doing is a definite help! Getting the raw frequency data is the most important thing, the rest is “easy.”


#10

Alright, cool. LMK if there’s anything else I can do. I figured paring down and normalizing the data would be useful, especially for storing the script with its data, but if you’d prefer to consume it whole, there it is, and I can keep digging for more. Although even if you don’t intend to normalize the data, I would recommend against having the script download/cache it from their current homes as they may disappear eventually. An S3 bucket should be easy though. Anyway, thanks for taking this on!

If you want to proceed like this though, then I’ll start working on the subtitles this weekend and try to build the TV/Anime/Drama/movies frequency list.


#11

One more for the bonfire: a frequency list of lemmas for… what looks like all of the internet with .jp domain names. Holy… That oughta be useful.


#12

I’m just holding off on saying to condense the data or not right now since I haven’t quite determined just what I’m going to need/want and how I want to access it. I’ve spent a little time hooking the script in and doing the scaffolding to determine if it’s showing a radical, kanji, or vocab word. I don’t think I can cleanly hook into the ‘next item’ event, so I’ll probably just be polling to watch for a change.
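
Roughly what I mean by polling, as a sketch: `makeChangeWatcher` is a made-up helper, and the `shownItem` variable stands in for whatever the script would actually read off the page.

```typescript
/**
 * Build a function that, each time it is called, re-reads a value and fires
 * the callback only when that value differs from the previous reading.
 */
function makeChangeWatcher<T>(read: () => T, onChange: (value: T) => void): () => void {
  let last: T | undefined;
  let primed = false; // false until the first reading has been taken
  return () => {
    const now = read();
    if (!primed || now !== last) {
      primed = true;
      last = now;
      onChange(now);
    }
  };
}

// Stand-in for the page; in a userscript, read() would inspect the DOM instead.
let shownItem = "大人";
const seenItems: string[] = [];
const checkForNewItem = makeChangeWatcher(
  () => shownItem,
  (item) => { seenItems.push(item); },
);

checkForNewItem();   // first call reports the initial item
checkForNewItem();   // nothing changed, so no callback
shownItem = "入口";
checkForNewItem();   // change detected, callback fires again
// In a browser this would be driven by a timer, e.g. setInterval(checkForNewItem, 250).
```

The 250 ms interval in the comment is arbitrary; anything short enough to feel instant would do.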

That said, my intention is to show the information as a sort of overlay during reviews, inside the same colored box where the item itself is being shown. Is this what you envisioned as well, or are we on different pages? I’m also wondering if it should always be shown, or maybe only when the mouse is hovered over the item, or something else entirely. Any opinion on the UX elements like this?

I haven’t yet explored the links you sent, I’ll do that as soon as I’m officially out of the “office” :wink:


#13

Huh, yeah I was thinking of showing the frequencies during lessons, and on the vocab’s page, rather than during review at all. I was thinking of the information as reference level, a neat little FYI that might guide one’s study, rather than a hint during the reviews. IDK, maybe it would work as a part of the review… now that I think about it, I would also have it in the drop down boxes from the “more info” box. Then if you screwed it up and dropped down the info box you could be like “cool, that wasn’t even important anyway” or “Oh crap, that one’s really important too!”

You might check out the phonetic semantic composition script, that’s one I use and I envisioned this being a lot like that. Just hints for remembering and studying.

Of course you’re free to write it as you please; you’re the one writing it! Thanks for working with me on it!


#14

getboth.jpg. I’ll certainly add it to the vocab page itself, that’s terribly easy. After that, the review and lesson pages look to be pretty much the same deal, so if I do one, there’s no good reason to not do both. I’ll do the vocab one first as v1 since it’ll be by far the easiest and lets me quickly jump to actually working with the data. The workin’ day is done, so I’m going to take a peek at some of the data now and get a feel for the format and what I may want to do to package it.


#15

Something like?


#16

Hmm, what about just having another heading, like “Reading” and “Context Sentences” in that screenshot?


#17

Easy enough, probably easier than what I’ve done already. Top or bottom?


#18

Hmmm… your call, but if it were me, I’d do below reading, and above example sentences.


#19

Yeah I can do that, but it’s kind of “nasty” because I have to just put it at a certain numbered position in the “list”. The list items don’t have IDs, so the best I can do other than making it first or last is to put it in the Nth spot, with the understanding that if the page headings get reordered, it’ll still go in the Nth spot rather than moving with its neighbors.

I could search for the actual text then back out of the tree, but that’s the kind of ugly hack I hate doing when I have a choice, so for now, just making it the 2nd item should be good enough. This code works exactly the same on the Kanji page too, but the headings there are different, so I wonder if it should also be 2nd there (between radicals & readings) or probably after readings?
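
To make the two options concrete, here is a sketch that works on a plain array of heading titles rather than the real page. `insertSection` and the "Frequency" title are made up, and so is the "Meaning" heading in the sample list. It looks for a heading by its text first and only falls back to the fixed Nth position when that text isn't found:

```typescript
/**
 * Insert a new heading right after the heading whose text matches afterText.
 * If no heading matches, fall back to a fixed position in the list.
 */
function insertSection(
  headings: string[],
  newHeading: string,
  afterText: string,
  fallbackIndex: number,
): string[] {
  const found = headings.indexOf(afterText);
  // Clamp the fallback so a short list never leaves a gap.
  const position = found >= 0 ? found + 1 : Math.min(fallbackIndex, headings.length);
  return [...headings.slice(0, position), newHeading, ...headings.slice(position)];
}

// Vocab-style page: "Reading" exists, so the new section follows it.
const vocabHeadings = insertSection(
  ["Meaning", "Reading", "Context Sentences"],
  "Frequency",
  "Reading",
  1,
);

// Kanji-style page: no exact "Reading" heading, so it lands in the 2nd spot.
const kanjiHeadings = insertSection(["Radicals", "Readings"], "Frequency", "Reading", 1);
```

With the text lookup the section moves along with its neighbor if headings get reordered; the fallback keeps today's "2nd item" behavior everywhere else.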


#20

Awesome! Yeah, my brain goes to Scala’s predicate-based collection finders, but obviously that’s not a thing in JS. :-p No worries, and if it’s easier just leave it at the top. I mean, in the end the best solution would involve being able to move it where you want. In the meantime, I believe those sections are reliably positioned, so I would just say [2] and leave it. Sounds good for the Kanji stuff!