Anki Word Frequency Inserter: Learn most common words first

@Kumirei @NicoleIsEnough The Frequency Inserter using the BCCWJ corpus (Balanced Corpus of Contemporary Written Japanese, relative frequency) is done:

(separate URL for now)

Note that the first visit takes about 5.8MB to download, since the BCCWJ corpus is 5MB zipped (19MB uncompressed).

For me it only works offline, not online.
For the offline version, just download the repository (Code → Download ZIP) and open index_BCCWJ.html.

Example:

BCCWJ uses relative frequency, i.e. a rank, so 2938 = 2938th most common word


Observations

I tested it on my test deck: 935 frequencies were found in the BCCWJ corpus, compared to 1045 in the InnocentCorpus:

I also applied this to my main deck already, no issues as far as I can see (1174 frequencies total, while I have 1340 from the InnocentCorpus).

In Anki we can see that the most common words are mostly correlated between the corpuses, though there are some differences:
(note that InnocentCorpus frequency is absolute, so bigger = more frequent, while BCCWJ = nth most common word)


most common in BCCWJ:

It would be interesting to see the largest differences between the corpuses. For that we could convert InnocentCorpus into relative frequency, then sort by largest difference to BCCWJ (via javascript).
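
A minimal sketch of that comparison, assuming both corpora are available as plain word → number maps. The function names `countsToRanks` and `largestRankDifferences` are made up for illustration and are not part of frequencyInserter.js. The sample numbers are the ones quoted under Oddities in this post, and the InnocentCorpus ranks are only relative to this three-word sample, not to the full corpus.

```javascript
// Convert absolute counts (bigger = more frequent) into ranks (1 = most frequent).
// Simplification: tied counts get consecutive ranks in sort order.
function countsToRanks(counts) {
  const sorted = [...counts.entries()].sort((a, b) => b[1] - a[1]);
  return new Map(sorted.map(([word], i) => [word, i + 1]));
}

// For words present in both rank tables, list them by largest absolute rank difference.
function largestRankDifferences(ranksA, ranksB) {
  const rows = [];
  for (const [word, a] of ranksA) {
    if (!ranksB.has(word)) continue; // skip words missing from the other corpus
    const b = ranksB.get(word);
    rows.push({ word, a, b, diff: Math.abs(a - b) });
  }
  return rows.sort((x, y) => y.diff - x.diff);
}

// Sample data taken from the numbers quoted in this post.
const innocentCounts = new Map([["有様", 9276], ["白状", 4611], ["人性", 77]]);
const bccwjRanks = new Map([["有様", 536048], ["白状", 160807], ["人性", 135068]]);

const differences = largestRankDifferences(countsToRanks(innocentCounts), bccwjRanks);
console.log(differences);
```

With full corpora, raw rank differences will probably be dominated by the long tail of rare words, so comparing the ratio of the two ranks might give a more useful ordering.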


Oddities
There are a few odd things about the BCCWJ corpus:

  • Some common words are not in the corpus:
    • 見知らぬ (InnocentCorpus 5147, “common” on jisho.org), not in any form given on jisho.org
    • 真に受ける (InnocentCorpus 829)
  • Some common words are only given in a less common form according to jisho.org:
    • 日付 (N3, InnocentCorpus 4096, “common” on jisho.org) → 日付け (4613)
    • 見つかる (N4, also not in InnocentCorpus) → 見付かる (1117)
    • お供 (N1, InnocentCorpus 3418) → 御供 (13786)
    • etc etc
    • jisho.org may not be accurate as to which form is the most common, but it’s still odd it’s different to this corpus. Maybe InnocentCorpus has taken the same forms as jisho.org
  • Some relatively common words have a very high rank number, i.e. they are listed as rare:
    • 有様 is rank 536048 (least common / highest rank in the corpus), N1+common on jisho.org, occurs 9276 times in InnocentCorpus
    • 白状 is rank 160807, occurs 4611 times in InnocentCorpus
    • 人性 is rank 135068, and a WK level 14 word. It doesn’t seem like the 135068th most common word. (Though frequency 77 in InnocentCorpus)
    • etc

So, if at any point you have another corpus to recommend, feel free to tell me (=

Though to me it seems like combining InnocentCorpus and BCCWJ in my Anki cards is a nice way to catch these outliers for now, to get a better picture, and to try to learn the most common words from both corpuses.
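
One possible way to act on that, sketched under the assumption that the InnocentCorpus counts have already been converted to ranks: take the better (smaller) of whichever ranks exist for a word. `combinedRank` is a hypothetical helper, not something the inserter currently does.

```javascript
// Return the better (smaller) of the available ranks,
// or null if the word is in neither corpus.
function combinedRank(innocentRank, bccwjRank) {
  const available = [innocentRank, bccwjRank].filter((r) => Number.isFinite(r));
  return available.length > 0 ? Math.min(...available) : null;
}

// 見付かる is not in InnocentCorpus, so only the BCCWJ rank (1117) is used.
console.log(combinedRank(undefined, 1117));
```

Taking the minimum means a word that one corpus misses or ranks oddly low still gets sorted by the corpus that treats it as common.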


Technical note: I’d like to unify the code and html for both corpuses, but I didn’t want the user to need to download both corpuses (BCCWJ is 5MB zipped, Innocent 1.7MB). Loading on the fly seemed like a bit of a hassle: I’d probably need to load a corpus only when the user clicks a radio button or something?
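
A rough sketch of what on-demand loading could look like, assuming each corpus can be fetched from its own URL. The file names in `CORPUS_URLS`, the `createCorpusLoader` helper and the radio input name are all placeholders, not the actual repository layout.

```javascript
// Placeholder file names; the real files may be named differently.
const CORPUS_URLS = { bccwj: "BCCWJ.zip", innocent: "InnocentCorpus.zip" };

// Build a loader that downloads each corpus at most once.
// fetchCorpus(url) is injected so the caching logic works without a browser.
function createCorpusLoader(fetchCorpus) {
  const cache = new Map(); // corpus name -> Promise of the loaded data
  return function load(name) {
    if (!(name in CORPUS_URLS)) {
      throw new Error("Unknown corpus: " + name);
    }
    if (!cache.has(name)) {
      const promise = Promise.resolve(fetchCorpus(CORPUS_URLS[name])).catch((err) => {
        cache.delete(name); // forget failed downloads so a later click can retry
        throw err;
      });
      cache.set(name, promise);
    }
    return cache.get(name);
  };
}

// Possible browser wiring (assumes radio inputs named "corpus"
// with the values "bccwj" and "innocent"):
// const load = createCorpusLoader((url) => fetch(url).then((res) => res.arrayBuffer()));
// document.querySelectorAll('input[name="corpus"]').forEach((radio) =>
//   radio.addEventListener("change", () => load(radio.value)));
```

This way only the selected corpus is downloaded, and switching back to an already loaded corpus reuses the cached result.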

edit: the main code (frequencyInserter.js) is now unified.
