Anki Word Frequency Inserter: Learn most common words first

@Kumirei @NicoleIsEnough You can now set the expression field name and frequency field name:


(UI WIP)

If you have the time to test it, that would be appreciated! (sorry for the delay)

To avoid re-entering the field names on every visit, you can pass them as URL parameters, e.g.:
file:///C:/AnkiFrequencyInserter/index.html?expressionFieldName=Expression&freqFieldName=FreqInnocent
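For anyone curious how such parameters can be picked up on the page side, here is a minimal sketch using the standard URL API. The helper name readFieldNames and the fallback default names are assumptions for illustration, not necessarily what the inserter actually does.

```javascript
// Hypothetical helper: read the two field names from a URL's query string.
// The fallback defaults are assumptions made for this sketch.
function readFieldNames(url, defaults = { expressionFieldName: "Expression", freqFieldName: "Frequency" }) {
  const params = new URL(url).searchParams;
  return {
    // params.get() returns null when the parameter is absent, so ?? picks the default
    expressionFieldName: params.get("expressionFieldName") ?? defaults.expressionFieldName,
    freqFieldName: params.get("freqFieldName") ?? defaults.freqFieldName,
  };
}

const fields = readFieldNames(
  "file:///C:/AnkiFrequencyInserter/index.html?expressionFieldName=Expression&freqFieldName=FreqInnocent"
);
console.log(fields.expressionFieldName, fields.freqFieldName);
```

In a browser you would pass window.location.href instead of a hard-coded string.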

I tested it with these field names in a fresh deck+profile, and it worked.
(I also tested it with the state of my main deck from before I first used the inserter, doing 1029 changes, apparently successfully)

Note that you currently have to use the offline version: click Code → Download ZIP here:
https://github.com/sschmidTU/anki-frequency-inserter
(then just open the index.html)

The online version currently doesn’t work in any browser for me, including Chrome, Edge and Firefox. It started a few months ago with Chrome’s changes to Private Network Access. At that point, it still worked in Edge and Firefox. For details, see:
https://github.com/sschmidTU/anki-frequency-inserter#troubleshooting
(it might be that you can use the online version e.g. by launching Chrome with some permission flags, not sure)

Next I’ll look at the alternative dictionaries with relative frequency you mentioned, Kumirei. Would be great to have the option to use these frequencies instead. (Or you could simply use/insert both)


Works great! I can still use the online version in Firefox 99.0 on Linux Mint


@Kumirei @NicoleIsEnough Frequency Inserter using BCCWJ corpus (Contemporary Written Japanese, relative frequency) is done:

(separate URL for now)

Note that the first visit will take ~5.8MB to download, as the BCCWJ corpus is 5MB zipped, 19MB uncompressed

For me it only works offline, not online.
For the offline version, just download the repository (Code → Download ZIP) and open index_BCCWJ.html.

Example:

BCCWJ uses relative frequency, so 2938 = 2938th most common word


Observations

I tested it for my test deck, and there were 935 frequencies found in the BCCWJ corpus, while there were 1045 in the InnocentCorpus:

I also applied this to my main deck already, no issues as far as I can see (1174 frequencies total, while I have 1340 from the InnocentCorpus).

In Anki we can see that the most common words are mostly correlated between the corpuses, though there are some differences:
(note that InnocentCorpus frequency is absolute, so bigger = more frequent, while BCCWJ = nth most common word)


most common in BCCWJ:

It would be interesting to see the largest differences between the corpuses. For that we could convert InnocentCorpus into relative frequency, then sort by largest difference to BCCWJ (via javascript).
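A rough sketch of that idea: turn absolute counts into ranks, then sort the words found in both corpuses by rank difference. The function names and the tiny placeholder data are made up for illustration. Ranks from corpuses of different sizes are also not strictly comparable, so treat the differences as a hint only.

```javascript
// Convert absolute counts (bigger = more frequent) into ranks (1 = most frequent).
function countsToRanks(counts) {
  const sorted = [...counts.entries()].sort((a, b) => b[1] - a[1]);
  return new Map(sorted.map(([word], i) => [word, i + 1]));
}

// For words present in both rank tables, list them by largest rank difference first.
function largestRankDifferences(ranksA, ranksB) {
  return [...ranksA.keys()]
    .filter((word) => ranksB.has(word))
    .map((word) => ({
      word,
      rankA: ranksA.get(word),
      rankB: ranksB.get(word),
      diff: Math.abs(ranksA.get(word) - ranksB.get(word)),
    }))
    .sort((a, b) => b.diff - a.diff);
}

// Placeholder data (not real corpus values).
const innocentCounts = new Map([["wordA", 900], ["wordB", 500], ["wordC", 10]]);
const bccwjRanks = new Map([["wordA", 3], ["wordB", 2], ["wordC", 1]]);

const innocentRanks = countsToRanks(innocentCounts);
const differences = largestRankDifferences(innocentRanks, bccwjRanks);
console.log(differences);
```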


Oddities
There are a few odd things about the BCCWJ corpus:

  • Some common words are not in the corpus:
    • 見知らぬ (InnocentCorpus 5147, “common” on jisho.org), not in any form given on jisho.org
    • 真に受ける (InnocentCorpus 829)
  • Some common words are only given in a less common form according to jisho.org:
    • 日付 (N3, InnocentCorpus 4096, “common” on jisho.org) → 日付け (4613)
    • 見つかる (N4, also not in InnocentCorpus), → 見付かる (1117)
    • お供 (N1, InnocentCorpus 3418) → 御供 (13786)
    • etc etc
    • jisho.org may not be accurate as to which form is the most common, but it’s still odd it’s different to this corpus. Maybe InnocentCorpus has taken the same forms as jisho.org
  • some relatively common words have a high rank:
    • 有様 is rank 536048 (least common / highest rank in the corpus), N1+common on jisho.org, occurs 9276 times in InnocentCorpus
    • 白状 is rank 160807, occurs 4611 times in InnocentCorpus
    • 人性 is rank 135068, and a WK level 14 word. It doesn’t seem like the 135068th most common word. (Though frequency 77 in InnocentCorpus)
    • etc

So, if at any point you have another corpus to recommend, feel free to tell me (=

Though to me it seems like combining InnocentCorpus and BCCWJ in my Anki cards is a nice way to catch these outliers for now, to get a better picture, and to try to learn the most common words from both corpuses.


Technical note: I’d like to unify the code and HTML for both corpuses, but I didn’t want the user to have to download both corpuses (BCCWJ is 5MB zipped, Innocent 1.7MB). Loading them on the fly seemed like a bit of a hassle: I’d probably need to require the corpus when the user clicks a radio button or something?

edit: the main code (frequencyInserter.js) is now unified.


I love it, thank you!


Great, happy that you like it! :)
I have to say I’m growing to like the BCCWJ frequencies quite a lot! It’s nice to see “xth most common word”, even if some of its ranks are too high or missing compared to InnocentCorpus.

I just added hiragana to katakana conversion and vice versa (using wanakana) in v1.2.4 to try to find more frequencies. It found 4 (BCCWJ) or 5 (InnocentCorpus) more frequencies for me. E.g. BCCWJ has にこにこ but not ニコニコ, and vice versa with InnocentCorpus.
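To illustrate the fallback idea (not the actual wanakana-based code), here is a sketch that converts between the two kana scripts via their fixed Unicode offset and tries each variant in turn. The function names and the frequency value in the example map are made up.

```javascript
// Hiragana (U+3041..U+3096) and katakana (U+30A1..U+30F6) are 0x60 code points apart.
const KANA_OFFSET = 0x60;

function hiraganaToKatakana(text) {
  return text.replace(/[\u3041-\u3096]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) + KANA_OFFSET));
}

function katakanaToHiragana(text) {
  return text.replace(/[\u30A1-\u30F6]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - KANA_OFFSET));
}

// Look the word up as written first, then in the other kana script.
function lookupWithKanaFallback(word, frequencies) {
  const candidates = [word, hiraganaToKatakana(word), katakanaToHiragana(word)];
  for (const candidate of candidates) {
    if (frequencies.has(candidate)) return frequencies.get(candidate);
  }
  return undefined; // not found in any form
}

// Made-up frequency value, only to show the fallback.
const sampleFrequencies = new Map([["にこにこ", 1234]]);
console.log(lookupWithKanaFallback("ニコニコ", sampleFrequencies));
```

A real library like wanakana handles more edge cases, so this is only meant to show the principle.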


Also, there’s a new frequency search field near the bottom.


@Kumirei I added an option to look up the frequency from the Reading field, if no frequency was found. This finds a few more valid frequencies like for といった (と言った wasn’t found).

There’s just one catch: This might in a few cases give a slightly misleading frequency, e.g. for 盗る, which is a rather uncommon way to write 取る, and has a nuance of stealing. とる is the 2396th most common word in BCCWJ, but putting that on 盗る is a bit misleading.


(InnocentCorpus finds 40 new frequencies for me)

This is now available as a new option (checkbox), but disabled by default for now.

(you can also change the name of the reading field searched for with e.g. ankiInserter.ankiReadingFieldName = 'reading' in the console)

What do you think is the best way to handle these cases? One way would be to add info to the note that the frequency came from the reading field, e.g. via tag frequencyIsFromReading or something. Though it could get complicated if you have frequencies from multiple corpuses, then maybe you need a tag for each corpus. Or maybe we put that info (and more) into a new FrequencyMetaInfo field in the note.

For now I recommend first adding all frequencies without the option, then manually checking the new frequencies with it.

By the way, I got rid of the ‘Apply options’ button (even though I kinda liked it),
oninput() is a powerful thing ;)

ps: feel free to share screenshots! I’m curious ^^


That’s an interesting idea. However, I think I prefer getting the frequency of the form it’s in

If I was going to use this (which I might, in the future) I think I would probably have a separate field for it, and then render it conditionally on my cards with a note saying it’s from the reading

Either way when I was inserting the BCCWJ frequencies it found 2505 words without frequencies, and with the reading feature enabled that only dropped to 2411, so at least in my case it’s not a lot of extra frequencies


Minor bug report:

You may want to wrap the query in double quotes to support field names with spaces in them. I named a field BCCWJ Frequency and it worked when I wrapped the query in quotes

ankiInserter.ankiSearchQuery = '"BCCWJ Frequency:*"' 
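As a small sketch of the general fix: build the query from the field name and always wrap it in double quotes, which is harmless for names without spaces. The helper name buildFieldSearchQuery is made up, and field names that themselves contain quotes are not handled here.

```javascript
// Hypothetical helper: build an Anki search query matching notes
// where the given field is non-empty. Quoting keeps names with spaces intact.
function buildFieldSearchQuery(fieldName) {
  return `"${fieldName}:*"`;
}

console.log(buildFieldSearchQuery("BCCWJ Frequency"));
console.log(buildFieldSearchQuery("FreqInnocent"));
```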

This is pretty neat, though


Ah, good idea, thanks! Fixed. I hadn’t thought about that at all ^^

Fair enough, that’s definitely more precise, though personally I just want to know how common an expression is regardless of form, so for と言った I prefer to see the frequency of といった rather than none at all.
Good that this is an option, so everyone can choose :)

Ah, that’s a good idea! I’ll think about that. Maybe I’ll implement it.


Added a “remove invalid frequencies” option, just in case you ever used the “try reading field” option by accident @Kumirei ;)
(or in case I add the option to save the frequencies found by reading field into a different field)

(the new frequency field value will be an empty string of course)


Oh, that would be nice if you only have one field for frequency (I have one for each type of frequency) and you switch to a different source


Oh, that should already have happened in the “notes that will be changed” section.
I’ve fixed that now and added a hidden option to not change existing frequencies (ankiInserter.updateIncorrectFrequencies = false). I’ll probably add a UI option too.

This is now the result if I update from BCCWJ to Innocent frequencies on a small 8 card test deck:

This is the result if I give “FrequencyInnocent” as the field name for BCCWJ frequencies in my main deck:

Due to this change, I also get a few new changes when just checking with my main deck (for BCCWJ):
(previous frequency for だいぶ came from 大分)

These are mostly “correct”, though BCCWJ is being weird by not having いよいよ at all (which is pretty common), only the kanji form 愈, which according to jisho.org is usually written using kana alone. Same for それとも: it only has 其れとも.

Things could actually get more complicated if you want to differentiate for all cases where you have different frequencies found for the hiragana/katakana/kanji version of a word. だいぶ is actually not in BCCWJ, only ダイブ, which is much less common than 大分.
Also: Do you want an option to not look up the katakana version of a hiragana word, or vice versa, either? This might remove some slightly imprecise findings, though it would also e.g. give no frequency for ニコニコ in BCCWJ, which has only にこにこ (exactly opposite to InnocentCorpus).


Added new checkbox.


Also, I fixed furigana stripping: the regex was greedy instead of lazy, so 木[こ]っ端微塵[ぱみじん] was stripped to 木 instead of 木っ端微塵. Oops ;)

(this is only relevant if you have a Furigana field and ankiInserter.tryFuriganaFieldAsKey === true. Maybe I should add a checkbox for that too ^^)
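To show the difference, here is a minimal sketch, assuming the Furigana field uses square brackets as in 木[こ]っ端微塵[ぱみじん]. The two patterns are illustrative and may not match the inserter's real regex.

```javascript
const furiganaField = "木[こ]っ端微塵[ぱみじん]";

// Greedy: .* runs from the first "[" to the LAST "]", swallowing っ端微塵 as well.
const greedyResult = furiganaField.replace(/\[.*\]/g, "");

// Lazy: .*? stops at the FIRST "]" after each "[", so only the readings are removed.
const lazyResult = furiganaField.replace(/\[.*?\]/g, "");

console.log(greedyResult); // 木
console.log(lazyResult);   // 木っ端微塵
```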


After some thinking, I’ll add an option to use the most common frequency among all forms of the word enabled. So for それとも, it’ll take the BCCWJ frequency for 其れとも (995) instead of ソレトモ (536048) or 0.
I’ll use that, because the BCCWJ data I have is clearly being weird here.
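Roughly, the selection could look like the sketch below for rank-based data, where a lower number means more common. The function name bestRankAmongForms is made up, and the two ranks are the BCCWJ values mentioned above.

```javascript
// Pick the most common (lowest) rank among all forms of a word.
// Forms that are missing from the corpus are skipped.
function bestRankAmongForms(forms, ranks) {
  let best;
  for (const form of forms) {
    const rank = ranks.get(form);
    if (rank !== undefined && (best === undefined || rank < best)) {
      best = rank;
    }
  }
  return best; // undefined if no form was found
}

const bccwjSample = new Map([["其れとも", 995], ["ソレトモ", 536048]]);
const bestRank = bestRankAmongForms(["それとも", "ソレトモ", "其れとも"], bccwjSample);
console.log(bestRank); // 995
```

For absolute counts like InnocentCorpus, the comparison would flip to take the largest value instead.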


New option is here. Hopefully the last one for today ^^


these are the default options now.

Kumi, you can add the url parameter &takeMostFrequent=0.

(again, I believe this BCCWJ data is wrong by having only 其れとも (995), instead of それとも, see above)

I think these changes are legit ;) (except それとも)

New feature: Show words missing in your Anki cards. (bottom of page)

To be honest, I find it a bit disappointing, because a lot of these words here are pretty boring/obvious and I wouldn’t want them as Anki cards (e.g. スタート or ははは).
I wanted to neatly find a few holes in my vocabulary here.

I probably still can, just need to ignore some words. Or maybe I’ll just add and suspend them? … Or I’ll add a button to ignore these words, which gets saved in local storage… ^^

I thought about not putting this on the main page, but I guess it doesn’t hurt much (WK vocab list is a 26KB download). And I did already find a few neat words.
Maybe this could get more interesting with some more filters like only N2 words or something.


I have not been able to connect with either Chrome or Firefox, online or offline. It’s especially weird since my Yomichan can connect to Anki just fine. What can I do to debug and find the root cause of the problem?


That’s strange! Have you downloaded the repository? Click here → Code → Download, then open index.html or index_BCCWJ.html

Can you open the console (Ctrl+Shift+I) and show the full error log when you click on Connect?

Also, what’s your Anki version? I hope 2.1.4x+. Your AnkiConnect also needs to support API version 6 (it seems hard to figure out the exact AnkiConnect version), though that API version is already many months old.
Maybe you changed the port for your AnkiConnect?
Can you share the contents of your config? Anki → Tools → Addons → AnkiConnect → Config
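If you want to test the connection by hand, here is a small sketch you could paste into the console. It sends AnkiConnect's "version" action as a JSON POST request, assuming the default port 8765 on 127.0.0.1. The helper names are made up for illustration.

```javascript
// Assumed address: AnkiConnect's default port on the local machine.
const ANKI_CONNECT_URL = "http://127.0.0.1:8765";

// Build the JSON body for an AnkiConnect request using API version 6.
function buildAnkiConnectRequest(action, params = {}) {
  return JSON.stringify({ action, version: 6, params });
}

// Send a request and return its result, or throw the error AnkiConnect reports.
async function invokeAnkiConnect(action, params = {}) {
  const response = await fetch(ANKI_CONNECT_URL, {
    method: "POST",
    body: buildAnkiConnectRequest(action, params),
  });
  const { result, error } = await response.json();
  if (error) throw new Error(error);
  return result;
}

// Usage (needs Anki running with AnkiConnect installed):
// invokeAnkiConnect("version").then((v) => console.log("API version:", v)).catch(console.error);
```

If the request is blocked, the console error usually shows whether it is a CORS problem or a connection problem.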

I downloaded and set up Anki a couple weeks ago so the version is recent.

When trying to run online in Firefox I get these errors:

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://localhost:8765/. (Reason: CORS header ‘Access-Control-Allow-Origin’ does not match ‘http://localhost’).

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://localhost:8765/. (Reason: CORS request did not succeed). Status code: (null).

Even though it was in the instructions for the offline case, I added the “null” to the AnkiConnect config, but it did not help. This is my current config:


{
    "apiKey": null,
    "apiLogPath": null,
    "ignoreOriginList": [],
    "webBindAddress": "0.0.0.0",
    "webBindPort": 8765,
    "webCorsOriginList": [
        "http://localhost",
        "null"
    ]
}

Have you tried setting “webBindAddress” to “127.0.0.1”? That’s my setting, not sure if I changed it.

CORS is a known general issue with this, but it should be possible to work around, at least in the offline version.
I also just tried it with Firefox, and it works for me. Latest Firefox version, 102.0.1.
Of course, it would be great if it works for everyone.

It might be that you have some strict security settings in Firefox that prevent this.

The issue is that this website tries to open a connection to AnkiConnect, which is technically a different address in your local network, and modern browser standards don’t like that, because in general it could be abused. But of course, with AnkiConnect, the worst thing that could happen is changing your Anki cards.

The ideal solution would probably be turning this into an Anki addon, then we have easy and full control. But that would be a big rewrite, and I liked the flexibility of the browser and making my own interface.

By the way, for Kumirei, it even works online in Firefox (on Linux).

The webBindAddress change fixed the offline version, thanks.

However I’ve noticed that one of my cards for 残らず was given a frequency of 3 using the Innocent Corpus, which doesn’t seem right?



That’s what InnocentCorpus says, also when I use yomichan (which uses/popularized InnocentCorpus).

Yes, it’s strange, it should be much more common. But both corpuses have weird spots like these. I’m guessing it depends on how they decided what word forms to include in the corpus or to unify, and they might have given the occurrences to 残る instead (which has ~91000 in InnocentCorpus).
残らず is basically just a conjugation of that, and the corpus doesn’t add or fully consider every conjugation of every word, that would be impractical.

I’ll add the part about webBindAddress to the instructions, thanks!

Also, try the BCCWJ corpus, which uses relative frequency (100 = 100th most common word). (index_BCCWJ.html) - I’m using it to prioritize the most common 15k or so words currently.