That’s good feedback I guess!
With everything I have in place to make it easy to find words to add to Anki, somehow I’ve managed to not add very many words to Anki lately…
Rather than focus my efforts on adding more cards for words to learn, I’ve spent my time putting together some spreadsheet formulas to measure my progress.
Of course, I’m at the point where these numbers will change very slowly, as I’ve learned most of the highest of the high-frequency words/kanji already. Thus the utility of these is minimal…
These numbers exclude words that appear only one or two times, to try and reduce the impact of instances of OCR errors. (I should exclude that logic from Yoru Cafe and Final Fantasy VI, as it doesn’t apply there.)
I feel the numbers for Sailormoon properly reflect how time-consuming it was for me to get through reading the series…
Yotsuba’s “overall” number is probably low due to a lot of words that appear without kanji that I haven’t added to my “known words” list yet.
Final Fantasy VI probably has a low “overall” due to some names I haven’t removed from the spreadsheet yet. Likely there are also words that appear without kanji that I haven’t added to my known words list. But overall this game just uses some different vocabulary from the manga I read.
These numbers also exclude kanji that appear below a threshold (the “min” column), for the same reason.
Missing from this list: Final Fantasy VI, which has the lowest percentages (61.1% and 80.0%).
IMHO, setting up all of these systems and analytics is real work that can benefit many. You are sacrificing study time for a greater good. ありがとうございます
TBH, the Yotsuba numbers made me laugh and give a vindicated “Aha!”, because I was dying trying to read it since I was better at recognizing kanji as words than strings of kana (particularly with alterations due to casual speech and childish mispronunciations).
Today I was a bit curious: if someone wanted to pre-learn x% of the kanji found in よつばと！ volumes 1 through 15 before they started reading the series, how many kanji would they need to learn?
It’s a silly question, of course. The series has furigana. Start reading it as soon as possible, regardless of acquired kanji, if you’re interested.
But I was still curious.
The following are the statistics I compiled. They tell how many kanji you need to learn in order to recognize x% of the overall kanji in the series.
This is considering only kanji and not vocabulary.
I did not exclude kanji used in character names, although I think it’s worthwhile to do so when learning from a frequency list. You’ll likely recognize the names in context (and this series has furigana), so no need to try learning them when you’re not encountering vocabulary words that use those kanji.
Following WaniKani’s level order:
- To reach about 60%, complete level 9. This covers 310 kanji, 290 from the series.
- To reach about 70%, complete level 12. This covers 425 kanji, 393 from the series.
- To reach about 80%, complete level 18. This covers 628 kanji, 555 from the series.
- To reach about 90%, complete level 30. This covers 1,023 kanji, 816 from the series.
Following the 6th edition of Remembering the Kanji, learning the kanji in the book’s order without skipping any, the numbers are as follows:
- To reach about 60%, complete 1,075 kanji, including 599 from the series.
- To reach about 70%, complete 1,240 kanji, including 686 from the series.
- To reach about 80%, complete 1,582 kanji, including 853 from the series.
- To reach about 90%, complete 1,882 kanji, including 989 from the series.
If you followed RTK in order, but only learned the kanji that appear in よつばと！, you would need the latter number of kanji on each line (599 kanji for 60%, etc).
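As a rough illustration of how numbers like these can be worked out (this is a sketch, not the spreadsheet formulas themselves), the idea is to walk a fixed study order and stop once the learned kanji account for the target share of all kanji occurrences in the series. The function name `kanji_needed_in_order` and the toy counts are made up for the example.

```python
from collections import Counter


def kanji_needed_in_order(study_order, series_counts, target):
    """Walk a fixed study order until `target` coverage is reached.

    Returns (kanji_studied, studied_kanji_that_appear_in_series),
    or None if the study order never reaches the target.
    """
    total = sum(series_counts.values())
    if total == 0:
        return None
    covered = 0
    from_series = 0
    for studied, kanji in enumerate(study_order, start=1):
        count = series_counts.get(kanji, 0)
        if count:
            covered += count
            from_series += 1
        if covered / total >= target:
            return studied, from_series
    return None


# Toy data only, not the real よつばと！ counts.
series_counts = Counter({"大": 50, "人": 30, "違": 15, "姉": 5})
study_order = ["一", "人", "大", "山", "姉", "違"]
print(kanji_needed_in_order(study_order, series_counts, 0.75))
```

The first number in the result corresponds to "complete N kanji" and the second to "including M from the series".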
The advantage of WaniKani and RTK is that the kanji you learn build on what you’ve previously learned. Some more common kanji contain elements that match less common kanji.
If you are learning kanji based on frequency alone, you often miss out on those less common kanji.
For example, 違 is very common in the series (tied with 姉 for 35th on the list), but you won’t see ⻌ on its own, and 韋 is unsurprisingly not to be found in the series.
That said, if you were targeting the kanji you learn based on what appears most frequently in よつばと！:
- To reach about 60%, you need to learn 130 kanji.
- To reach about 70%, you need to learn 196 kanji.
- To reach about 80%, you need to learn 304 kanji.
- To reach about 90%, you need to learn 492 kanji.
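The frequency-only figures come from a cumulative count: sort the kanji by how often they appear, then add them up until the running total reaches the target share. Here is a minimal sketch of that calculation, with a hypothetical function name and toy counts standing in for the real data.

```python
def kanji_needed_by_frequency(series_counts, targets):
    """For each target fraction, return how many of the most frequent
    kanji are needed to cover that share of all kanji occurrences."""
    total = sum(series_counts.values())
    ranked = sorted(series_counts.values(), reverse=True)
    needed = {}
    for target in targets:
        covered = 0
        for rank, count in enumerate(ranked, start=1):
            covered += count
            if covered / total >= target:
                needed[target] = rank
                break
    return needed


# Toy data only, not the real よつばと！ counts.
toy_counts = {"大": 50, "人": 30, "違": 15, "姉": 5}
print(kanji_needed_by_frequency(toy_counts, [0.6, 0.9]))
```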
Next I wondered, what if you started with WaniKani to help get a foundation of learning kanji, then switched over to learning kanji by frequency of use in the series?
If you complete WaniKani through level 10, you learn 348 kanji, of which 323 appear in よつばと！. If you started learning by frequency at this point, how many kanji would be left to learn?
- To reach about 60%, you’re already there.
- To reach about 70%, you need to learn another 13 kanji.
- 俺 違 好 香 供 丈 夫 恵 那 帰 変 緒 待
- 香, 恵, and 那 appear in names.
- To reach about 80%, you need to learn another 66 kanji.
- To reach about 90%, you need to learn another 206 kanji.
If you instead complete WaniKani through level 20, you learn 694 kanji, of which 600 appear in the series. How many kanji would be left to learn if you went by frequency?
- To reach about 80%, you’re already there.
- To reach about 90%, you need to learn another 40 kanji.
- To reach about 95%, you need to learn another 139 kanji.
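The "foundation first, then frequency" question can be sketched the same way: count what the already-learned kanji cover, then keep adding the most frequent unknown kanji until the target is met. This is only an illustration; `extra_kanji_after_foundation` and the toy data are invented for the example.

```python
def extra_kanji_after_foundation(series_counts, known, target):
    """Return the extra kanji to learn (most frequent first) to reach
    `target` coverage, given a set of already-known kanji."""
    total = sum(series_counts.values())
    covered = sum(n for k, n in series_counts.items() if k in known)
    to_learn = []
    if covered / total >= target:
        return to_learn  # already there
    unknown = sorted(
        ((k, n) for k, n in series_counts.items() if k not in known),
        key=lambda item: item[1],
        reverse=True,
    )
    for kanji, count in unknown:
        covered += count
        to_learn.append(kanji)
        if covered / total >= target:
            break
    return to_learn


# Toy data only: pretend the foundation covered 大 and 山.
toy_counts = {"大": 50, "人": 30, "違": 15, "姉": 5}
foundation = {"大", "山"}
print(extra_kanji_after_foundation(toy_counts, foundation, 0.9))
```

Returning the kanji themselves rather than just a count makes it easy to spot ones that only appear in names.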
With a manga (with furigana!) up next in the Intermediate Book Club, I’m planning on giving it a try.
I never heard of Spy x Family before, so I don’t really know anything about it. The sample pages seemed simple enough, but I’m figuring if it was bumped up from Beginner Book Club, there may be some difficulty I didn’t see yet.
I wonder if it uses any more complex grammar for me to learn, or if it’s more text density and reading speed that’ll have it as an intermediate read. Maybe it’s sort of between beginner and intermediate in difficulty.
I picked up the first volume and, of course, I had to see how it looks compared with my learned kanji and vocabulary.
First up is kanji:
I recently adjusted my percentages to factor in only kanji that show up an average of once per volume (or twice for larger volumes).
As a series runs longer, the variety of words (and thus kanji) used increases. For a series such as My Love Story at 13 volumes, those random words (kanji) need to show up at least 13 times to be counted in my percentages. But when I have only one volume, such as is the case for Spy x Family, rarely used kanji that appear in volume one then hardly ever again in the series weigh the unique percentage down a bit. If I like the volume and keep reading, the unique percentage will steadily increase as I increase that threshold.
Side-note: I don’t use a threshold of 1 for OCR’d manga because there may be some mis-read kanji, and this helps weed those out.
Among manga I’ve read or am reading, Spy x Family has the lowest percentage of known kanji for me. Looks like I’ll be taking this opportunity to learn a few more kanji!
Second is vocabulary:
Here, the overall percentage is similarly lower than other manga I’ve read or am reading.
This utilizes the same threshold system, where the longer a series runs the more times a word needs to appear to be counted. Words that appear a few times in volume one, but not in later volumes, are included less in the percentages over time. (That’s also why my Yotsuba percentages are much higher now, as the threshold in my prior post’s screenshot was 3 and now it’s 15 for that series.)
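For anyone who would rather see the threshold logic as code than as spreadsheet formulas, here is a minimal sketch. It assumes "unique" means the share of distinct counted items that are known and "overall" means the share of counted occurrences that are known; the function name, the `per_volume` parameter, and the toy data are all made up for illustration.

```python
def known_percentages(counts, known, volumes, per_volume=1):
    """Return (unique %, overall %), counting only items that appear
    at least volumes * per_volume times."""
    threshold = volumes * per_volume
    counted = {item: n for item, n in counts.items() if n >= threshold}
    if not counted:
        return 0.0, 0.0
    known_items = [item for item in counted if item in known]
    unique_pct = 100 * len(known_items) / len(counted)
    overall_pct = 100 * sum(counted[i] for i in known_items) / sum(counted.values())
    return unique_pct, overall_pct


# Toy data only. 諄 appears once, so any threshold above 1 drops it.
toy_counts = {"大": 40, "人": 30, "違": 20, "姉": 10, "諄": 1}
toy_known = {"大", "人", "姉"}
print(known_percentages(toy_counts, toy_known, volumes=2))
print(known_percentages(toy_counts, toy_known, volumes=13))
```

Raising `volumes` drops the rare items from both the numerator and the denominator, which is why the percentages shift as a series gets longer.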
One of the main characters has a tendency to use very complex words and sentences. I posted about it here in the read every day thread.
haven’t read it but watching the anime … it’s awesome and adorable all at the same time
IMO it got bumped to intermediate because of the amount of text more than the difficulty (based on what happened with deathnote)… but even if it ends up in intermediate book club and it’s on the easier side that’s still ok
I did get the impression that might be it.
And if that’s the case, I’ve (unintentionally) been doing endurance training:
Looks like a novel disguised as manga
That’s how I felt with the first volume of ハヤテのごとく！.
At least for 名探偵コナン there, these heavy pages are mostly at the end of a multi-chapter case, where Conan is explaining how a mysterious crime was carried out.
If only 諄う existed as a verb form.
(Yes, I had to look up 諄い. I don’t know that one yet.)
The author of the manga-ocr tool I use recently released another project, Mokuro.
This one streamlines text segmentation into the process and outputs an HTML file. The idea here is you get a web page to view the manga pages in, but also you can hover over any text in the manga image to display a text copy. This text copy can then be used to do a dictionary lookup via a browser extension, such as Yomichan.
Since I use Migaku’s browser extension, I can also press ` to parse the text the same as I would a web page.
The toolbar at the top-left allows navigating between pages, and two pages are shown at a time (mimicking the appearance of reading a physical comic).
Here, I’ve zoomed in on a single panel. Zooming and panning are easy and responsive using the mouse. (I haven’t checked into keyboard controls yet.)
The blue icon is from the Migaku browser extension.
Hovering over text in the image gives this selectable, parsable text view.
Pressing Shift while hovering over a word with the Migaku browser extension gives a dictionary entry. (Likewise, Yomichan and other extensions can be used, as it’s just text in a browser window.)
Pressing ` allows the Migaku browser extension to parse the displayed text. The extension isn’t able to simply parse all the text in the whole file in one go, as text is normally not displayed and the extension probably doesn’t parse non-visible text.
After selecting the icon to open the card creator, I can copy and paste the text from the word balloon into the card creator, then take a screenshot of the panel to paste in.
This is going to completely change my manga-reading experience for manga without furigana.
Processing a 260-page volume, running on CPUs, took about 51 minutes to complete.
After long gaps between posts, it seems I’m doing a series of posts in rapid succession.
It looks like Mokuro will completely replace my prior method for text segmentation and OCR. It uses the same software but streamlines the process.
The key point is that it creates a JSON file for each page.
The implementation allows Mokuro to pick up where it left off if it crashes or is abruptly stopped, but it also allows regenerating the HTML reader file if there’s an updated version of Mokuro without having to re-extract from the images.
But for me, it gives me something to hook into to utilize the data the same way as my current process.
Thus, my new process for extracting text from manga is:
1) Run Mokuro on a folder with one or more volumes in it:
2) Run a script to extract kanji + frequency from a volume. I can take the output from multiple volumes of one series and drop them into Google Sheets and use UNIQUE and SUMIF for overall counts.
I’ll probably adjust the output of this to include which volume it’s from, so when checking the frequency of a kanji, I can see if it’s used heavily in a single volume.
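A rough sketch of what the kanji-counting step with per-volume detail could look like (not the actual script). It takes already-extracted text per volume and does the equivalent of UNIQUE + SUMIF in one pass. The regex only covers the basic CJK Unified Ideographs block, which is an assumption about what counts as kanji here, and the names are hypothetical.

```python
import re
from collections import Counter

# Basic CJK Unified Ideographs only; rarer extension blocks are ignored.
KANJI = re.compile(r"[\u4e00-\u9fff]")


def kanji_counts(text):
    """Count each kanji in a string."""
    return Counter(KANJI.findall(text))


def series_table(volume_texts):
    """Build rows of (kanji, total, {volume: count}), most frequent first."""
    per_volume = {vol: kanji_counts(text) for vol, text in volume_texts.items()}
    totals = Counter()
    for counts in per_volume.values():
        totals.update(counts)
    rows = []
    for kanji, total in totals.most_common():
        by_volume = {vol: c[kanji] for vol, c in per_volume.items() if c[kanji]}
        rows.append((kanji, total, by_volume))
    return rows


# Toy text standing in for two volumes.
toy_volumes = {1: "今日は違う日", 2: "違う違う"}
for row in series_table(toy_volumes):
    print(row)
```

The per-volume dictionary on each row makes it obvious when a kanji's total comes almost entirely from one volume.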
3) Extract text for text search.
This is for when I want to see where a kanji or vocabulary word shows up in manga I’ve read. Having the image name on the same line as the text helps me locate the manga page to open.
I still need to figure out how I want to implement including the volume number as well. Should be simple enough to parse for it to include.
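Here is a minimal sketch of the text-search extraction. It assumes each page's JSON file has a top-level `blocks` list and that each block has a `lines` list of strings, and that the JSON file name matches the image name. That layout is an assumption in this sketch, so check a real file before relying on it. The function name is hypothetical.

```python
import json
from pathlib import Path


def search_lines(ocr_dir):
    """Yield 'page-name<TAB>text' for each text block found in a folder
    of per-page JSON files (assumed layout: blocks -> lines)."""
    for page in sorted(Path(ocr_dir).glob("*.json")):
        data = json.loads(page.read_text(encoding="utf-8"))
        for block in data.get("blocks", []):
            text = "".join(block.get("lines", []))
            if text:
                yield f"{page.stem}\t{text}"
```

Writing `"\n".join(search_lines(folder))` to a file gives one searchable line per word balloon. Including the volume could be as simple as prefixing each line with the folder name.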
4) Extract text for morphological analysis.
This is basically the same as the last item, except without the image names. This one then gets run through Juman++:
And I plan to write a parser for this to more easily generate a vocabulary frequency list that I can drop into Google Sheets.
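A parser along these lines might look like the sketch below. It assumes Juman++ is producing its JUMAN-style output: one morpheme per line with space-separated fields, the dictionary form in the third field and the part of speech in the fourth, `EOS` between sentences, and lines starting with `@ ` for alternative analyses. The function name and the choice of parts of speech to skip are made up for the example.

```python
from collections import Counter


def vocabulary_counts(jumanpp_output, skip_pos=("助詞", "特殊")):
    """Count dictionary forms in JUMAN-style output, skipping particles
    and symbols by default."""
    counts = Counter()
    for line in jumanpp_output.splitlines():
        # Skip blank lines, sentence markers, and alternative analyses.
        if not line or line == "EOS" or line.startswith("@ "):
            continue
        fields = line.split(" ")
        if len(fields) < 4:
            continue
        lemma, pos = fields[2], fields[3]
        if pos not in skip_pos:
            counts[lemma] += 1
    return counts


# Hand-written sample lines in the assumed format.
sample = "\n".join([
    '違う ちがう 違う 動詞 2 * 0 子音動詞ワ行 12 基本形 2 "代表表記:違う/ちがう"',
    "よ よ よ 助詞 9 終助詞 4 * 0 * 0 NIL",
    "EOS",
    '違った ちがった 違う 動詞 2 * 0 子音動詞ワ行 12 タ形 10 "代表表記:違う/ちがう"',
    "EOS",
])
print(vocabulary_counts(sample).most_common())
```

Counting the dictionary form rather than the surface form means 違う and 違った land on the same row of the frequency list.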
Somehow being sick with a cold means I’m much more productive at getting monotonous projects done.
Because I’m no good at marketing, here is a Venn diagram related to the work I did today:
Disclaimer: Circles are not to scale. The yellow circle should be massively larger and the blue circle should be quite a bit smaller.
My study log being in Campfire means that blue circle should be even smaller.
What was today’s project?
It’s those spreadsheets I keep posting screenshots of, the ones showing my progress in learning kanji and vocabulary words from manga.
- Kanji Master Spreadsheet
- Kanji Series Spreadsheet
- Vocabulary Master Spreadsheet
- Vocabulary Series Spreadsheet
I’m not certain offhand if they have the correct permissions.
The basic idea is that you copy a Master spreadsheet to your Google Docs, then copy tabs from the Series spreadsheet. From there, populate the list of kanji/vocabulary you know to get progress stats and see the most frequent kanji/vocabulary from the series that you may want to learn next.
I’m planning to add every manga series that had a WaniKani book club that had an offshoot club that I participated in. These are:
Series that had their own club without being part of the main book clubs are a lower priority. These are:
P.S. If anyone wants to try them out and let me know of any issues you come across, I’d be quite thankful. I don’t have a second Google account to test with.
I had a look at the Kanji Series spreadsheet, and clicked into “Teasing Master Takagisan” spreadsheet. I was able to look at the kanji list and frequency. I notice that it’s only single kanji and not compound kanji… even so, it seemed surprising to me that 用 appeared only once! I have volumes 1 and 2 of that series, but didn’t poke through to compare. I’m at WK32, but only noticed one Kanji on a quick glance through that I didn’t know.
I’m glad that you were able to reduce steps with the new app
When you say compound kanji, do you mean vocabulary words that utilize two or more kanji?
It’s possible you were seeing 「からかい上手の (元) 高木さん」, which indeed uses 用 only once in the first two volumes. For 「からかい上手の高木さん」, it shows up 8 times in the first five volumes combined.
Aside from the kanji in 西片's name, it looks like the first one you haven’t encountered in WaniKani (although you could know it from outside of WaniKani) would be 緒, which is tied for 60th most frequent kanji in the series (first five volumes).
Here’s what the sheet looks like for me after I loaded it into my “master” sheet where I keep a list of my known/learning kanji, and I filtered out those I know or am learning:
The next one I may want to learn has a frequency of 29, and (not shown here) is the 57th most common kanji in the series. (It’s also one I “learned” in WaniKani, but don’t recognize at all.)
I had thought that would make it more readily found than mine in the Japanese section; I thought Campfire was open to the public (?)
It’s the opposite: Japanese Language and WaniKani are indexed by search engines, but you have to log in to view Campfire threads.
Today I added a column to my frequency spreadsheets to make the learning process more depressing.
(Not really, but it kind of feels like it!)
This new column tells me how many more words or kanji I need to learn to have 100% recognition for a manga. (Excluding words/kanji below a frequency threshold.)
For the first four volumes of the ARIA re-release, I still have 442 kanji left to learn. A big number!
But that’s with it set to only count kanji that show up at least 8 times. I set it at that because four volumes of the re-release cover about eight volumes of the original release, so I want to focus on kanji that appear at least 8 times.
If I drop that down to showing kanji that appear at least 2 times, then I get:
Now it’s 716 more kanji I still need to learn!
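The new column boils down to a filtered count. As a small sketch (hypothetical names and toy numbers, not the real ARIA data), this lists everything at or above the minimum frequency that isn't in the known list, so lowering the minimum makes the list longer:

```python
def left_to_learn(counts, known, min_count):
    """Return unknown items that appear at least `min_count` times,
    most frequent first."""
    return sorted(
        (item for item, n in counts.items() if n >= min_count and item not in known),
        key=lambda item: counts[item],
        reverse=True,
    )


# Toy data only.
toy_counts = {"水": 12, "星": 8, "舟": 3, "漕": 2, "櫂": 1}
toy_known = {"水"}
print(len(left_to_learn(toy_counts, toy_known, min_count=8)))
print(len(left_to_learn(toy_counts, toy_known, min_count=2)))
```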
For all the series I’ve added to this new kanji frequency spreadsheet:
If I lower the frequency requirements to showing up at least 2 times, it becomes:
Hopefully there’s a lot of overlap where one kanji is frequent for multiple series!
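One way to check that overlap would be to compare the sum of the per-series "left to learn" counts with the number of distinct kanji across all series. A minimal sketch with made-up names and toy data:

```python
def combined_left_to_learn(series_counts, known, min_count):
    """Return (sum of per-series unknown counts, distinct unknown kanji
    across all series), using the same minimum frequency everywhere."""
    per_series = {
        name: {k for k, n in counts.items() if n >= min_count and k not in known}
        for name, counts in series_counts.items()
    }
    summed = sum(len(unknown) for unknown in per_series.values())
    distinct = set().union(*per_series.values())
    return summed, len(distinct)


# Toy data only.
toy_series = {
    "A": {"星": 5, "舟": 3, "水": 9},
    "B": {"星": 4, "探": 6, "偵": 6},
}
print(combined_left_to_learn(toy_series, known={"水"}, min_count=2))
```

The bigger the gap between the two numbers, the more one kanji pays off across multiple series.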
For vocabulary, the numbers are a bit less reliable because of parsing errors (not too much of a problem because I cross-reference a list of words mined from Japanese dictionaries) and character names that I haven’t marked as names (to exclude from the stats).
For vocabulary, these numbers actually look quite good! I just need to be less lazy and get to learning them.
If I likewise lower the minimum frequency to be counted, the numbers increase quite a bit:
This looks like Autumn Mode, where all the green turns to shades of red, orange, and yellow.
Time to get back to adding words to Anki!