Here, Have a List of Aozora Books by WK Level

Hello!
I’m always looking for things to do that are not work on my thesis, so the other day I wrote a script to take all of the books on Aozora Bunko and sort them by WK level. Here’s the result! For each book on Aozora, it tells you at what WK level you can read 80, 85, 90, and 95% of the unique kanji. (If people would be interested in a list that looks at percent of all kanji rather than unique kanji, I can probably make that happen too.) I hope this is useful! Link below.

List of Books

Feel free to ask me any questions you have! I also recommend opening it in Excel (if your version of Excel supports Japanese characters) so you can sort it easily however you want to.
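
For anyone curious how percentage columns like these could be computed, here is a minimal sketch (not the actual script). It assumes you already have the set of unique kanji for one book and a dict mapping each kanji to its WK level. The function name `level_for_coverage` and the toy levels are made up for illustration and are not real WaniKani data.

```python
def level_for_coverage(unique_kanji, wk_level, percent):
    """Return the lowest WK level at which at least `percent` percent of
    the book's unique kanji are covered, or None if that is never reached."""
    total = len(unique_kanji)
    if total == 0:
        return None
    # Levels of the kanji that WK teaches at all, lowest first.
    levels = sorted(wk_level[k] for k in unique_kanji if k in wk_level)
    # Number of kanji needed to reach the target (integer ceiling division).
    needed = (percent * total + 99) // 100
    if needed == 0 or needed > len(levels):
        return None
    return levels[needed - 1]

# Toy data: these levels are placeholders, NOT real WaniKani levels.
TOY_LEVELS = {"一": 1, "人": 1, "日": 2, "本": 2, "語": 5}
book = {"一", "人", "日", "本", "語", "經"}  # 經 is not in the toy map

for pct in (80, 85, 90, 95):
    print(pct, level_for_coverage(book, TOY_LEVELS, pct))
```

A result of None corresponds to a book where the threshold is never reached, even at the top level.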

That’s a lot of books.

But oof, what’s with the ones down the bottom where you’d struggle to even reach 80% of the kanji? Are they written in historical kanji or something?

I know, right? It’s a little terrifying - and this list doesn’t include books where you can’t even read 80% of the kanji at level 60! Since these works are all in the public domain, they’re generally pretty old, so archaic kanji is probably a good part of it. Otherwise, only being on level 11 myself, I don’t know enough about the rest of the kanji to make a good guess :confused: If someone wants to look through the unreadable kanji, I’m happy to pull some of that data out and hand it over!

I admit I’m idly curious as to how your script works. Possibly asking the obvious, but did you, for example, remember to exclude kana and punctuation from the totals? I’m having a quick glance through 革命の研究 at the moment, but nothing’s really jumping out at me.

I used a regular expression [re.findall(r'[㐀-䶵一-鿋豈-頻]', bk_text), thanks to this Stack Overflow topic] to pull out only the kanji from the text, then did the calculations using that. I looked at what it grabbed from a few different books and it seemed to do a good job of excluding kana/digits/punctuation/etc. However, it is possible that some of the books use weird encoding that isn’t covered by this regular expression, and I admit I have no idea how it handles characters with furigana (though I think that if it left out all characters that use furigana it would underestimate the necessary WK level, not overestimate it).
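
A small sketch of that extraction step, using the same character ranges as the pattern quoted above (roughly CJK Extension A, the main CJK Unified Ideographs block, and CJK Compatibility Ideographs). The helper name `extract_kanji` is made up here, and the furigana sample assumes the plain-text Aozora notation where the reading follows the kanji in 《》; the original script may have read the files differently.

```python
import re

# Same ranges as the quoted pattern; kana, digits and punctuation fall outside them.
KANJI_RE = re.compile(r'[㐀-䶵一-鿋豈-頻]')

def extract_kanji(text):
    """Return every kanji occurrence in `text`, in order."""
    return KANJI_RE.findall(text)

sample = "革命といふ言葉は、今では、屡々上る。"
print(extract_kanji(sample))
# The iteration mark 々 is not in these ranges, so it is skipped like kana.

# Assumed Aozora-style ruby: the reading in 《》 is kana, so only the base kanji is kept.
print(extract_kanji("唇《くちびる》"))
```

Under that assumption, kanji with furigana are still counted and only the kana reading is dropped.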

Here are the 178 kanji my script marked as non-WK kanji (out of 908) from book ID 50263, 革命の研究 クロポトキン: [‘經’, ‘來’, ‘溺’, ‘勃’, ‘恢’, ‘吏’, ‘斥’, ‘歸’, ‘繼’, ‘畫’, ‘犧’, ‘當’, ‘臺’, ‘鋤’, ‘屡’, ‘鎚’, ‘拜’, ‘僞’, ‘邊’, ‘壞’, ‘慄’, ‘昧’, ‘兒’, ‘點’, ‘毀’, ‘舊’, ‘壯’, ‘滿’, ‘寇’, ‘晝’, ‘總’, ‘餌’, ‘證’, ‘戟’, ‘會’, ‘僭’, ‘陷’, ‘廢’, ‘做’, ‘馴’, ‘顏’, ‘叛’, ‘將’, ‘殘’, ‘鑄’, ‘勸’, ‘關’, ‘讐’, ‘實’, ‘獨’, ‘發’, ‘險’, ‘峽’, ‘勞’, ‘殆’, ‘雜’, ‘尤’, ‘灣’, ‘對’, ‘贊’, ‘或’, ‘帶’, ‘爭’, ‘憬’, ‘默’, ‘惧’, ‘禮’, ‘痺’, ‘穢’, ‘觸’, ‘嚴’, ‘墮’, ‘揆’, ‘溌’, ‘隸’, ‘亂’, ‘寢’, ‘樂’, ‘伽’, ‘參’, ‘檢’, ‘豫’, ‘戮’, ‘噺’, ‘牢’, ‘乘’, ‘饉’, ‘逐’, ‘斧’, ‘輕’, ‘擴’, ‘槍’, ‘與’, ‘些’, ‘圍’, ‘曾’, ‘厭’, ‘變’, ‘髓’, ‘氣’, ‘礙’, ‘體’, ‘戲’, ‘處’, ‘缺’, ‘膽’, ‘數’, ‘聯’, ‘於’, ‘轡’, ‘册’, ‘聽’, ‘餘’, ‘愈’, ‘據’, ‘蔽’, ‘獸’, ‘萬’, ‘曖’, ‘驅’, ‘壓’, ‘辿’, ‘莫’, ‘國’, ‘鬪’, ‘腦’, ‘輿’, ‘從’, ‘黨’, ‘濫’, ‘觀’, ‘圖’, ‘戰’, ‘坐’, ‘瞞’, ‘爲’, ‘顫’, ‘顛’, ‘臆’, ‘賣’, ‘朧’, ‘勿’, ‘惡’, ‘遲’, ‘權’, ‘盡’, ‘續’, ‘稱’, ‘廣’, ‘云’, ‘臟’, ‘聲’, ‘讓’, ‘斷’, ‘濟’, ‘廻’, ‘辨’, ‘掠’, ‘饑’, ‘嘗’, ‘爵’, ‘鐵’, ‘學’, ‘傳’, ‘狹’, ‘騷’, ‘靈’, ‘瞥’, ‘拂’, ‘眞’, ‘蔭’, ‘彦’, ‘區’, ‘攝’, ‘嚇’, ‘藝’, ‘奧’, ‘勵’] Some of them look pretty simple… lemme look into a different regex and see what that does?

Some of them also seem to be a bit dated - for example, 奧 is an older form of 奥, which is on WaniKani. Not that I can actually find 奧 in the version of the story I’m looking at, though 奥 is there. Maybe I’ve got a different edition…

Edit: Aha, I was looking at the book with ID 50507, which is also present on your list, but much further up.

That makes a lot of sense! I tried a couple more regexes and they gave me the same result, so I bet there are a lot of cases like the example you found. Ah, the perils of public domain reading material!

Ah, the perils of major typographic overhauls. :slightly_smiling_face:

Just for a quick example, here’s the first sentence of each - top one is 50507, bottom one is 50263. Aside from the kanji, there’s an archaic ~といふ there as well. Even native speakers might struggle to read bits without a dictionary, though of course context helps.

革命という言葉は、今では、被圧制者の唇にも、また所有者の唇にすらも、しばしば上る。
革命といふ言葉は、今では、被壓制者の唇にも、また所有者の唇にすらも、屡々上る。
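
One way to soften the old-form/new-form mismatch before looking kanji up in the WK list is to map old forms to their modern equivalents first. This is only a sketch: the table below is a hand-picked handful of pairs (some taken from the non-WK list earlier in the thread), nowhere near complete, and `normalize_variants` is a made-up name rather than anything the original script does.

```python
# Hand-picked old-form -> new-form pairs; illustrative only, far from complete.
OLD_TO_NEW = {
    "奧": "奥",
    "壓": "圧",
    "國": "国",
    "學": "学",
    "體": "体",
    "氣": "気",
    "會": "会",
}
_TABLE = str.maketrans(OLD_TO_NEW)

def normalize_variants(text):
    """Replace known old-form kanji with their modern forms; leave the rest alone."""
    return text.translate(_TABLE)

print(normalize_variants("被壓制者"))  # the word from the 50263 sentence above
```

This only handles one-to-one character variants. Spelling differences such as といふ versus という, or 屡々 versus しばしば, would need separate handling.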

Yup, that too :laughing: This is fascinating though! …and makes me feel much more optimistic about how much Japanese I’ll actually be able to read when (if? no, when!) I hit level 60.

Thank you for making this!!

I’m happy that it’s helpful! Looking forward to next level when I can finally read 80% of my first book :relaxed:

Sorry for the noob question but if I am reading that list correctly, I should be able to start reading some of the books on the list once I hit Level 10?

If that’s the case… only 6 more levels to go! :smiley:

You can even start earlier than that :slight_smile: It’s just that you’d need to look up some more kanji (but that’s not too painful as you can copy-paste them into Jisho and look them up).

The bigger limitation is usually your level of Japanese grammar and general vocab knowledge, though.

But there’s nothing wrong with giving these books a try and seeing what you can get out of them!

Here’s my thing with grammar. I’m able to understand a lot of spoken Japanese watching TV shows, Anime, etc so I think, as long as I can “hear” the words in my head, I can HOPEFULLY (fingers crossed) understand it without getting the grammar rules straight in my head. But I can’t do that until I know more Kanji unless it is all written in kana which… well I don’t know if that is useful for learning Kanji. :sweat_smile:

I think I will bookmark this list (already did) and see what I can do around level 8 in a couple weeks… or months maybe given my speed. :smiley:

Wow, sounds like your level of understanding spoken language is pretty high!

I just had a look at the lowest-rated story from that list, かいじん二十めんそう, and it’s actually mostly in Kana, so even if you don’t know some of the kanji yet, they are pretty rare anyways in that text.

It’s pretty hard to tackle everything at once though… Learning to read text is probably still a bit different from listening to TV because it is written and not spoken language after all… And learning Kanji is yet another skill. I think there’s nothing wrong with somewhat separating those two at least in the beginning, but ymmv of course :slight_smile:

A lot of these works also have furigana on some kanji (especially the tougher kanji), so if you’re pretty good with spoken Japanese you might actually be able to handle books that are marked as higher level!

Hero! Thanks!

:+1:

A lot of those you can also read on Yomi.ai with some extra helpful features, so feel free to add the Yomi.ai link to your list, too.

If you have an account with us and WaniKani, you can also sort all of our readings based on your burned kanji.

Hey @kemily88 :slight_smile: I was just thinking of doing something similar to what you’ve done, except I was thinking to make it so you can paste a sentence / paragraph in and get the readability level according to the WK difficulty. BUT I’ve never done any string matching with Japanese before, and I’m not a super great programmer (to put it mildly).

How long did this take you to do? Did you happen to share it on GitHub by any chance?
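
A rough sketch of what such a paste-in checker might look like, assuming you have a kanji-to-level dict from somewhere. The name `readability` and the toy levels below are placeholders, not real WaniKani data. Unlike the book list, this counts every kanji occurrence rather than unique kanji.

```python
import re
from collections import Counter

# Same kanji ranges as the regex quoted earlier in the thread.
KANJI_RE = re.compile(r'[㐀-䶵一-鿋豈-頻]')

def readability(text, wk_level, user_level):
    """Return (share of kanji occurrences known at `user_level`,
    sorted list of the kanji that are not known yet)."""
    counts = Counter(KANJI_RE.findall(text))
    total = sum(counts.values())
    if total == 0:
        return 1.0, []  # no kanji at all, nothing to look up
    unknown = sorted(
        k for k in counts
        if k not in wk_level or wk_level[k] > user_level
    )
    known = total - sum(counts[k] for k in unknown)
    return known / total, unknown

# Placeholder levels for illustration only, NOT real WaniKani levels.
TOY_LEVELS = {"今": 3, "言": 5, "葉": 10, "上": 1}

coverage, unknown = readability("今では、言葉は上る。今", TOY_LEVELS, user_level=5)
print(f"{coverage:.0%} of kanji known, still to learn: {unknown}")
```

String matching with Japanese in Python 3 needs nothing special, since strings are Unicode and the regex works on characters directly.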
