Which site for kanji readings (discrepancies)?

Apologies if this has been discussed before, but I couldn’t find a thread when I searched for ‘kanji readings’.

I’m looking just at ichi.moe, jpdb and kakimashou and they all have different kanji reading-commonality statistics… Let’s take 早 as an example with the はや and ソウ readings:

  • ichi.moe has はや at 59.09% and ソウ at 31.28%.
  • jpdb.io has はや at 74% and ソウ at a mere 6%.
  • kakimashou has はや at 89.4% and ソウ at an even smaller 2.5%.

So we have 59%-89% for はや and a very wide range of 2.5%-31% for ソウ.

I’m aware that working with native material will eventually help, but at this early stage I wonder if anyone knows which site(s) are more accurate, or how to tell. All three show はや is an important reading, but ソウ is super varied.

What’s going on here is that a kanji’s reading depends on which word it’s in, so the numbers differ depending on which texts each site uses for frequency analysis. That is, it’s not that 早 is read はや or そう according to the reader’s preference, but that 早く and 早い use はや and are far more common words than e.g. 早速 (さっそく) or 早々 (そうそう). However, all these words are reasonably common, and if you read native material you will encounter all of them.

Generally, most kanji have two common readings: the kun’yomi reading, which is used in words that contain just that kanji, or that kanji and some kana, and the on’yomi reading, which is used in compound words with multiple kanji. However, some kanji (disproportionately the most basic and commonly used ones) have multiple readings in one or both categories.

Generally, WaniKani assigns the most common on’yomi reading as “the” kanji reading in the app, as it’s going to be the most useful for the many compound words that WaniKani will teach later. However, there are cases like 早 where they choose to focus on a kun’yomi reading because the kun’yomi words are more commonly used.

JPDB at least bases it on the material they provide decks for, so it’s going to be relatively biased towards words used in anime/manga/light novels, which gives more weight to simpler and more modern language. Another site might use Aozora, an archive of ebooks with expired copyright, which would give comparatively more weight to literary and older words, for example.
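
To make the difference concrete, here is a minimal sketch of two ways such a percentage could be computed: counting each distinct word once, versus weighting each word by how often it occurs in a corpus. The words are the ones mentioned above, but the occurrence counts are invented for illustration, and `reading_shares` is a hypothetical helper, not how any of these sites actually works.

```python
from collections import defaultdict

# (word, reading of 早 in that word, occurrences in some corpus).
# The occurrence counts are invented purely for illustration.
WORDS = [
    ("早い", "はや", 500),
    ("早く", "はや", 400),
    ("早速", "さっ", 60),
    ("早々", "そう", 40),
]

def reading_shares(words, weighted):
    """Return {reading: percentage}, counting each word once or by occurrences."""
    totals = defaultdict(float)
    for _word, reading, count in words:
        totals[reading] += count if weighted else 1
    grand_total = sum(totals.values())
    return {r: round(100 * n / grand_total, 1) for r, n in totals.items()}

by_word = reading_shares(WORDS, weighted=False)   # every distinct word counts once
by_usage = reading_shares(WORDS, weighted=True)   # frequent words count more
print(by_word)
print(by_usage)
```

With these toy numbers はや is 50% of the distinct words but 90% of the occurrences, so the same kanji can end up with very different figures depending on the counting method and the corpus behind it.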

I can imagine all different ways of calculating a percentage to slap next to readings. If the sites don’t tell you what the numbers mean, there’s probably not much reason to rely on them.

I’m not familiar with this website but I think it just gives you the % of unique words in the DB which have a specific reading. For instance if you look at 挨 on ichi.moe it tells you that it has a single match, probably because it only knows of one word with this kanji, the relatively common 挨拶.

On the other hand if you look at 何 it tells you that the カ reading is used 7% of the time and なに only 26% which is laughable. JPDB and kakimashou have much more reasonable stats with カ at 0%, which makes sense since I can only think of one semi-common word using this reading: 幾何学 (geometry).

So ichi.moe is fairly useless for these stats.

Generally speaking, I wouldn’t rely on these types of stats to decide which reading to learn; instead, it may be better to look at high-frequency words containing the kanji and see which readings they tend to use. For instance, the サッ reading of 早 is quite useful, but only for one very common word, 早速, so you’re probably better off learning this specific word on its own rather than memorizing a サッ reading on the kanji alone.
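
As a rough sketch of that approach, assuming you have some word-frequency list to hand: filter it for words containing the kanji and sort by frequency. The list below and its counts are invented for illustration, and `top_words_with` is a hypothetical helper name.

```python
# (word, kana reading, occurrences); counts are invented for illustration.
FREQUENCY_LIST = [
    ("早い", "はやい", 500),
    ("今日", "きょう", 900),
    ("早速", "さっそく", 60),
    ("早く", "はやく", 400),
    ("早々", "そうそう", 40),
]

def top_words_with(kanji, frequency_list, limit=3):
    """Return the `limit` most frequent entries whose word contains `kanji`."""
    matches = [entry for entry in frequency_list if kanji in entry[0]]
    matches.sort(key=lambda entry: entry[2], reverse=True)
    return matches[:limit]

for word, reading, count in top_words_with("早", FREQUENCY_LIST):
    print(word, reading, count)
```

Reading down the result shows which readings actually appear in the words you are most likely to meet, without needing a per-reading percentage at all.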

More generally: the aim is to learn words, not kanji readings. Learning a kanji reading as a separate step is worthwhile only to the extent that it is a stepping stone on the way to learning a word or words. With experience you’ll work out to what extent that stepping stone is necessary for you personally, and to what extent it’s just unnecessary extra work…

Okay, thank you for these responses - they’re all really helpful and useful for me. It’s interesting to learn where jpdb likely takes its data from, and that ichi.moe (which I found quite a few years ago now) is pretty useless for readings. I’ll take everything you’ve all said into account, especially regarding words versus kanji. Makes sense. Really appreciate it :heavy_heart_exclamation:

Yeah frankly ichi.moe should probably add some text explaining clearly what these percentages mean (i.e. pretty much nothing) because as it is it’s quite misleading.

I had thought it looked the most reliable as well, as it’s been around a while and its percentages seemed more balanced… totally throwing off us simple beginners :smiling_face_with_tear:

ichi.moe creator here. The percentages are explained in the description of the Anki deck that was built on this calculation. Basically, it is the percentage of words using each reading within a sample of common Japanese words (the same words you learn in the above-mentioned Anki deck). Words are considered “common” on the basis of the open-source JMdict database. So if there are 26 common words with the reading はや (or ばや) and 14 common words with the reading そう (or ぞう), that’s exactly what you get.
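
A minimal sketch of that word-count calculation, for illustration only: the 26 and 14 are the counts given above, while the 4 remaining words under `"other"` are an assumption, chosen so that the total of 44 reproduces the 59.09% quoted for はや. `type_based_shares` is a hypothetical name, not ichi.moe's actual code.

```python
def type_based_shares(word_counts):
    """Percentage of distinct common words that use each reading."""
    total = sum(word_counts.values())
    return {reading: round(100 * n / total, 2) for reading, n in word_counts.items()}

# 26 and 14 come from the post; the 4 "other" words are an assumption
# so that the total (44) matches the 59.09% quoted for はや.
counts = {"はや": 26, "そう": 14, "other": 4}
shares = type_based_shares(counts)
print(shares)
```

Under that assumed total, そう comes out at 31.82% rather than the 31.28% quoted at the top of the thread, so the size of the remainder should be treated as a guess.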

One could object that 早い is the most frequent word that uses this character, and so the “frequency” should be biased more towards はや. But if you want to learn the language, you have to learn all the words that use this character, not just 早い. And the on’yomi reading actually becomes more useful the further you go.

Same thing with 何. The thing that @simias is missing in their analysis is that なん is another very common reading of 何, even in very common words, and it accounts for almost half of the sample words. And in compound words なに would naturally occur less often than なん. Think of all the counters like 何十, 何本, 何枚 etc.

The reading か also appears in words like 如何にも, but all in all it accounts for 5 words, or 7% of common words, which is accurate. Again, knowing the on’yomi reading is actually very helpful because it often allows you to guess the pronunciation of words you’ve never seen before.

I see where you’re coming from, although I still feel the number you get with your approach is of very limited use. Take your example of 何 with counters: are they really individual words, or just variations of the same prefix? With your algorithm, this decision significantly influences the results.

But overall you can argue that the approach used by other websites is flawed in other ways. One advantage of your method is that it’s not really skewed by the choice of text corpus, for instance.

In the end the best way to decide if a reading is useful or not is probably to look at the vocab and decide for yourself…
