Yojijukugo (四字熟語) frequencies in Aozora Bunko

After my first attempt to get a grasp on yojijukugo, I went a bit further this time and looked at some actual yojijukugo, ha!

Instead of just looking at words made up of 4 kanji, which yields a lot of 4-kanji phrases that fall outside the narrow, common definition of yojijukugo, I compared this popular list of yojijukugo against the Aozora corpus (an “online collection [that] encompasses several thousands of works of Japanese-language fiction and non-fiction”).
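For anyone who wants to try something similar, here is a minimal sketch of how such a list-versus-corpus comparison could be done. It is not necessarily the script used for the numbers below: the function name `count_yojijukugo` and the toy inputs are made up for illustration, and it uses plain substring matching without any word segmentation, so an idiom embedded in a longer compound would still be counted.

```python
from collections import Counter
from typing import Iterable


def count_yojijukugo(texts: Iterable[str], idioms: Iterable[str]) -> Counter:
    """Count non-overlapping occurrences of each listed idiom across all texts."""
    idiom_list = list(idioms)
    counts: Counter = Counter()
    for text in texts:
        for idiom in idiom_list:
            n = text.count(idiom)  # str.count returns non-overlapping matches
            if n:
                counts[idiom] += n
    return counts


# Toy stand-ins for the corpus and the idiom list (both hypothetical).
sample_texts = [
    "彼は一生懸命に働いた。自分自身を信じて一生懸命に。",
    "行方不明の猫",
]
sample_idioms = ["一生懸命", "自分自身", "行方不明", "魑魅魍魎"]

counts = count_yojijukugo(sample_texts, sample_idioms)
found = len(counts)  # number of listed idioms that appear at least once
```

Idioms that never occur simply do not show up in the counter, so `len(counts)` gives the number of list entries that appear at least once in the texts.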

The results:

  • 一生懸命 appeared second most often and is one of the few that also appeared in my previous post.
  • The most common is 自分自身.
  • Together these two account for about 8.5% of all yojijukugo occurrences, by far the largest share. After them, the distribution is extremely flat.
  • Of the 5802 yojijukugo on the list, only 3072 appear anywhere in the corpus.
  • Koichi claims that he found counts of up to 20k instances, but if even Aozora (as opposed to e.g. a news corpus) does not contain them, that tells you something about their rarity: ultra rare!
  • Related work can be found in this post, which mentions this list of the top 50. I find the idioms mentioned on that page only in the lower (non-zero) percentages of my full list. I think those are actually more idiom-y, while many of the top ones found here are a bit less idiom-y. Lastly, here is a list for school children.

Finally, I made a nice plot:

And here are the top 100:

四字熟語 count %* meaning
自分自身 4258.0 4.55 oneself
一生懸命 3747.0 4.00 very hard
一人一人 1312.0 1.40 one by one
不可思議 1203.0 1.28 mystery
馬鹿野郎 1030.0 1.10 goddamn idiot
神経衰弱 932.0 0.99 nervous breakdown
行方不明 923.0 0.98 missing (of a person)
彼方此方 771.0 0.82 here and there
御無沙汰 652.0 0.69 not writing for a while
一歩一歩 549.0 0.58 step by step
四方八方 528.0 0.56 in all directions
一所懸命 484.0 0.51 very hard
自由自在 469.0 0.50 free
無我夢中 462.0 0.49 being absorbed in
無茶苦茶 423.0 0.45 nonsensical
徹頭徹尾 411.0 0.43 thoroughly
右往左往 406.0 0.43 moving about in confusion
前後左右 402.0 0.42 in all directions
滅茶滅茶 394.0 0.42 disorderly
傍若無人 380.0 0.40 acting w/o consideration for others
武者修行 364.0 0.38
中途半端 361.0 0.38
相談相手 357.0 0.38
半信半疑 357.0 0.38
因果関係 350.0 0.37
生存競争 340.0 0.36
言語道断 322.0 0.34
前代未聞 315.0 0.33
二言三言 305.0 0.32
実際問題 303.0 0.32
無理矢理 293.0 0.31
自業自得 292.0 0.31
自暴自棄 292.0 0.31
年中行事 286.0 0.30
先祖代々 279.0 0.29
生真面目 277.0 0.29
面目次第 273.0 0.29
大胆不敵 269.0 0.28
一心不乱 267.0 0.28
一部始終 258.0 0.27
二度三度 257.0 0.27
面白半分 256.0 0.27
絶体絶命 254.0 0.27
老若男女 243.0 0.25
正真正銘 238.0 0.25
荒唐無稽 235.0 0.25
一日二日 232.0 0.24
言文一致 230.0 0.24
遮二無二 230.0 0.24
一挙一動 218.0 0.23
異口同音 213.0 0.22
文明開化 213.0 0.22
弥次喜多 209.0 0.22
昨日今日 207.0 0.22
愚図愚図 205.0 0.21
東西南北 204.0 0.21
一世一代 196.0 0.20
公明正大 195.0 0.20
一朝一夕 195.0 0.20
一語一語 193.0 0.20
一体全体 192.0 0.20
半死半生 190.0 0.20
挙国一致 187.0 0.19
大真面目 184.0 0.19
人事不省 183.0 0.19
支離滅裂 180.0 0.19
不得要領 180.0 0.19
立身出世 180.0 0.19
義理人情 179.0 0.19
精神作用 179.0 0.19
後生大事 178.0 0.19
一目瞭然 175.0 0.18
第一印象 172.0 0.18
前後不覚 171.0 0.18
種々様々 170.0 0.18
一伍一什 170.0 0.18
潜在意識 169.0 0.18
神経過敏 168.0 0.17
紳士淑女 167.0 0.17
不真面目 167.0 0.17
自然淘汰 166.0 0.17
千差万別 163.0 0.17
無二無三 163.0 0.17
利害関係 163.0 0.17
自問自答 162.0 0.17
時々刻々 161.0 0.17
尋常一様 161.0 0.17
大同小異 161.0 0.17
風俗習慣 158.0 0.16
馬鹿正直 157.0 0.16
四十九日 156.0 0.16
黄金時代 156.0 0.16
悪戦苦闘 155.0 0.16
人身御供 154.0 0.16
今日明日 154.0 0.16
不承不承 154.0 0.16
真一文字 151.0 0.16
春夏秋冬 150.0 0.16
勝手次第 149.0 0.15
縦横無尽 148.0 0.15

* Percentage of all 四字熟語 occurrences found in the corpus
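As a rough sanity check on the % column, the arithmetic can be sketched like this. The total number of occurrences is not given anywhere in the post, so it is back-calculated here from the top row (4258 hits at 4.55%); treat `estimated_total` as an estimate, not a reported figure. The helper name `share_percent` is made up for this example.

```python
def share_percent(count: float, total: float) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


# Assumption: total occurrences back-calculated from the top row of the table
# (自分自身: 4258 hits = 4.55%). This is an estimate, not a reported number.
estimated_total = 4258 / 0.0455

second_row = share_percent(3747, estimated_total)             # 一生懸命
top_two_share = share_percent(4258 + 3747, estimated_total)   # top two combined
coverage = share_percent(3072, 5802)  # listed idioms that appear at all
```

With these inputs the second row comes out at about 4.00%, the top two together at about 8.55%, and roughly 53% of the listed idioms appear at least once.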

11 Likes

Interesting analysis, thanks. I think this confirms my existing view that there’s no need to make a special study of yojijukugo – they’re just not common enough to be worth treating differently to any other vocabulary.

4 Likes

I’m a little surprised to see 行方不明 below 神経衰弱, but it’s not a massive difference. Kind of wondering also if it’s not a case of past frequency vs more modern literature. It would be interesting to do a comparison of the Aozora books vs works from 1990s and later.

Anecdotally, I see 一生懸命 and 懸命 more often than 自分自身. Actually, I’ve also seen 第一印象 a couple of times.

3 Likes

Yeah. Note that the percentages are percentages of yojijukugo. So 4% means that if you encounter a yojijukugo, there is a 4% chance that it is that particular one, etc. But the probability of coming across a yojijukugo at all is already pretty small. Segmenting the text into words and normalizing the percentages over the whole text would be a whole other piece of work, but just spitballing: if only every 1000th word is a yojijukugo, you can see that the probability of recognizing any particular one is extremely low. It feels like Kanken pre-1 and higher type of knowledge.
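To put rough numbers on that spitballing, here is the arithmetic as a tiny sketch. Both inputs are assumptions: the "every 1000th word" rate is a guess, not something measured on the corpus, and 4% is just the approximate share of a top-ranked idiom.

```python
# Assumption: one in every 1000 words is a yojijukugo (a guess, not measured).
p_any_yojijukugo = 1 / 1000

# Approximate share of a top-ranked idiom among all yojijukugo occurrences.
share_of_idiom = 0.04

# Chance that a given word in running text is that particular idiom.
p_specific = p_any_yojijukugo * share_of_idiom

# Expected number of words read per encounter with that idiom.
words_per_encounter = 1 / p_specific
```

Under these assumptions you would read on the order of 25,000 words per encounter with even one of the most frequent entries, and the idioms further down the list are one to two orders of magnitude rarer than that.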

2 Likes

Agreed with everything you said. Of course, certain yojijukugo will have been more popular in the past than now, or the other way around. I was considering a news corpus, but from what I saw when I looked into it previously, there is not going to be much there at all. Anecdotal observations are hard to make sense of because all of these yojijukugo are so rare that it’s essentially just random except for the top ones, I think.

But yeah, never actually saw one in the wild, haha.

1 Like

You might come across tons of “fake” yojijukugo like 電気会社 or 建設会社 in news articles :joy:

4 Likes

This is literally me 2 months ago. ლ(///´◜⊜`//////ლ)

4 Likes

Except when you plan to sit any higher-level Kanken :stuck_out_tongue_winking_eye:

5 Likes

Or if you’re coming up with crackpot theories to solve a WaniKani ARG (I came across a number of interesting ones back then).

1 Like

I think it is well known that the non-idiomatic forms will outrank the idiomatic ones by sheer frequency. Everything I’ve seen on the Sanbo site has been idiomatic; those are their focus (and arguably the ‘true yojijukugo’). As for how exactly they source their data, they only write「毎日平均22,000人のアクセスデータを集計し、ランキング表示しました。」(“We aggregated access data from an average of 22,000 visitors per day and displayed it as a ranking.”); otherwise I’m not sure how the data bank builds its frequency lists. The site also posted some yojijukugo associated with past manga, which is a fun read for anyone interested.

I think they come up in surprising ways. I’m not that well read at all, but they definitely come up in manga I’ve seen; I’m sure others here who read more can comment better. Maybe it just depends on the type of content one is reading. I was thrilled to recognize 魑魅魍魎 a couple of days ago; I’d had it circulating in my deck forever and it finally showed its face, ha. IDK, if children are studying them, then some light frequency lists seem like a good idea for an adult learner, or at least a fun way to stretch into some extra kanji + cultural learning… some are pretty funny too. I suppose for Kanken learners, studying them is a given.

3 Likes

That one I know from a visual novel – but I remember it a lot better because I remember the context it came up in…

2 Likes
