After my first try to get a grasp on Yojijukugo, I went a bit further this time to look at some actual Yojijukugo, ha!
Instead of just looking at words made up of 4 kanji, which yields a lot of 4-kanji phrases that are not in the narrow and common definition of yojijukugo, I compared this popular list of yojijukugo with the Aozora corpus (an “online collection [that] encompasses several thousands of works of Japanese-language fiction and non-fiction.”)
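For anyone who wants to try something similar, here is a minimal sketch of that kind of comparison. It is not the actual script used for this post: the tiny `idioms` list and `text` string are toy stand-ins (assumptions) for the real yojijukugo list and the Aozora text, and matching is plain substring counting with no word segmentation.

```python
from collections import Counter


def count_idioms(text: str, idioms: list[str]) -> Counter:
    """Count non-overlapping substring occurrences of each listed idiom.

    Idioms that never occur are left out of the result.
    """
    counts = Counter()
    for idiom in idioms:
        n = text.count(idiom)
        if n:
            counts[idiom] = n
    return counts


def shares(counts: Counter) -> dict[str, float]:
    """Share of each idiom among all idiom occurrences, most common first."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: c / total for word, c in counts.most_common()}


# Toy stand-ins (hypothetical data, not the real list or corpus).
idioms = ["自分自身", "一生懸命", "魑魅魍魎"]
text = "自分自身を信じて一生懸命に働いた。自分自身のために。"

counts = count_idioms(text, idioms)
coverage = len(counts) / len(idioms)  # fraction of listed idioms that appear at all

for word, share in shares(counts).items():
    print(f"{word}: {share:.1%}")
print(f"coverage: {coverage:.1%}")
```

Substring counting is the simplest choice here; it will also count an idiom that happens to sit inside a longer compound, which a segmentation-based approach would treat differently.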
The results:
一生懸命 appeared the second most often and is one of the few that also appeared in the previous post I made.
The most common is 自分自身.
These two are about 8% of all yojijukugo. They appear by far most often. Afterwards, the distribution is extremely flat.
In the whole corpus, only 3072 of the 5802 listed yojijukugo appear (about 53%).
Koichi claims that he found counts of up to 20k instances, but if even Aozora (as opposed to e.g. a news corpus) does not contain them, that tells you something about their rarity: ultra rare!
A related work can be found in this post that mentions this list of top 50 instances. I find the idioms mentioned on that page only in the lower (non-zero) percentages of my full list. I think those are actually more idiom-y while many of the top ones found here are a bit less idiom-y. Lastly, here is a list for school children.
Interesting analysis, thanks. I think this confirms my existing view that there’s no need to make a special study of yojijukugo – they’re just not common enough to be worth treating differently to any other vocabulary.
I’m a little surprised to see 行方不明 below 神経衰弱, but it’s not a massive difference. Kind of wondering also if it’s not a case of past frequency vs more modern literature. It would be interesting to do a comparison of the Aozora books vs works from 1990s and later.
Anecdotally, I see 一生懸命 and 懸命 more often than 自分自身. Actually, 第一印象 also a couple of times.
Yeah. Note that the percentages are percentages of yojijukugo. So 4% means that if you encounter a yojijukugo, there is a 4% chance that it is the top one, etc. But the probability of coming across a yojijukugo at all is already pretty small. Segmenting the text into words and normalizing the percentages over the whole text would be a whole other piece of work, but just spitballing: if only every 1000th word is a yojijukugo, you can see that the probability of running into any particular one is extremely low. It feels like Kanken pre-1 and higher type of knowledge.
Agreed with all you said. Of course, certain yojijukugo will have been more popular in the past or are more popular now. I was considering a news corpus, but from how I looked into it previously, there is not going to be much at all. All anecdotal observations are hard to make sense of because all of these yojijukugo are so rare that it’s essentially just random except for the top ones, I think.
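To put a rough number on that, here is a back-of-the-envelope sketch. Both inputs are assumptions: the 1-in-1000 rate is just the spitballed figure from above (not measured), and 4% is the example share used above.

```python
def per_word_probability(yoji_rate: float, share_within_yoji: float) -> float:
    """Chance that a random word is one specific yojijukugo.

    yoji_rate: fraction of all words that are yojijukugo (assumed).
    share_within_yoji: that idiom's share among yojijukugo occurrences.
    """
    return yoji_rate * share_within_yoji


yoji_rate = 1 / 1000  # spitballed, not measured
top_share = 0.04      # example share of a single top idiom

p = per_word_probability(yoji_rate, top_share)
words_per_hit = 1 / p  # average number of words read per occurrence

print(f"per-word probability: {p:.6f}")
print(f"about one occurrence every {round(words_per_hit):,} words")
```

Under those assumed numbers, even a top idiom would show up only about once every 25,000 words, and the ones further down the list far less often.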
But yeah, never actually saw one in the wild, haha.
I think it is well known that the non-idiomatic forms will outrank the rest by sheer frequency. Everything I’ve seen on the Sanbo site has been idiomatic, which is their focus (and arguably the ‘true yojijukugo’). As for how exactly they source their data, they only write「毎日平均22,000人のアクセスデータを集計し、ランキング表示しました。」(“We aggregated access data from an average of 22,000 people per day and displayed it as a ranking.”); otherwise I’m not sure how the data bank accrues its frequency lists. The site also posted some fun yojijukugo associated with past manga, which is a fun read for anyone interested.
I think they come up in surprising ways. I’m not that well read at all, but they definitely come up in manga I’ve seen; I’m sure others here who read more can comment better. Maybe it just depends on the type of content one is reading. I was thrilled to recognize 魑魅魍魎 a couple of days ago; I had it circulating in my deck forever and it finally showed its face, ha. IDK, if children are studying them, then some light frequency lists seemed like a good idea for an adult learner, or at least a fun way to stretch into some extra kanji + cultural learning… some are pretty funny too. I suppose for Kanken learners, it is a given to study them.