Statistical Analysis of Manga Text

Continuing from here:


This is something I’ve tried to work out, but never got past the stage of “try random things and see if anything magically works out”.

Here was my playground for that:

I didn’t feel I came up with anything successful, but here is what went into it:

Manga Tabs

Several tabs each cover one manga series. The number of volumes varies, so (for example) the 「からかい上手の高木さん」 tab may include five volumes and the 「名探偵コナン」 tab 12. This inconsistency shouldn’t matter, since everything is averaged per page (as seen in the comparison further below).

For each series, I used Mokuro to OCR the text and Juman++ to split it into words. The OCR is imperfect but really good; results can be so-so with lower-quality manga images.

I created a frequency list of words for the volumes of the series, then I included columns for:

  • Whether the word appears in a dictionary (sourcing word lists from Wiktionary and various Japanese dictionaries)
  • Whether the word is common (according to Jisho)
  • The JLPT level of the word (based on random JLPT lists found online; very imprecise)
  • A word score, based on the above (which I adjusted so many times while testing that it’s basically broken)
  • An overall score based on the word score and the frequency

This overall score tells me…nothing, since the word scoring is broken.
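For anyone wanting to try the same thing outside a spreadsheet, here is a minimal sketch of the frequency list with the lookup columns. It assumes the text is already OCR'd and split into words; the function name, the toy word sets, and the JLPT mapping are all made up for illustration. The word score and overall score are left out on purpose, since no working formula exists for them.

```python
from collections import Counter

def build_frequency_table(tokens, dictionary_words, common_words, jlpt_levels):
    """Return one row per distinct word, most frequent first."""
    counts = Counter(tokens)
    rows = []
    for word, freq in counts.most_common():
        rows.append({
            "word": word,
            "frequency": freq,
            "in_dictionary": word in dictionary_words,
            "is_common": word in common_words,
            # None when the word is on none of the JLPT lists
            "jlpt": jlpt_levels.get(word),
        })
    return rows

# Hypothetical toy data standing in for real word lists.
tokens = ["私", "猫", "猫", "食べる", "猫", "私", "ぬこ"]
rows = build_frequency_table(
    tokens,
    dictionary_words={"私", "猫", "食べる"},
    common_words={"私", "猫", "食べる"},
    jlpt_levels={"私": 5, "猫": 5, "食べる": 5},
)
for row in rows:
    print(row)
```

Words that land on no list (like the slang 「ぬこ」 here) are also a handy way to spot OCR or segmentation errors.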

Totals Tab

This uses information from the series pages.

Columns B through D are based on the broken scoring. The rest of the columns don’t use them.

Column E gives the number of pages in the material. I don’t recall whether I excluded pages without words, but either way this lets me average words per page, so the number of volumes and the number of pages per volume don’t matter.

Columns F through P build a score from the JLPT levels: for each level, the average number of words of that level per page, all rolled into the single score in column P.

This should produce a low number for easier series, and a high number for more difficult series.

The idea is that if a series has easy text, but each page is packed with words, the score goes up. If a series is sparse on text, but the words are really uncommon or complex, the score goes up.
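A rough sketch of that idea, in case it helps. The weights per JLPT level are pure assumptions (the spreadsheet's actual values aren't given here), and "unlisted" is a made-up label for words found on no JLPT list.

```python
from collections import Counter

# Assumed weights: harder levels count for more. Not the spreadsheet's real values.
LEVEL_WEIGHTS = {"N5": 1.0, "N4": 2.0, "N3": 3.0, "N2": 4.0, "N1": 5.0, "unlisted": 6.0}

def jlpt_page_score(word_levels, page_count):
    """word_levels: one JLPT label per word occurrence in the material."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    per_level = Counter(word_levels)
    # Weighted average number of words per page, summed over all levels.
    return sum(LEVEL_WEIGHTS[level] * count / page_count
               for level, count in per_level.items())

dense_easy = jlpt_page_score(["N5"] * 600, page_count=10)   # 60 easy words per page
sparse_hard = jlpt_page_score(["N1"] * 120, page_count=10)  # 12 hard words per page
light = jlpt_page_score(["N5"] * 100, page_count=10)        # 10 easy words per page
print(dense_easy, sparse_hard, light)
```

With these weights, a wordy but easy series and a sparse but hard series both score well above a light, easy one, which is the behaviour described above.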

Next, columns Q through U use word count and unique word count to create a score, seen in column V.
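The exact formula behind column V isn't spelled out, so the following is only a guess at the kind of inputs involved: total words, unique words, and a couple of ratios derived from them. The function name and the choice of ratios are assumptions.

```python
def vocabulary_stats(tokens, page_count):
    """Basic word-count statistics for one series."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    total = len(tokens)
    unique = len(set(tokens))
    return {
        "total_words": total,
        "unique_words": unique,
        "words_per_page": total / page_count,
        "unique_per_page": unique / page_count,
        # Share of word occurrences that are distinct words.
        "type_token_ratio": unique / total if total else 0.0,
    }

# Hypothetical toy data: four words over two pages.
stats = vocabulary_stats(["猫", "猫", "犬", "鳥"], page_count=2)
print(stats)
```

One caveat: the unique-to-total ratio tends to drop as the text gets longer, so it likely isn't comparable between a five-volume tab and a 12-volume tab without some normalization.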

Results

The score in column P seems to do an okay job of ordering series based on difficulty.

It puts 「レンタルおにいちゃん」 as easier than 「ちいさな森のオオカミちゃん」, which I disagree with.

I also consider 「ARIA」 to be more difficult than 「三ツ星カラーズ」, but the results shown may depend on whether I included non-dialogue pages in the scoring.

I certainly would not place 「耳をすませば」 as more difficult than 「ふらいんぐうぃっち」.

These numbers also do not factor in WaniKani kanji levels, nor whether furigana is used.

The score in column V moves 「ちいさな森のオオカミちゃん」 to number 7 on the list, after series such as 「orange」, so that scoring is also off.


Out of curiosity (and a vague personal theory), is there any correlation between whether or not something is a yon-koma and its position on the list? :slightly_smiling_face:


My goal isn’t really to make a single number that you can just compare and be done with; I don’t think that’s easily possible.

Instead, what I envision is that you give it a book, it does some magic, and it spits back a list of factors you could consider. For example:

Manga stats:
Unique word count: 3123 [ABBC 2134 - BBC 5287]
Text density: 34.2 [BBC 33.1 - IBC 54.2]
Word bubbles: 392 [ABBC 214 - IBC 432]
No. of meaningful pages: 213 [BBC 198 - IBC 298]
...

From these stats you could, for example, treat the book you’ve entered as a BBC book, because most of its numbers are decently close to the BBC values.
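One way the "closest club" reading could be automated is a simple vote: for each stat, pick whichever reference value is nearest, then see which club wins most often. This is only a sketch. The reference numbers are copied from the illustrative listing above (they are not real club averages), and the function names are made up.

```python
from collections import Counter

# Reference values per statistic, taken from the illustrative listing above.
REFERENCES = {
    "unique_word_count": {"ABBC": 2134, "BBC": 5287},
    "text_density": {"BBC": 33.1, "IBC": 54.2},
    "word_bubbles": {"ABBC": 214, "IBC": 432},
    "meaningful_pages": {"BBC": 198, "IBC": 298},
}

def nearest_club(value, references):
    """Club whose reference value is closest to the given value."""
    return min(references, key=lambda club: abs(references[club] - value))

def suggest_club(stats):
    """Return the club winning the most stats, plus the vote counts."""
    votes = Counter(nearest_club(value, REFERENCES[name])
                    for name, value in stats.items())
    return votes.most_common(1)[0][0], dict(votes)

book = {
    "unique_word_count": 3123,
    "text_density": 34.2,
    "word_bubbles": 392,
    "meaningful_pages": 213,
}
club, votes = suggest_club(book)
print(club, votes)
```

A plain vote treats every stat as equally important, which probably isn't right; the vote counts are returned as well so you can still judge the individual factors yourself.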


The 4koma 「ご注文はうさぎですか?」 consistently comes out as the most difficult in my selection of manga.

With my JLPT scoring, the easier 4koma 「ひとりぼっちの○○生活」 comes in as easier than 「ハヤテのごとく!」 and 「名探偵コナン」, which sounds right to me.

So I’ve discovered!

Admittedly, my spreadsheet is more about sorting through things I plan to read to see which would be a good target to read next. I included it above in case there might be any interesting ideas for your use.


So when are you finally admitting that making graphs, stats, lists and such is the fun part, and reading Japanese is just the way to get to that?


I’ll admit, learning about the INDIRECT() function likely reduced my manga reading time for a few weeks.


By thought balloons, do you mean all speech bubbles? (Not just thinking cloud-like balloons?)

Mokuro makes this a little tricky: sometimes it joins disconnected dialogue from a two-part balloon (which can be good), and other times it splits a single balloon (not so good).

When one balloon gets split, the pieces overlap in a predictable way, at least in the few examples I’ve looked at. Maybe predictable enough that an update could be made to Mokuro to catch it.
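If the overlap really is that predictable, a post-processing pass could merge overlapping text boxes before counting balloons. Here is a sketch under a few assumptions: boxes are plain (x1, y1, x2, y2) tuples rather than anything read from Mokuro's actual output, the 0.2 threshold is a guess that would need tuning, and this is not how Mokuro itself works.

```python
def overlap_ratio(a, b):
    """Intersection area divided by the smaller box's area."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    smaller = min(area_a, area_b)
    return (width * height) / smaller if smaller > 0 else 0.0

def merge_overlapping(boxes, threshold=0.2):
    """Merge boxes that overlap by at least the (assumed) threshold."""
    merged = []
    for box in boxes:
        box = tuple(box)
        changed = True
        while changed:  # a grown box may now overlap another one, so recheck
            changed = False
            for i, other in enumerate(merged):
                if overlap_ratio(box, other) >= threshold:
                    box = (min(box[0], other[0]), min(box[1], other[1]),
                           max(box[2], other[2]), max(box[3], other[3]))
                    del merged[i]
                    changed = True
                    break
        merged.append(box)
    return merged

# Hypothetical page: two halves of one split balloon plus a separate balloon.
boxes = [(0, 0, 100, 200), (60, 20, 160, 220), (400, 0, 500, 100)]
balloons = merge_overlapping(boxes)
print(len(balloons), balloons)
```

This only addresses the split case. It does nothing about two-part balloons that were joined into one block, so the count would still be an estimate.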

That, yes.

Tbh, I’m only going for a “better-than-by-eye” estimate; if it’s 129 speech bubbles vs 135, that’s fine.
