Statistical Analysis of Manga Text

ChristopherFritz · June 24, 2023, 3:23pm


I was thinking about making a program that could help estimate the relative difficulty of a manga. Without reading the entire thing, the best one can do is look at the number of pages, check whether the manga has furigana, and maybe flip through the book to get a rough sense of the text density.

Furigana is at least easy to check, even from a sample. But the number of pages can be misleading (think Happiness, which has a bunch of empty pages), and text density is hard to gauge, especially if it varies a lot or if you are used to harder manga.

Some metrics that I thought of, and that could easily be extracted from Mokuro output for example, would be:

Number of thought bubbles

Number of characters per page

Pages with at least some text on them (or even pages with at least a given amount of text on them)

Number of unique words

The N levels of the words in the manga (this would need to be combined into a single number probably)
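A rough sketch of how a few of these metrics could be computed, assuming the OCR output has already been loaded into a list of pages, where each page is a list of text-block strings (roughly one per bubble). The function name `page_metrics` and the `min_chars` threshold are made up for illustration; this is not Mokuro's own API or file format.

```python
def page_metrics(pages, min_chars=1):
    """Compute simple text-density metrics.

    pages: list of pages; each page is a list of strings, one per text block
    (assumed to correspond roughly to one bubble).
    min_chars: hypothetical threshold for a page to count as "having text".
    """
    chars_per_page = [sum(len(block) for block in page) for page in pages]
    pages_with_text = sum(1 for n in chars_per_page if n >= min_chars)
    total_chars = sum(chars_per_page)
    return {
        "bubbles": sum(len(page) for page in pages),
        "pages": len(pages),
        "pages_with_text": pages_with_text,
        # Average over all pages, including empty ones.
        "chars_per_page": total_chars / len(pages) if pages else 0.0,
        # Average over only the pages that pass the threshold.
        "chars_per_text_page": total_chars / pages_with_text if pages_with_text else 0.0,
    }


# Three pages: two with dialogue, one empty.
sample = [["おはよう", "うん"], [], ["行ってきます"]]
print(page_metrics(sample))
```

Reporting both averages side by side would show how much empty pages skew the plain per-page number.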

These could then be compared to the averages of previous picks that were deemed appropriate for a given club. From that, one would have at least a decent idea of how a specific book compares to the usual pick.

Are there any other measures that indicate the difficulty of a given book and aren’t hard to extract?

ChristopherFritz · June 24, 2023, 3:23pm

This is something I’ve tried to work out, but never got past the stage of “try random things and see if anything magically works out”.

Here was my playground for that:

I didn’t feel I came up with anything successful, but here is what went into it:

Manga Tabs

Several tabs are for individual manga series. These included multiple volumes, so (for example) the 「からかい上手の高木さん」 tab may include five volumes, and the 「名探偵コナン」 tab may include 12 volumes. This inconsistency shouldn’t matter (as seen in the comparison further below).

For each series, I used Mokuro to OCR the text and Juman++ to split it into words. This is imperfect but really good; results can be so-so with lower-quality manga images.

I created a frequency list of words for the volumes of the series, then I included columns for:

Whether the word appears in a dictionary (sourcing word lists from Wiktionary and various Japanese dictionaries)
Whether the word is common (according to Jisho)
The JLPT level of the word (based on random JLPT lists found online; very imprecise)
A word score, based on the above (which I adjusted to test so much that it’s basically broken)
An overall score based on the word score and the frequency

This overall score tells me…nothing, since the word scoring is broken.
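A minimal sketch of that kind of frequency table, assuming the text is already tokenized into words. The `JLPT` and `COMMON` lookups are tiny made-up stand-ins for real word lists, and the weights in `word_score` are arbitrary, not the ones from the spreadsheet.

```python
from collections import Counter

# Hypothetical lookup tables; real ones would come from word lists.
JLPT = {"食べる": 5, "学校": 5, "経験": 3}  # word -> JLPT level (5 = easiest)
COMMON = {"食べる", "学校", "経験"}         # words flagged as common


def word_score(word):
    """Arbitrary example weighting: harder or unlisted words score higher."""
    level = JLPT.get(word)               # None if the word is in no JLPT list
    score = 6 - level if level else 6    # N5 -> 1 ... N1 -> 5, unlisted -> 6
    if word not in COMMON:
        score += 1                       # small penalty for uncommon words
    return score


def frequency_table(tokens):
    """Return rows of (word, count, word_score, overall_score), most frequent first."""
    counts = Counter(tokens)
    return [(w, n, word_score(w), word_score(w) * n) for w, n in counts.most_common()]


tokens = ["学校", "食べる", "学校", "経験", "妖怪"]
for row in frequency_table(tokens):
    print(row)
```

Keeping the word score in its own function makes it easier to swap the weighting without touching the counting.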

Totals Tab

This uses information from the series pages.

Columns B through D are based on the broken scoring. They are not included in the rest of the columns.

Column E tells how many pages are in the material. I don’t recall if I excluded pages without words, but regardless this helps average the words per page, so the number of volumes and pages in a volume don’t matter.

Columns F through P try to build a score from each JLPT level and how often words of that level appear per page on average, then roll those into a single score.

This should produce a low number for easier series, and a high number for more difficult series.

The idea is that if a series has easy text, but each page is packed with words, the score goes up. If a series is sparse on text, but the words are really uncommon or complex, the score goes up.
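One possible way to turn that idea into a number, as a sketch only: count the words at each JLPT level, divide by the page count, and weight harder levels more. The weights below are invented for illustration and are not the spreadsheet's formula.

```python
# Hypothetical weights per JLPT level; None stands for words in no JLPT list.
LEVEL_WEIGHTS = {5: 1, 4: 2, 3: 3, 2: 4, 1: 5, None: 6}


def density_score(level_counts, page_count):
    """Weighted words-per-page score.

    level_counts: mapping of JLPT level (5..1, or None) -> total occurrences.
    page_count: number of pages the counts were taken from.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return sum(
        LEVEL_WEIGHTS[level] * count / page_count
        for level, count in level_counts.items()
    )


# Easy words on dense pages vs. sparse pages with harder words.
dense_easy = density_score({5: 4000, 4: 1000}, page_count=100)
sparse_hard = density_score({5: 500, 1: 1100}, page_count=100)
print(dense_easy, sparse_hard)
```

With these made-up inputs both series land on the same score, which illustrates the trade-off: lots of easy text and a little hard text can push the number up equally.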

Next, columns Q through U try to utilize word count and unique word count to create a score seen in column V.

Results

The score in column P seems to do an okay job of ordering series based on difficulty.

It puts 「レンタルおにいちゃん」 as easier than 「ちいさな森のオオカミちゃん」, which I disagree with.

I also consider 「ARIA」 to be more difficult than 「三ツ星カラーズ」, but the results shown may depend on whether I included non-dialogue pages in the scoring.

I certainly would not place 「耳をすませば」 as more difficult than 「ふらいんぐうぃっち」.

These numbers also do not factor in WaniKani kanji levels, nor whether furigana is used.

The score in column V moves 「ちいさな森のオオカミちゃん」 to number 7 on the list, after series such as 「orange」, so that scoring is also off.

Belthazar · June 24, 2023, 3:48pm

Out of curiosity (and a vague personal theory), is there any correlation between whether or not something is a yon-koma and its position on the list?

Gorbit99 · June 24, 2023, 3:49pm

My goal isn’t really to make a single number that you can just compare and be done with it, I don’t think that’s easily possible.

Instead what I envision is you give it a book, it does some magic, and spits back a list of factors you could consider. For example like this:

Manga stats:
Unique word count: 3123 [ABBC 2134 - BBC 5287]
Text density: 34.2 [BBC 33.1 - IBC 54.2]
Word bubbles: 392 [ABBC 214 - IBC 432]
No. of meaningful pages: 213 [BBC 198 - IBC 298]
...

And for example from these stats you could consider the book you’ve entered as a BBC book, because in most cases it’s decently close to that.
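A sketch of how that comparison could work, assuming per-club averages are already known. The numbers in `CLUB_AVERAGES` are placeholders (some borrowed from the example output above, the rest made up), and the metric names and `closest_club` are hypothetical.

```python
from collections import Counter

# Placeholder club averages; NOT real book club data.
CLUB_AVERAGES = {
    "ABBC": {"unique_words": 2134, "text_density": 25.0, "bubbles": 214, "pages": 150},
    "BBC": {"unique_words": 5287, "text_density": 33.1, "bubbles": 300, "pages": 198},
    "IBC": {"unique_words": 7000, "text_density": 54.2, "bubbles": 432, "pages": 298},
}


def closest_club(stats, averages=CLUB_AVERAGES):
    """Return (overall_club, per_metric), where per_metric maps each metric
    to the club whose average is nearest to the book's value."""
    per_metric = {}
    for metric, value in stats.items():
        per_metric[metric] = min(
            averages, key=lambda club: abs(averages[club][metric] - value)
        )
    # Simple majority vote across metrics; ties go to the first club counted.
    overall = Counter(per_metric.values()).most_common(1)[0][0]
    return overall, per_metric


book = {"unique_words": 3123, "text_density": 34.2, "bubbles": 392, "pages": 213}
overall, detail = closest_club(book)
print(overall)  # BBC
print(detail)
```

The per-metric breakdown is kept alongside the overall guess, since the point is a list of factors to consider rather than a single verdict.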

ChristopherFritz · June 24, 2023, 3:57pm

The 4koma 「ご注文はうさぎですか？」 consistently comes out as highest difficulty from my selection of manga.

With my JLPT scoring, the easier 4koma 「ひとりぼっちの○○生活」 comes in as easier than 「ハヤテのごとく！」 and 「名探偵コナン」, which sounds right to me.

So I’ve discovered!

Admittedly, my spreadsheet is more about sorting through things I plan to read to see which would be a good target to read next. I included it above in case there might be any interesting ideas for your use.

Gorbit99 · June 24, 2023, 3:59pm

So when are you finally going to admit that making graphs, stats, lists and such is the fun part, and reading Japanese is just the way to get to that?

ChristopherFritz · June 24, 2023, 4:05pm

I’ll admit, learning about the INDIRECT() function likely reduced my manga reading time for a few weeks.

ChristopherFritz · June 25, 2023, 2:22am

By thought balloons, do you mean all speech bubbles? (Not just thinking cloud-like balloons?)

With Mokuro, this is a little tricky, as sometimes it’ll join disconnected dialogue from a two-part balloon (which can be good) and other times it will split a single balloon (not so good).

When one balloon gets split, at least in the few examples I’ve looked at, the resulting text boxes overlap in a predictable way. Maybe to the point that an update could be made to Mokuro to catch it.
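A sketch of how such splits might be caught after the fact, assuming each text block comes with a bounding box as `(x1, y1, x2, y2)` pixel coordinates. The 0.2 overlap threshold and the function names are assumptions for illustration, and this is not part of Mokuro.

```python
def overlap_ratio(a, b):
    """Intersection area divided by the smaller box's area (0.0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return (w * h) / smaller if smaller > 0 else 0.0


def count_balloons(boxes, threshold=0.2):
    """Count text blocks, treating sufficiently overlapping boxes as one balloon."""
    parent = list(range(len(boxes)))  # union-find over box indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlap_ratio(boxes[i], boxes[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two groups
    return len({find(i) for i in range(len(boxes))})


# The first two boxes overlap (one split balloon); the third stands alone.
boxes = [(0, 0, 100, 200), (60, 50, 160, 250), (400, 0, 500, 200)]
print(count_balloons(boxes))  # 2
```

For a "better-than-by-eye" bubble count, a crude merge like this may be enough even if it occasionally joins two genuinely separate balloons.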

Gorbit99 · June 25, 2023, 5:28am

That, yes.

Tbh, I’m only going for a “better-than-by-eye” estimate; if it’s 129 speech bubbles vs 135, that’s fine.

