Good ideas all around. Alas, I’ve already considered some of them, and they didn’t work out.
I’ve considered that, but I have this potentially unreasonable fear that I’ll land on a word that legitimately appears in dialogue in a series.
There’s also catching (for example) pages that are ads for another series:
The final piece is that if I want a count of pages with content (excluding non-content pages between chapters), I’d need an algorithm that can distinguish a blank between-chapters page showing the manga logo from an actual content page that has no dialogue but also shows the logo. That would be complex to write.
The only weakness here is that it’ll miss names without the suffix.
However, it could be a good way to auto-generate a list of all the names in a volume to put into my exclusion list for a series, as it only requires each name to appear with a suffix at least once. I’ll have to consider this.
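A minimal sketch of the suffix idea, using a plain regular expression rather than a tokenizer. Everything here is an assumption for illustration: the `HONORIFICS` list, the `count_name_candidates` function, and the choice to accept only kanji or katakana runs in front of the suffix (so names written in hiragana would be missed by this version).

```python
import re
from collections import Counter

# Hypothetical list of honorific suffixes; extend as needed for a given series.
HONORIFICS = ("さん", "ちゃん", "くん", "君", "様", "先輩", "先生")

# A run of kanji (including 々) or a run of katakana, directly followed by an honorific.
NAME_BEFORE_SUFFIX = re.compile(
    r"([\u4e00-\u9fff々]+|[\u30a1-\u30fa\u30fc]+)(?:" + "|".join(HONORIFICS) + ")"
)

def count_name_candidates(text):
    """Count each kanji/katakana run that appears right before an honorific."""
    return Counter(m.group(1) for m in NAME_BEFORE_SUFFIX.finditer(text))

sample = "高木さんが笑った。西片君は負けた。サナエちゃんと高木さん。"
candidates = count_name_candidates(sample)
print(dict(candidates))
```

The result is a list of candidates only. Ordinary words that happen to take さん or ちゃん will show up too, so the list still needs a look before it goes into an exclusion list.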
For now, I’ve been looking up web pages with names of all the characters in a series to quickly put together a list, but some smaller series just don’t have anything like that out there. The alternative has been to look at my generated lists for high-frequency words I don’t know (easy to find when I’ve filtered out the words I do know) and then check if those are names.
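That second approach can be sketched roughly as follows. The function name, the `known_words` set, and the `min_count` threshold are hypothetical, chosen only to show the shape of the filter.

```python
from collections import Counter

def unknown_frequent_words(word_counts, known_words, min_count=5):
    """Return (word, count) pairs for unknown words seen at least
    min_count times, most frequent first."""
    counts = Counter(word_counts)
    return [
        (word, count)
        for word, count in counts.most_common()
        if count >= min_count and word not in known_words
    ]

# Illustrative data only: a small frequency list and a set of known words.
frequency_list = {"高木": 91, "本屋": 6, "表紙": 1, "西片": 7}
known = {"本屋", "表紙"}
to_check = unknown_frequent_words(frequency_list, known)
print(to_check)
```

Whatever comes out at the top of this list is a good place to start checking for names by hand.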
One thing I am doing is using a vocabulary word database I put together to exclude words that aren’t in the database. This actually does a decent job of removing a lot of names, but my database has some common names in it simply because of the various sources I extracted it from. I should probably download a names database and compare the two to find names in my database to remove.
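Comparing the two databases could be as simple as a set intersection. This is a sketch under the assumption that both are available as plain collections of strings; `split_out_names` and the sample data are made up for illustration.

```python
def split_out_names(vocab_words, name_list):
    """Return (vocabulary without overlapping entries, overlapping entries)."""
    vocab = set(vocab_words)
    overlap = vocab & set(name_list)
    return vocab - overlap, overlap

# Illustrative data only.
vocab_db = {"本屋", "高木", "うさぎ", "表紙"}
names_db = {"高木", "うさぎ", "中井"}
cleaned, flagged = split_out_names(vocab_db, names_db)
print(sorted(flagged))
```

The overlap is returned separately instead of being deleted outright, because some entries can be both a name and a normal word and are better reviewed by hand.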
Another consideration is that in some cases names are also valid vocabulary words. I mainly mean real names here, but there are also manga characters with deliberately meaningful names, such as うさぎ.
(At one point I forgot this guy’s name was 律, and couldn’t figure out why someone was suddenly talking about the law in a way that made no sense whatsoever.)
I feel like the more progress I make in projects, the more work I end up giving myself!
Edit: The newly-added code for locating names based on さん etc. is already catching names I missed (where they are written in kana rather than kanji):
からかい上手の高木さん 1
{'高木': 91, '西片': 2, 'ホ': 17, '中井': 1, 'おかげ': 1, 'R': 1, 'サナエ': 2, '本屋': 1, '表紙': 1}
からかい上手の高木さん 2
{'高木': 70, 'ホ': 22, '、': 1, '高尾': 1, '君': 1, 'サナエ': 1, 'ユカリ': 1, '中井': 4, '真野': 1, '私': 1, '9': 1, '.': 1}
からかい上手の高木さん 3
{'ホ': 45, '高木': 63, 'ダンディ': 1, '西片': 2, '中井': 3, '木村': 2, '真野': 1, 'ホラユカリ': 1, 'ユカリ': 1, 'サナエ': 1}
からかい上手の高木さん 4
{'高木': 61, 'ホ': 25, '高尾': 1, '西片': 1}
からかい上手の高木さん 5
{'ホ': 17, '高木': 57, '真野': 2, '中井': 3, 'サナエ': 1, '北条': 2}
からかい上手の高木さん 6
{'ホ': 13, '高木': 47, '私': 1, '中井': 14, '真野': 9, 'ちゃん': 1, 'オレ': 1, '地蔵': 1, '天川': 2, '北条': 3, '.': 1, '西片': 2}
からかい上手の高木さん 7
{'西片': 5, '高木': 46, 'ホ': 17, '木村': 1, 'か': 1}
からかい上手の高木さん 8
{'高木': 45, 'もらっと': 1, '北条': 1, 'ホ': 11, 'ミナ': 1, 'なかい': 1, '中井': 1, '私': 1}
からかい上手の高木さん 9
{'高木': 47, 'ホ': 9, 'ユカリ': 1, 'ちゃん': 1, '西片': 1, '真野': 1, '中井': 1, 'あんた': 1, '北条': 5, 'オレ': 1, '私': 1, 'サンタ': 1, '木': 1}
からかい上手の高木さん 10
{'ホ': 10, '高木': 35, 'ほめるおこるかみつくおばさじ': 1, '真野': 1, '私': 1, '西片': 2, '月本': 3}
A tiny bit of curation is still needed, but this will save me a good amount of time.
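Part of that curation could be automated along these lines. This is a rough sketch, not the code that produced the output above: the `NOT_NAMES` stoplist, the `curate` function, and the rule that drops single-kana tokens are all assumptions that may be too aggressive or too lenient for other series.

```python
import re
from collections import Counter

# Hypothetical stoplist of pronouns and suffix fragments seen in the output above.
NOT_NAMES = {"私", "君", "オレ", "あんた", "ちゃん"}

# Tokens made up entirely of hiragana, katakana, or kanji.
JAPANESE_ONLY = re.compile(r"[\u3041-\u3096\u30a1-\u30fa\u30fc\u4e00-\u9fff々]+")
# A single hiragana or katakana character on its own.
SINGLE_KANA = re.compile(r"[\u3041-\u3096\u30a1-\u30fa]")

def curate(per_volume_counts):
    """Merge per-volume counts and drop entries that are unlikely to be names."""
    total = Counter()
    for counts in per_volume_counts:
        total.update(counts)
    return {
        word: count
        for word, count in total.most_common()
        if JAPANESE_ONLY.fullmatch(word)
        and not SINGLE_KANA.fullmatch(word)
        and word not in NOT_NAMES
    }

# The results for volumes 1 and 2, copied from the output above.
vol1 = {'高木': 91, '西片': 2, 'ホ': 17, '中井': 1, 'おかげ': 1, 'R': 1, 'サナエ': 2, '本屋': 1, '表紙': 1}
vol2 = {'高木': 70, 'ホ': 22, '、': 1, '高尾': 1, '君': 1, 'サナエ': 1, 'ユカリ': 1, '中井': 4, '真野': 1, '私': 1, '9': 1, '.': 1}
merged = curate([vol1, vol2])
print(merged)
```

Entries such as おかげ and 本屋 still get through, so this only reduces the manual pass rather than replacing it.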