I’ve always wanted to read Ghost in the Shell 攻殻機動隊 but it’s way beyond me… for now!
I own these manga on Bookwalker.jp and would love to have a vocabulary frequency list for them, so I can track when I know 60-70% of the vocabulary and can start reading.
I can’t find a pre-made word list so I probably have to make my own. I get the sense it should be possible to extract the text from the images with a tool and turn that into a nice frequency list / Anki deck.
Can anyone point me in the right direction of how to go about this?
edit - I also have access to Kitsun if that helps.
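For tracking that 60-70% figure, here is a minimal sketch of one way to measure it, assuming you already have a frequency list (word → count) and a set of words you know. The `coverage` function and the sample counts are made up for illustration; it measures the share of word occurrences you know, not the share of unique words.

```python
from collections import Counter


def coverage(freq: Counter, known: set[str]) -> float:
    """Return the fraction of all word occurrences that are in `known`."""
    total = sum(freq.values())
    if total == 0:
        return 0.0
    known_occurrences = sum(n for word, n in freq.items() if word in known)
    return known_occurrences / total


# Made-up counts, for illustration only (not real data from the manga).
freq = Counter({"公安": 40, "義体": 30, "電脳": 20, "呪い": 10})
known = {"公安", "電脳"}

print(f"{coverage(freq, known):.0%}")  # 60 of the 100 occurrences are known
```

Counting occurrences rather than unique words weights common words more heavily, which is usually closer to how readable a page feels.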
Yeah, I feel like your best bet, if you don’t want to do it manually, is to Mokuro it, parse the OCR data, and automate a frequency list/Anki deck creator or something.
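A rough sketch of the "parse the OCR data" part, under a few assumptions: that Mokuro writes one JSON file per page, that each file has a `blocks` list, and that each block has a `lines` list of strings. Check your own output before relying on that. The function names here are hypothetical, and `kanji_runs` is only a crude stand-in; a real morphological analyser (MeCab, Janome, and so on) would give proper words.

```python
import json
import re
from collections import Counter
from pathlib import Path
from typing import Callable, Iterable


def load_pages(ocr_dir: str) -> list[dict]:
    """Load every per-page JSON file in a directory (path is up to you)."""
    files = sorted(Path(ocr_dir).glob("*.json"))
    return [json.loads(f.read_text(encoding="utf-8")) for f in files]


def block_texts(page: dict) -> list[str]:
    """Join each block's lines so words split across line breaks stay whole."""
    return ["".join(block.get("lines", [])) for block in page.get("blocks", [])]


def count_words(pages: Iterable[dict],
                tokenize: Callable[[str], list[str]]) -> Counter:
    """Tokenize every text block and tally the results."""
    freq = Counter()
    for page in pages:
        for text in block_texts(page):
            freq.update(tokenize(text))
    return freq


def kanji_runs(text: str) -> list[str]:
    """Crude placeholder tokenizer: returns runs of consecutive kanji."""
    return re.findall(r"[一-龯]+", text)


# Hand-written sample in the assumed shape, not real Mokuro output.
sample_page = {"blocks": [{"lines": ["士郎", "正宗の呪い"]},
                          {"lines": ["呪いがかかります"]}]}
freq = count_words([sample_page], kanji_runs)
top_words = freq.most_common()  # list of (word, count), most frequent first
```

Joining the lines before tokenizing matters because speech bubbles wrap mid-word. `count_words` takes the tokenizer as an argument so the placeholder can be swapped out without touching the rest.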
Past!ChristopherFritz can help too. I’ve been using this procedure, though I stopped after step 7 and did the rest by hand, because installing the stuff needed for step 8 was starting to look more tedious than I was prepared to deal with.
There’s supposedly this thing called Mokuro, but I’ve never used it. Maybe another WK user can help you ^^ Man, the community is lightning fast…
My own solution for Higurashi is the following steps with ChatGPT.
1 - Text extraction
prompt:
“Please extract the Japanese text in the image and give it to me in a markdown container. Do not change the original text. Just transcribe exactly as it is in the image.”
If anyone gets through step 6, then sends me the JSON files, I can extract vocabulary lists to add to my manga frequency list website Manga Kotoba, for everyone to benefit from =D
My most recent test with using ChatGPT to extract text from manga, which I tried out a few weeks ago, gave terrible results, unfortunately.
I’m working on making screenshots of all the pages so I can Mokuro them.
I test drove Mokuro on the first two chapters. It seems to do well with the speech bubbles, but it has a harder time with some of the text boxes that have a more creative art direction. Most of those I can correct by manually extracting the text with the Windows ‘Photos’ app → Scan Text (bonus: you can rotate the page first for sideways text :))
The second page is very hard for the tools to parse, though, and it’s completely skipped in the English translation.
I can’t decipher the kanji. Half of them I don’t know, and I really struggle with this hand-written style tbh. If anyone can verify / correct these machine-extracted bits I’d be very grateful!
I didn’t realise before I started this reply how much longer the second and third messages are, so I’m gonna have to put a pause in here and come back later (unless someone else gets here first).
Edit: Misread one of the kanji in the mangaka’s name. I’ve fixed it.
WARNING
この本の一部あるいは全部を士郎
正宗に無断で複製したり賃貸業に
使用すると 法的に罰せられる だ ----- passive of する; だけ got broken into 2 lines
けでなく士郎正宗の呪いがかかり ----- check spelling of the name
ます。ただし個人で楽しむ場合は
この限りではありません。
当単行本は 欄外の補足説明文が多い為、
作品と欄外文を同時進行でお読みになりま
すと、混乱を招きやすく、又作品の流れが
寸断されて楽しさを損いますので、作品と -------- 寸断 (I’ve had trouble with this one.)
欄外文は別々にお楽しみ頂くのがよろしい
かと存じます。