How to create a vocabulary deck from manga?

I’ve always wanted to read Ghost in the Shell 攻殻機動隊 but it’s way beyond me… for now!
I own these manga on Bookwalker.jp and would love to have a vocabulary frequency list for them so I can track when I get to the 60-70% range and can start reading.
I can’t find a pre-made word list so I probably have to make my own. I get the sense it should be possible to extract the text from the images with a tool and turn that into a nice frequency list / Anki deck.
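To make the "60-70% range" concrete, here is a minimal sketch of how coverage could be computed once a frequency list exists. It assumes coverage means the share of word occurrences (not unique words) that you already know; the function name `coverage` and the sample counts are made up for illustration.

```python
from collections import Counter

def coverage(freq: Counter, known: set[str]) -> float:
    """Share of all word occurrences in the text that are known words."""
    total = sum(freq.values())
    if total == 0:
        return 0.0
    known_hits = sum(count for word, count in freq.items() if word in known)
    return known_hits / total

# Hypothetical counts, for illustration only (not real data from the manga).
freq = Counter({"電脳": 6, "義体": 3, "公安": 1})
known = {"電脳"}
print(f"{coverage(freq, known):.0%}")  # 6 of 10 occurrences are known -> 60%
```

Counting occurrences rather than unique words means a handful of very common words can move the number a lot, which is usually what matters for readability.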

Can anyone point me in the right direction of how to go about this?

  • edit - I also have access to Kitsun if that helps.

Thanks!

1 Like

There is a vocab list for the anime on JPDB: Ghost in the Shell: Stand Alone Complex – Prebuilt decks – jpdb
(EDIT: Oh I found some more lists: Prebuilt decks – jpdb)
I neither know the manga nor the anime, so I don’t know how similar the vocab are, but it might be a good starting point.

1 Like

I’ve seen those. But sadly the Stand Alone Complex is a spin-off series. I haven’t found word lists for the original manga (Ghost in the Shell, Ghost in the Shell 1.5 and Ghost in the Shell 2)

2 Likes

Yeah, I feel like your best bet, if you don’t want to do it manually, is to mokuro it, parse the OCR data, and automate a frequency list/Anki deck creator or something.
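As a rough sketch of the "parse the OCR data" step: the code below assumes mokuro writes one JSON file per page containing a `blocks` list, where each block has a `lines` list of strings. That layout is an assumption to check against your own output, and the function names are made up. Lines inside a block are joined without a separator, because words often break across lines in speech bubbles.

```python
import json
from pathlib import Path

def page_text_blocks(page: dict) -> list[str]:
    """Join each block's OCR lines so words split across lines are whole again."""
    return ["".join(block.get("lines", [])) for block in page.get("blocks", [])]

def volume_text_blocks(ocr_dir: str) -> list[str]:
    """Read every page JSON in a folder (assumed layout), in file-name order."""
    blocks = []
    for path in sorted(Path(ocr_dir).glob("*.json")):
        page = json.loads(path.read_text(encoding="utf-8"))
        blocks.extend(page_text_blocks(page))
    return blocks

# Hypothetical page in the assumed layout; 神経 is split across the two lines.
sample_page = {
    "blocks": [
        {"lines": ["過剰成長で細胞が死にかけて、各所で神", "経繊維の断裂が見られる。"]}
    ]
}
print(page_text_blocks(sample_page))
```

The joined blocks would still need a Japanese tokenizer (a morphological analyzer) before counting words; that part is left out here.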

@ChristopherFritz might be able to help, if he is so inclined

2 Likes

Past!ChristopherFritz can help too. I’ve been using this procedure, though I stopped after step 7 and did the rest by hand, because installing the stuff needed for step 8 was starting to look more tedious than I was prepared to deal with. :stuck_out_tongue:

3 Likes

There’s supposedly this thing called mokuro but I never used it. Maybe another WK user can help you ^^ Man the community is lightning fast…

My own solution for Higurashi is the following steps with ChatGPT.

1 - Text extraction

prompt:
“Please extract the Japanese text in the image and give it to me in a markdown container. Do not change the original text. Just transcribe exactly as it is in the image.”

Example:

2 - Add text to JPDB.

2.1 - Create a new empty jpdb deck
2.2 - Add the text


Delete any unwanted items and, when you’re happy, select “Add vocabulary to your deck”.

3 - ...
4 - Profit !
4 Likes

You could try ChatGPT. I upload screenshots of my next book and tell ChatGPT:

  1. To make a list of all words that appear in one screenshot (= one page). Then I copy-paste it into jpdb.

  2. To make a list of just the new vocabulary that didn’t appear in the previous screenshots.
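If you already have one word list per page, the "just the new vocabulary" step can also be done locally instead of asking the chatbot to remember earlier pages. This is a minimal sketch; the function name and sample words are made up for illustration.

```python
def new_words_per_page(pages: list[list[str]]) -> list[list[str]]:
    """For each page, keep only words not seen on any earlier page."""
    seen: set[str] = set()
    result = []
    for words in pages:
        fresh = []
        for word in words:
            if word not in seen:  # also drops repeats within the same page
                seen.add(word)
                fresh.append(word)
        result.append(fresh)
    return result

# Hypothetical word lists for two pages.
pages = [["電脳", "義体", "電脳"], ["義体", "公安"]]
print(new_words_per_page(pages))  # [['電脳', '義体'], ['公安']]
```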

1 Like

Since you are using Bookwalker, you might need a script to extract images from a volume of manga. Mokuro: Read Japanese manga with selectable text inside a browser - #280 by ChristopherFritz

Or you might not need a script. Just Print Screen, and turn pages, and repeat.

Then, paste into a GPT, though my favorite for uploading images is DeepSeek, since it lets you upload images without a hard limit.

Then, make a prompt to extract words.

Make a list of all Japanese words with reading and meaning, sorted by most common first.

Then, on the same chat thread, reply with more images.

Then, when you have pasted all the images, or you want to put it on hold anyway, reply by telling GPT to summarize.

Summarize into a list of all words sorted by most common first.

Something like this.
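The final "sorted by most common first" summary can also be computed outside the chat, which avoids relying on the model to count correctly over a long thread. A minimal sketch, assuming you have saved one word list per page; the names and sample words are hypothetical.

```python
from collections import Counter

def frequency_list(pages: list[list[str]]) -> list[tuple[str, int]]:
    """Merge per-page word lists into (word, count) pairs, most common first."""
    counts: Counter = Counter()
    for words in pages:
        counts.update(words)
    return counts.most_common()

# Hypothetical word lists for two pages.
pages = [["電脳", "義体", "電脳"], ["義体", "公安", "電脳"]]
for word, count in frequency_list(pages):
    print(word, count)
```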


You might also try the jpdb method. I think there is a UserScript to export to Anki or to CSV. (Not sure about details on this part.)
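If the jpdb export turns out to be awkward, a word list you already have can be written as a tab-separated text file, which Anki can import. This sketch does not use jpdb or any UserScript; the function name, the three-field layout (word, reading, meaning), and the sample row are assumptions for illustration.

```python
import csv
import io

def write_anki_tsv(rows, out) -> None:
    """Write (word, reading, meaning) rows as tab-separated lines."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for word, reading, meaning in rows:
        writer.writerow([word, reading, meaning])

# Illustrative row; written to memory here instead of a file.
buffer = io.StringIO()
write_anki_tsv([("電脳", "でんのう", "cyberbrain")], buffer)
print(buffer.getvalue())
```

For a real file, something like `open("deck.tsv", "w", encoding="utf-8", newline="")` in place of the `StringIO` buffer should work, with the fields mapped to your note type during import.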

If anyone gets through step 6, then sends me the JSON files, I can extract vocabulary lists to add to my manga frequency list website Manga Kotoba, for everyone to benefit from =D

My most recent test with using ChatGPT to extract text from manga, which I tried out a few weeks ago, gave terrible results, unfortunately.

5 Likes

That’s very generous. I have never tinkered with this stuff, but if I manage to extract the text I’ll be certain to send it to you. Thanks! :person_bowing:

2 Likes

Progress has been made!

I’m working on making screenshots of all the pages so I can Mokuro them.
I test drove Mokuro on the first two chapters. It seems to do well with the text bubbles. But it has a harder time with some of the text boxes with a more creative art direction. Most I can correct by a manual extract with the Windows ‘Photos’ → Scan Text (bonus: you can rotate the page first for sideways text :))

The second page is very hard to parse for the tools though. And it’s completely skipped in the English translation :upside_down_face:
I can’t decipher the Kanji. Half of them I don’t know and I really struggle with this hand-written style tbh. If anyone can verify / correct these machine extracted bits I’d be very grateful!

p2 WARNING

WARNING
この本の一部あるいは全部を士郎
正宗に無断ご機製したり情貸業に
使用すると 法的に罰せうれるだ
りでなく工郎正家のいがかかり
ます。ただし個人で楽しむ湯合は
この限りどはありまぜん。

p2 CAUTION

当単行本は 欄外の補足説明又が多い為、
作品と欄外文を同時進行でお読みになりま
すと、混れを招きやすく、又作品の流れが
ゴ断されて楽し心を損いますので、作品と
欄外文は別々にお楽しみ頂くのがよろしい
かと字じます りまにい液**

又、この物語はフィクションであり 登場
する名燃·設定·小道具·説明文は全て架
空の産物です。ごの単行本の肩報によって
読者或いばその周辺話氏が何らかの損害又
は安信に至,たとしても当方は一切責任を
持ちませんのご思しかうす。それら全しを
純料に楽しんで便ければ幸いかす
土郎正宗 7.7.1991.

** The text tool was adamant that the blurred text is in that section. An Easter egg?

p6 こ れ は 1998 年

これは1998年、播磨研究学園都市て創られた
成長型ニューロチッブを5万倍に拡大したも
の。過剰成長で細胞が死にかけて、各所で神
経繊維の断裂が見られる。ボリスチレンに乳
糖(ガラグドース)をのせた誘導体などで構
成される端子にま?????長し、瑞子を?
刷してある薄膜を?ませている。同月、マイ
クロマシンによる補助電?を?う医?に、メ
ディアを核とする巨木資本がネットを始め、
電脳技術はマイクロマシンをベースにしたも
のに移ってゆく。2028年、ニューロチップは
AIやロボットに多く使用されている

Here it’s mostly the number tag that causes issues.

1 Like

I didn’t realise before I started this reply how much longer the second and third messages are, so I’m gonna have to put a pause in here and come back later (unless someone else gets here first).

Edit: Misread one of the kanji in the mangaka’s name. I’ve fixed it.

4 Likes

Lemme try. Normally you should take names and vocab into account as well. Line breaks are ignored and words can break anywhere.

Think of it as vocab practice.

p2 WARNING

WARNING
この本の一部あるいは全部を士郎
正宗に無断機製したり賃貸業(ちんたいぎょう)
使用すると 法的に罰せられる ----- passive of する; だけ got broken into 2 lines
でなく士郎正宗(のろ)がかかり ----- check spelling of the name
ます。ただし個人で楽しむ場合(ばあい)
この限りではありません

p2 CAUTION

当単行本は 欄外の補足説明文(せつめいぶん)が多い為、
作品と欄外文を同時進行でお読みになりま
すと、混乱(こんらん)を招きやすく、又作品の流れが
寸断(すんだん)されて楽(たの)しさを損いますので、作品と -------- 寸断(すんだん) (I’ve had trouble with this one.)
欄外文は別々にお楽しみ頂くのがよろしい
かと存(ぞん)じます

又、この物語はフィクションであり登場
する名称(めいしょう)·設定·小道具·説明文は全て架 -------- 架空(かくう) got broken into 2 lines
空の産物です。ごの単行本の情報(じょうほう)によって
読者或(ある)いはその周辺諸氏(しょし)が何らかの損害又
は妄信(もうしん)に至(いた)ったとしても当方は一切責任を
持ちませんので 悪(あ)しからず。それら全しを
純粋(じゅんすい)に楽しんで頂(いただ)かれば幸いです

土郎正宗 7.7.1991。

p6 こ れ は 1998 年

これは1998年、播磨研究学園都市て創られた
成長型ニューロチップを5万倍に拡大したも
過剰成長で細胞が死にかけて、各所で神 ----- 神経(しんけい) got broken into 2 lines
経繊維の断裂が見られる。ポリスチレンに乳 ----- I’ve had a problem with polystyrene; 乳糖(にゅうとう) got broken into 2 lines.
糖(ガラクトース)をのせた誘導体などで構
成される端子にまで 繊維(せんい) 成長(せいちょう)し、瑞子を印(いん) ----- 印刷(いんさつ) got broken into 2 lines
刷(さつ)してある薄膜 歪(ひず)ませている。同月、マイ ----- not sure about actual reading of 歪ませている; マイクロマシン got broken into 2 lines
クロマシンによる補助電脳(でんのう)使(つか) 医療(いりょう)に、メ ----- メディア got broken into 2 lines
ディアを核とする巨大(きょだい)資本がネットを始め、
電脳技術はマイクロマシンをベースにしたも ----- もの got broken into 2 lines
のに移ってゆく。2028年、ニューロチップは
AIやロボットに多く使用されている

3 Likes

@polv @Belthazar

2 Likes