Mokuro: Read Japanese manga with selectable text inside a browser

Gorbit99 · August 5, 2023, 3:43am

Use either double backslashes (\\) or forward slashes (/) in your path. What this is telling you is that the interpreter is trying to interpret \U, for example, as an escape sequence, and fails.

Akashelia · August 5, 2023, 4:21am

As Gorbit said. Or rewrite the above to:
C:\\Users\\user\\Desktop\\Set7\\Examples\\Man

Same with base_folder
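The same rule applies to the Python script. As a minimal sketch (using the example path that appears later in this thread), the three spellings below all name the same Windows folder. ntpath is the standard-library module that implements Windows path rules; it is used here only so the check also runs on Linux and macOS.

```python
import ntpath  # Windows flavour of os.path, importable on any OS

# A plain 'C:\Users\...' literal fails to compile in Python because \U
# starts a Unicode escape. Each spelling below avoids that.
escaped = 'C:\\Users\\user\\Desktop\\Set7\\Examples\\Manga'
raw = r'C:\Users\user\Desktop\Set7\Examples\Manga'
forward = 'C:/Users/user/Desktop/Set7/Examples/Manga'

print(escaped == raw)                   # True
print(ntpath.normpath(forward) == raw)  # True
```

A raw string (r'...') is usually the least error-prone choice, as long as the path does not end in a backslash.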

ChristopherFritz · August 5, 2023, 4:59am

I hope to test this a bit on Windows this weekend.

I had planned to do that tonight, but setting up DKIM on an e-mail server took about three hours longer than the 5 mins I expected it to take…

ChristopherFritz · August 5, 2023, 7:58pm

Manga Text Search: Python Version

This is a re-write of the text search script, this one written in Python.

This one creates the “Searchable Text” files, so a separate process to extract the text is not needed.
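To give an idea of what ends up in a “Searchable Text” file, here is a minimal sketch of the flattening step. The page data is made up for illustration and only mirrors the two keys the script reads (blocks and lines); page_to_search_lines is a hypothetical helper name, not part of search.py.

```python
import json

# Hypothetical page data in the shape the script reads: a list of
# "blocks", each holding a list of "lines".
page_json = json.loads('''
{"blocks": [
  {"lines": ["ありがとう", "ございます"]},
  {"lines": ["拙者"]}
]}
''')


def page_to_search_lines(page_name, page):
    """Return one 'page<TAB>text' line per text block."""
    return [
        f"{page_name}\t{''.join(block['lines'])}"
        for block in page.get('blocks', [])
        if 'lines' in block
    ]


search_lines = page_to_search_lines('018', page_json)
print('\n'.join(search_lines))
```

Joining the lines of a block means a phrase that wraps across two lines in a speech bubble can still be found by a single search.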

Main File

search.py

from natsort import natsorted
import json, os, re

manga_folder = '/home/chris/Books/Comics/Japanese/'
search = "ありがとうご"

# Be sure to install the following:
#   pip install natsort

# Note: This assumes manga volume folders are organized into series folders, such as:
# - Dragon Ball
#   - Dragon Ball Volume 1
#   - Dragon Ball Volume 2
# - Sailormoon
#   - Sailormoon Volume 1
#   - Sailormoon Volume 2

def find_ocr_folders(directory):

    cache_file = 'ocr_folders.txt'
    subfolders_with_name = []

    if os.path.exists(cache_file):
        # Reuse the cached list; delete ocr_folders.txt to force a rescan.
        with open(cache_file, 'r', encoding='utf-8') as file:
            subfolders_with_name = [line.strip() for line in file.readlines()]

    else:
        for dirpath, dirnames, filenames in os.walk(directory):
            if '_ocr' in dirnames:
                subfolders_with_name.append(os.path.join(dirpath, '_ocr'))
        # Write as UTF-8 so folder names with Japanese characters survive on Windows.
        with open(cache_file, 'w', encoding='utf-8') as file:
            file.write('\n'.join(subfolders_with_name))

    return subfolders_with_name


def list_subdirectories(directory):

    subdirectories = []
    for entry in os.listdir(directory):
        entry_path = os.path.join(directory, entry)
        if os.path.isdir(entry_path):
            subdirectories.append(entry_path)
    return subdirectories


def get_parent_folder_name(path):

    parent_folder = os.path.basename(os.path.dirname(path))
    return parent_folder


def get_json_files(directory):

    json_files = [f for f in os.listdir(directory) if f.endswith(".json")]
    return json_files


def read_searchable_text_file(volume_directory, search_text_file):

    # Build the text file on first use; afterwards just read it back.
    if not os.path.isfile(search_text_file):
        return create_searchable_text_file(volume_directory, search_text_file)
    with open(search_text_file, 'r', encoding='utf-8') as file:
        return file.read()


def create_searchable_text_file(volume_directory, search_text_file):

    json_file_names = natsorted(get_json_files(volume_directory))
    output_texts = []
    for json_file_name in json_file_names:
        json_file = os.path.join(volume_directory, json_file_name)
        with open(json_file, 'r', encoding='utf-8') as file:
            data = json.load(file)
        if 'blocks' not in data:
            continue
        page_name = os.path.splitext(json_file_name)[0]
        for block in data['blocks']:
            if 'lines' not in block:
                continue
            # One output line per text block: "<page name><TAB><block text>".
            output_texts.append(f"{page_name}\t{''.join(block['lines'])}")

    output = '\n'.join(output_texts)

    with open(search_text_file, 'w', encoding='utf-8') as file:
        file.write(output)

    return output

def get_image_file_path(volume_directory, image_base_name):

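    # Get the path to the original image. The OCR data lives in
    # <series>/_ocr/<volume>, and the images in <series>/<volume>.
    # dirname/basename keep drive letters intact, so this also works on Windows.
    series_directory = os.path.dirname(os.path.dirname(volume_directory))
    image_directory = os.path.join(series_directory, os.path.basename(volume_directory))
    image_without_extension = os.path.join(image_directory, image_base_name)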
    common_extensions = ['.jpg', '.jpeg', '.png']
    for extension in common_extensions:
        possible_file = image_without_extension + extension
        if os.path.isfile(possible_file):
            return possible_file
    return ''

def html_for_match(match, volume_directory, image_base_name):

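    # Note: line_text is not a parameter; it is the module-level variable set by the search loop below.
    # Forward slashes keep Windows paths from being read as escape sequences in the JavaScript string.
    image_file_path = get_image_file_path(volume_directory, image_base_name).replace('\\', '/')
    line_with_style = line_text[:match.start()] + '<strong>' + match.group() + '</strong>' + line_text[match.end():]
    return f"<li tabindex='0' onfocus='showImage(this, \"{image_file_path}\")'>{image_base_name}: {line_with_style}</li>"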

# Get a list of all _ocr folders.
ocr_folders = natsorted(find_ocr_folders(manga_folder))

output_lines = []

output_lines.append('<link rel="stylesheet" href="styles.css">')
output_lines.append('<script src="script.js"></script>')

output_lines.append('<div id="parent">')
output_lines.append('<div id="matches">')

for ocr_folder in ocr_folders:
    series_name = get_parent_folder_name(ocr_folder)
    match_found_for_series = False
    volume_directories = natsorted(list_subdirectories(ocr_folder))
    for volume_directory in volume_directories:

        volume_name = os.path.basename(volume_directory)
        match_found_for_volume = False

        search_text_file = os.path.join(volume_directory, f'{volume_name} Searchable Text.txt')
        text = read_searchable_text_file(volume_directory, search_text_file)
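        lines = text.strip().split('\n')
        for line in lines:
            # Skip blank or malformed lines (for example, a volume with no OCR text).
            if '\t' not in line:
                continue
            image_base_name, line_text = line.split('\t', 1)
            match = re.search(search, line_text)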
            if match:
                if not match_found_for_series:
                    output_lines.append(f'<h2>{series_name}</h2>')
                    match_found_for_series = True
                if not match_found_for_volume:
                    output_lines.append(f'<h3>{volume_name}</h3>')
                    output_lines.append('<ul>')
                    match_found_for_volume = True
                output_lines.append(html_for_match(match, volume_directory, image_base_name))
        if match_found_for_volume:
            output_lines.append('</ul>')

output_lines.append('</div>')

output_lines.append('<div id="page">')
output_lines.append('<a id="link" target="_blank"><img id="image" /></a>')
output_lines.append('</div>')

output_lines.append('</div>')

output_file = 'results.html'
with open(output_file, 'w', encoding='utf-8') as file:
    file.write('\n'.join(output_lines))

Additional Files

This uses the script.js and styles.css files from the original text search post.

Requirements

Python
natsort

You can install natsort from pip:

pip install natsort

On my setup, this line would be:

pip3.10 install natsort

I imagine there’s a variance for Windows.

Setup and Usage

Modify these two variables in the file to point to where you keep your manga series folders and what you want to search for (regular expression supported):

manga_folder = '/home/chris/Books/Comics/Japanese/'
search = "ありがとうご"
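Because the search value is handed to Python’s re module, it can be a pattern rather than a fixed phrase. The sketch below is only an illustration: the sample sentences are made up, and highlight is a hypothetical helper that mirrors how the script wraps a match in <strong> tags.

```python
import re

# A regular expression: matches either polite ending.
search = "ありがとうご(ざいます|ざいました)"

# To search for text containing regex metacharacters, escape it first.
literal = re.escape("(笑)")


def highlight(pattern, text):
    """Wrap the first match in <strong>, or return None if nothing matches."""
    match = re.search(pattern, text)
    if not match:
        return None
    return text[:match.start()] + '<strong>' + match.group() + '</strong>' + text[match.end():]


print(highlight(search, 'どうもありがとうございました！'))
print(highlight(literal, 'そうだね(笑)'))
print(highlight(search, 'こんにちは'))
```

Characters such as parentheses, periods, and question marks have special meaning in a pattern, so wrapping a plain phrase in re.escape avoids surprises.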

Run the file from a command-line prompt:

python search.py

The first run may take a bit of time while it locates all the _ocr folders. The list is then cached in ocr_folders.txt in the folder you run the script from, so subsequent runs skip this step. (Delete that file after adding new manga so the new folders are found.)

Once the search completes, it will generate a file called results.html.

Caveats

Filesystem Structure

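This expects that you have your manga images in per-volume folders, and that the volume folders are in per-series folders, such as:

- Dragon Ball
  - Dragon Ball Volume 1
  - Dragon Ball Volume 2
- Sailormoon
  - Sailormoon Volume 1
  - Sailormoon Volume 2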

Windows Support

I haven’t tried it out on Windows yet, but I wrote the code to be filesystem agnostic.
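As a rough illustration of what “filesystem agnostic” involves here, the sketch below maps an OCR volume folder (<series>/_ocr/<volume>) to its image folder (<series>/<volume>) using only dirname, basename, and join. The function name and the path_module parameter are hypothetical; passing posixpath or ntpath explicitly just lets both path styles be checked on any machine, where real code would use os.path.

```python
import ntpath     # Windows path rules
import posixpath  # Linux/macOS path rules


def image_folder_for(ocr_volume_folder, path_module):
    """Map <series>/_ocr/<volume> to <series>/<volume>."""
    series_folder = path_module.dirname(path_module.dirname(ocr_volume_folder))
    volume_name = path_module.basename(ocr_volume_folder)
    return path_module.join(series_folder, volume_name)


linux = image_folder_for(
    '/home/chris/Books/Comics/Japanese/Dragon Ball/_ocr/Dragon Ball Volume 1',
    posixpath,
)
windows = image_folder_for(
    r'C:\Users\user\Desktop\Set7\Examples\Manga\One Piece\_ocr\One Piece v105',
    ntpath,
)
print(linux)
print(windows)
```

Splitting a path on the separator and re-joining it by hand tends to lose the drive letter on Windows, which is why this version leaves that work to the path module.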

Pomidorka20142 · August 6, 2023, 4:16am

Thank you VERY much guys for all the support.

I changed them to double backslashes in the Ruby file and made the Python file. This is what I get when I try to run search.rb:

C:\Users\user\Desktop\Set7\Examples>ruby search.rb
<internal:G:/Programs/Ruby32-x64/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require': cannot load such file -- naturally (LoadError)
        from <internal:G:/Programs/Ruby32-x64/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
        from search.rb:1:in `<main>'

This is what I get when I try to run search.py:

C:\Users\user\Desktop\Set7\Examples>python search.py
  File "C:\Users\user\Desktop\Set7\Examples\search.py", line 130
    output_lines.append(f'<ruby lang = 'ja-JP'>h2<rp>(</rp><rt><span class='spoiler'>series_name</span></rt><rp>)</rp></ruby></h2>')
                        ^^^^^^^^^^^^^^^^^^^^^
SyntaxError: invalid syntax. Perhaps you forgot a comma?

Gorbit99 · August 6, 2023, 4:28am

@ChristopherFritz

Those definitely seem somewhat strange there, maybe they were originally using double quotes?

@Pomidorka20142 this should work for the python version:
Use what ChristopherFritz fixed

As for the Ruby error, it means you are missing the “naturally” gem. If you have Ruby set up properly, gem install naturally should do the trick.

ChristopherFritz · August 6, 2023, 4:35am

The Python one is fixed now. Please copy from the post and try again!

I use a userscript that adds furigana to kanji on the forums, and it tried converting some HTML into ruby… The ruby part shouldn’t be there.

Gorbit99 · August 6, 2023, 4:39am

Lol, now I see, that’s a source of error for sure. Now comes the time when you start to worry whether other code you’ve posted had this exact thing happen to it, but nobody has tried it yet.

Pomidorka20142 · August 6, 2023, 5:33am

And still both ways somehow don’t work…

That’s what I got now:

C:\Users\user\Desktop\Set7\Examples\Manga>python search.py
Traceback (most recent call last):
  File "C:\Users\user\Desktop\Set7\Examples\Manga\search.py", line 123, in <module>
    text = read_searchable_text_file(volume_directory, search_text_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\Desktop\Set7\Examples\Manga\search.py", line 53, in read_searchable_text_file
    return create_searchable_text_file(volume_directory, search_text_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\Desktop\Set7\Examples\Manga\search.py", line 67, in create_searchable_text_file
    data = json.load(file)
           ^^^^^^^^^^^^^^^
  File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
                 ^^^^^^^^^
  File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1251.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 316: character maps to <undefined>

As for the ruby one, now it creates results.html but it’s empty. Here is my folder structure:

I’m trying this on Volume 105 of One Piece. On page 18 I have ちと拙者に剣術をおしえて with furigana, so I’m trying to search for 拙者.

# This needs to be manually changed to whatever I want to search for.
search = /拙者/

base_folder = 'C:\\Users\\user\\Desktop\\Set7\\Examples\\Manga'

manga_folder = "C:\\Users\\user\\Desktop\\Set7\\Examples\\Manga\\#{series}\\#{volume}"

Results.html:

<link rel="stylesheet" href="styles.css"><script src="script.js"></script><div id="parent"><div id="matches"></div><div id="page"><a id="link" target="_blank"><img id="image" /></a></div></div>

ChristopherFritz · August 6, 2023, 5:44am

This may be a bit inconvenient, but would it be possible for you to create a “One Piece” folder, then move (or copy) the “One Piece v105” folder and its “_ocr” folder into that, then try again?

The reason is that this code was originally designed for my personal use, so it expects my folder structure of “series folder” then “volume folder”.
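For anyone checking their own layout, here is a small sketch that builds the expected structure in a temporary folder and shows what the script would see. The folder names are taken from this thread; find_ocr_folders is a simplified stand-in for the function of the same name in search.py (no caching).

```python
import os
import tempfile


def find_ocr_folders(directory):
    """Return every '_ocr' folder found below directory."""
    return sorted(
        os.path.join(dirpath, '_ocr')
        for dirpath, dirnames, _ in os.walk(directory)
        if '_ocr' in dirnames
    )


with tempfile.TemporaryDirectory() as manga_folder:
    # Images go in <series>/<volume>/, OCR data in <series>/_ocr/<volume>/.
    for parts in [('One Piece', 'One Piece v105'),
                  ('One Piece', '_ocr', 'One Piece v105')]:
        os.makedirs(os.path.join(manga_folder, *parts))

    found = [os.path.relpath(p, manga_folder) for p in find_ocr_folders(manga_folder)]
    series = [os.path.basename(os.path.dirname(p)) for p in found]
    volumes = os.listdir(os.path.join(manga_folder, found[0]))

print(found, series, volumes)
```

The series name comes from the folder that contains _ocr, and each subfolder of _ocr is treated as one volume, so a volume folder placed directly in the manga folder has no series to belong to.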

It’s almost working! It’s just a matter of having the files where they will be seen.

There may still be an issue of needing to create the “Searchable Text” file before it will work, but we can cross that bridge later.

I’ve seen this error when I used Python 2.7, but I’m surprised to see it with Python 3.11. I’m on Python 3.10 and don’t get this error, so I’ll need to look into it when I have a chance. (I may be low on time for the next week, however.)

Pomidorka20142 · August 6, 2023, 6:33am

Tried it. No luck, unfortunately.

Uninstalled Python 3.11, installed 3.10, still the same problem.

Gorbit99 · August 6, 2023, 6:38am

Imagine reading more than 1 volume of a manga. Couldn’t be me.

ChristopherFritz · August 6, 2023, 6:39am

I’ve updated the code in my prior post to specify the character set (encoding='utf-8') when opening files to read and write.

If you update the code from that prior post above, that may resolve the charmap issue.
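For background, open() without an encoding argument falls back to the system’s preferred encoding, which the traceback above shows to be cp1251 on that machine. The sketch below reproduces the failure by requesting cp1251 explicitly, so it behaves the same on any OS; the file name and sample character are made up for illustration.

```python
import os
import tempfile

# The UTF-8 encoding of this character is E3 81 98, and byte 0x98 has no
# meaning in cp1251, which is what triggers the decode error.
text = 'じ'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'page.json')
    with open(path, 'w', encoding='utf-8') as file:
        file.write(text)

    # Simulate a Windows machine whose default code page is cp1251.
    try:
        with open(path, 'r', encoding='cp1251') as file:
            file.read()
        failed = False
    except UnicodeDecodeError:
        failed = True

    # Naming the encoding explicitly reads the file correctly everywhere.
    with open(path, 'r', encoding='utf-8') as file:
        round_trip = file.read()

print(failed, round_trip == text)
```

This is also why the Python version number made no difference: the default depends on the Windows locale, not on whether 3.10 or 3.11 is installed.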

ChristopherFritz · August 6, 2023, 6:40am

I only stopped buying/reading it because I switched to reading in Japanese with volume 1.

ncuh · August 11, 2023, 8:26am

Hey, I’m too stupid. Can somebody make a gui for this?=)
P.S. Does it really work with migaku? I got yotsuba html file(with manga ofc) from themoeway website, that used mokuro(mb, I’m not sure). It hardly works with migaku.
P.P.S. If somebody shared it with selectable text, does it mean, that u guys can share your manga too?=))
Yotsuba 1Vol is just 62mb. Mb somebody could make a downloadable library?

Gorbit99 · August 11, 2023, 8:44am

Sharing manga, or even just the text of manga, would be piracy, which is against the rules of this forum. It’s a bit of wasted effort, but everyone has to de-DRM and OCR their own copies.

ChristopherFritz · August 11, 2023, 1:48pm

When you say “hardly”, does that mean it worked a little bit but not as fully as expected?

Or that it did not work at all?

Migaku doesn’t parse hidden content, so you need to hover your mouse over a textbox to review the OCR’d text then press the tilde (~) key on your keyboard for it to parse:

Image:
Screenshot_20230811_064252

Mouse-over then pressing ~ key:
Screenshot_20230811_064413

I know someone on the Migaku Discord wrote a script that allows Migaku to parse the whole manga volume (which could also be good for generating statistics), but I don’t know who or if they have a copy online somewhere.

ncuh · August 13, 2023, 2:31am

OMG=)
Ty, I was actually trying to just click shift here and there. Sometimes something appeared, and it was so sloppy, that’s why I called it “hardly”.

ChristopherFritz · August 13, 2023, 2:34am

I’ve observed the same. I’ll shift-click and nothing happens, but maybe after a couple of seconds it pops up the window.

Something else to know is that sometimes after a certain page, pressing tilde no longer works, although it may start working again on a later page. I haven’t looked into why this happens, and I haven’t tried making a test case to submit alongside a bug report to Mokuro’s GitHub page.

Overall, it’s better than nothing!

ncuh · August 13, 2023, 2:38am

Yeah, my problem was I didn’t parse the page-)
P.S. I tried this Mokuro Manga.ipynb - Colaboratory (google.com) and it worked fine. So people with no intelligence like me can use mokuro.
