use either double backslashes (\\) or forward slashes (/) in your path. What the error is telling you is that the interpreter is trying to interpret \U, for example, as an escape sequence, and fails.
As Gorbit said. Or rewrite the above to:
C:\\Users\\user\\Desktop\\Set7\\Examples\\Man
Same with base_folder
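To illustrate the escape-sequence point, here is a small Python sketch (Ruby's rules differ in the details). It reuses the Manga path posted later in this thread, and it uses pathlib.PureWindowsPath only so that the comparison behaves the same on any operating system. Treat it as a demonstration, not as part of either script.

```python
from pathlib import PureWindowsPath

# Three spellings of the same Windows path.
escaped = 'C:\\Users\\user\\Desktop\\Set7\\Examples\\Manga'   # doubled backslashes
raw = r'C:\Users\user\Desktop\Set7\Examples\Manga'            # raw string, no escapes processed
forward = 'C:/Users/user/Desktop/Set7/Examples/Manga'         # forward slashes

# Doubled backslashes and a raw string produce the identical string.
same_string = (escaped == raw)

# Windows treats forward slashes and backslashes as the same separator.
same_path = (PureWindowsPath(escaped) == PureWindowsPath(forward))

# A single backslash before "U" starts a \UXXXXXXXX escape, which fails to parse.
try:
    compile(r"path = 'C:\Users\user'", "<example>", "exec")
    single_backslash_ok = True
except SyntaxError:
    single_backslash_ok = False

print(same_string, same_path, single_backslash_ok)
```

Any of the three working spellings should do for manga_folder or base_folder; the raw string is the least typing if you are pasting a path from Explorer.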
I hope to test this a bit on Windows this weekend.
I had planned to do that tonight, but setting up DKIM on an e-mail server took about three hours longer than the 5 mins I expected it to take…
Manga Text Search: Python Version
This is a re-write of the text search script, this one written in Python.
This one creates the “Searchable Text” files, so a separate process to extract the text is not needed.
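As a rough sketch of what a “Searchable Text” file contains: each text block becomes one line made of the page name, a tab, and the block's text. The page file names and the sample data below are made up for illustration; only the 'blocks' and 'lines' keys come from the script.

```python
# Hypothetical OCR data for two pages, keyed by JSON file name.
pages = {
    '0018.json': {'blocks': [{'lines': ['ちと拙者に', '剣術をおしえて']}]},
    '0019.json': {'blocks': []},  # a page with no text produces no lines
}

def to_searchable_lines(pages):
    """Flatten page data into 'page name<TAB>block text' lines."""
    lines = []
    for file_name, data in pages.items():
        page_name = file_name.rsplit('.', 1)[0]
        for block in data.get('blocks', []):
            if 'lines' in block:
                lines.append(f"{page_name}\t{''.join(block['lines'])}")
    return lines

searchable = to_searchable_lines(pages)
print(searchable)
```

Keeping the page name on every line is what lets a match be linked back to its image later.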
Main File
search.py
from natsort import natsorted
import json, os, re

manga_folder = '/home/chris/Books/Comics/Japanese/'
search = "ありがとうご"

# Be sure to install the following:
# pip install natsort

# Note: This assumes manga volume folders are organized into series folders, such as:
# - Dragon Ball
#   - Dragon Ball Volume 1
#   - Dragon Ball Volume 2
# - Sailormoon
#   - Sailormoon Volume 1
#   - Sailormoon Volume 2
def find_ocr_folders(directory):
    cache_file = 'ocr_folders.txt'
    subfolders_with_name = []
    if os.path.exists(cache_file):
        # Reuse the folder list cached by an earlier run.
        with open(cache_file, 'r', encoding='utf-8') as file:
            subfolders_with_name = [line.strip() for line in file.readlines()]
    else:
        for dirpath, dirnames, filenames in os.walk(directory):
            if '_ocr' in dirnames:
                subfolders_with_name.append(os.path.join(dirpath, '_ocr'))
        # Folder names may contain Japanese, so write the cache as UTF-8.
        with open(cache_file, 'w', encoding='utf-8') as file:
            file.write('\n'.join(subfolders_with_name))
    return subfolders_with_name
def list_subdirectories(directory):
    subdirectories = []
    for entry in os.listdir(directory):
        entry_path = os.path.join(directory, entry)
        if os.path.isdir(entry_path):
            subdirectories.append(entry_path)
    return subdirectories

def get_parent_folder_name(path):
    parent_folder = os.path.basename(os.path.dirname(path))
    return parent_folder

def get_json_files(directory):
    json_files = [f for f in os.listdir(directory) if f.endswith(".json")]
    return json_files
def read_searchable_text_file(volume_directory, search_text_file):
    # Build the "Searchable Text" file on first use; afterwards just read it back.
    if not os.path.isfile(search_text_file):
        return create_searchable_text_file(volume_directory, search_text_file)
    else:
        with open(search_text_file, 'r', encoding='utf-8') as file:
            return file.read()
def create_searchable_text_file(volume_directory, search_text_file):
    json_file_names = natsorted(get_json_files(volume_directory))
    output_texts = []
    # One output line per text block: "<page name><TAB><block text>".
    for json_file_name in json_file_names:
        json_file = os.path.join(volume_directory, json_file_name)
        with open(json_file, 'r', encoding='utf-8') as file:
            data = json.load(file)
        if 'blocks' not in data:
            continue
        for block in data['blocks']:
            if 'lines' not in block:
                continue
            output_texts.append(f"{os.path.splitext(json_file_name)[0]}\t{''.join(block['lines'])}")
    output = '\n'.join(output_texts)
    with open(search_text_file, 'w', encoding='utf-8') as file:
        file.write(output)
    return output
def get_image_file_path(volume_directory, image_base_name):
    # Get the path to the original image.
    # volume_directory is <series>/_ocr/<volume>; the images live in <series>/<volume>.
    # Using dirname/basename (rather than splitting on the separator) keeps
    # Windows drive letters such as C:\ intact.
    series_directory = os.path.dirname(os.path.dirname(volume_directory))
    image_directory = os.path.join(series_directory, os.path.basename(volume_directory))
    image_without_extension = os.path.join(image_directory, image_base_name)
    common_extensions = ['.jpg', '.jpeg', '.png']
    for extension in common_extensions:
        possible_file = image_without_extension + extension
        if os.path.isfile(possible_file):
            return possible_file
    return ''
def html_for_match(match, volume_directory, image_base_name, line_text):
    image_file_path = get_image_file_path(volume_directory, image_base_name)
    line_with_style = line_text[:match.start()] + '<strong>' + match.group() + '</strong>' + line_text[match.end():]
    return f"<li tabindex='0' onfocus='showImage(this, \"{image_file_path}\")'>{image_base_name}: {line_with_style}</li>"

# Get a list of all _ocr folders.
ocr_folders = natsorted(find_ocr_folders(manga_folder))

output_lines = []
output_lines.append('<link rel="stylesheet" href="styles.css">')
output_lines.append('<script src="script.js"></script>')
output_lines.append('<div id="parent">')
output_lines.append('<div id="matches">')

for ocr_folder in ocr_folders:
    series_name = get_parent_folder_name(ocr_folder)
    match_found_for_series = False
    volume_directories = natsorted(list_subdirectories(ocr_folder))
    for volume_directory in volume_directories:
        volume_name = os.path.basename(volume_directory)
        match_found_for_volume = False
        search_text_file = os.path.join(volume_directory, f'{volume_name} Searchable Text.txt')
        text = read_searchable_text_file(volume_directory, search_text_file)
        lines = text.strip().split('\n')
        for line in lines:
            # Skip lines without a tab (for example, a volume with no OCR text).
            if '\t' not in line:
                continue
            image_base_name, line_text = line.split('\t', 1)
            match = re.search(search, line_text)
            if match:
                if not match_found_for_series:
                    output_lines.append(f'<h2>{series_name}</h2>')
                    match_found_for_series = True
                if not match_found_for_volume:
                    output_lines.append(f'<h3>{volume_name}</h3>')
                    output_lines.append('<ul>')
                    match_found_for_volume = True
                output_lines.append(html_for_match(match, volume_directory, image_base_name, line_text))
        if match_found_for_volume:
            output_lines.append('</ul>')

output_lines.append('</div>')
output_lines.append('<div id="page">')
output_lines.append('<a id="link" target="_blank"><img id="image" /></a>')
output_lines.append('</div>')
output_lines.append('</div>')

output_file = 'results.html'
with open(output_file, 'w', encoding='utf-8') as file:
    file.write('\n'.join(output_lines))
Additional Files
This uses the script.js and styles.css files from the original text search post.
Requirements
- Python
- natsort
You can install natsort from pip:
pip install natsort
On my setup, this line would be:
pip3.10 install natsort
I imagine the command differs a bit on Windows.
Setup and Usage
Modify these two variables in the file to point to where you keep your manga series folders and what you want to search for (regular expression supported):
manga_folder = '/home/chris/Books/Comics/Japanese/'
search = "ありがとうご"
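Since the search value is handed to re.search, here is a short sketch of what “regular expression supported” means in practice. The sample sentences and the alternation pattern are my own examples, not taken from any manga; the highlighting mirrors what the script does with a match.

```python
import re

# A pattern with alternatives: matches either polite ending.
search = "ありがとうご(ざいます|ざいました)"
line_text = "どうもありがとうございました!"

match = re.search(search, line_text)

# Wrap the matched part in <strong>, the same way the results page does.
highlighted = (line_text[:match.start()] + '<strong>' + match.group()
               + '</strong>' + line_text[match.end():])

# Characters such as ? have a special meaning in a pattern.
# Here ? makes the preceding に optional, so the question mark itself is not matched.
unescaped = re.search("なに?", "え、なに?").group()

# re.escape makes the whole string match literally, question mark included.
escaped = re.search(re.escape("なに?"), "え、なに?").group()

print(highlighted)
print(unescaped, escaped)
```

If you only ever search for plain words, you can ignore all of this; it matters once the search text contains characters like ?, ., ( or ).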
Run the file from a command-line prompt:
python search.py
The first run may take a bit of time while it locates all the _ocr folders. (If subsequent runs take too long, let me know and I can look into caching the _ocr folder list.)
Once the search completes, it will generate a file called results.html.
Caveats
Filesystem Structure
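This expects that you have your manga images in per-volume folders, and that the volume folders are in per-series folders, such as:
- Dragon Ball
  - Dragon Ball Volume 1
  - Dragon Ball Volume 2
- Sailormoon
  - Sailormoon Volume 1
  - Sailormoon Volume 2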
Windows Support
I haven’t tried it out on Windows yet, but I wrote the code to be filesystem agnostic.
Thank you VERY much guys for all the support.
I changed them to double backslashes in the Ruby file and made the Python file. This is what I get when I try to run search.rb:
C:\Users\user\Desktop\Set7\Examples>ruby search.rb
<internal:G:/Programs/Ruby32-x64/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require': cannot load such file -- naturally (LoadError)
from <internal:G:/Programs/Ruby32-x64/lib/ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
from search.rb:1:in `<main>'
This is what I get when I try to run search.py:
C:\Users\user\Desktop\Set7\Examples>python search.py
File "C:\Users\user\Desktop\Set7\Examples\search.py", line 130
output_lines.append(f'<ruby lang = 'ja-JP'>h2<rp>(</rp><rt><span class='spoiler'>series_name</span></rt><rp>)</rp></ruby></h2>')
^^^^^^^^^^^^^^^^^^^^^
SyntaxError: invalid syntax. Perhaps you forgot a comma?
@ChristopherFritz

Those definitely seem somewhat strange there, maybe they were originally using double quotes?
@Pomidorka20142 this should work for the python version:
Use what ChristopherFritz fixed
As for this, it means that you are missing the “naturally” gem. If you have Ruby set up properly, gem install naturally should do the trick.
The Python one is fixed now. Please copy from the post and try again!
I use a userscript that adds furigana to kanji on the forums, and it tried converting some HTML into ruby… The ruby part shouldn’t be there.
Lol, now I see, that’s a source of error for sure. Now comes the time when you start to worry whether other code you’ve posted had this exact thing happen to it, but nobody has tried it yet.
And still both ways somehow don’t work…
That’s what I got now:
C:\Users\user\Desktop\Set7\Examples\Manga>python search.py
Traceback (most recent call last):
File "C:\Users\user\Desktop\Set7\Examples\Manga\search.py", line 123, in <module>
text = read_searchable_text_file(volume_directory, search_text_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\Desktop\Set7\Examples\Manga\search.py", line 53, in read_searchable_text_file
return create_searchable_text_file(volume_directory, search_text_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\Desktop\Set7\Examples\Manga\search.py", line 67, in create_searchable_text_file
data = json.load(file)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\json\__init__.py", line 293, in load
return loads(fp.read(),
^^^^^^^^^
File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1251.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 316: character maps to <undefined>
As for the ruby one, now it creates results.html but it’s empty. Here is my folder structure:
I’m trying this on Volume 105 of One Piece. On page 18 I have ちと拙者に剣術をおしえて with furigana, so I’m trying to search for 拙者.
# This needs to be manually changed to whatever I want to search for.
search = /拙者/
base_folder = 'C:\\Users\\user\\Desktop\\Set7\\Examples\\Manga'
manga_folder = "C:\\Users\\user\\Desktop\\Set7\\Examples\\Manga\\#{series}\\#{volume}"
Results.html:
<link rel="stylesheet" href="styles.css"><script src="script.js"></script><div id="parent"><div id="matches"></div><div id="page"><a id="link" target="_blank"><img id="image" /></a></div></div>
This may be a bit inconvenient, but would it be possible for you to create a “One Piece” folder, then move (or copy) the “One Piece v105” folder and its “_ocr” folder into that, then try again?
The reason is that this code was originally designed for my personal use, so it expects my folder structure of “series folder” then “volume folder”.
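To make the expected layout concrete, here is a small sketch that builds the “One Piece” structure in a temporary folder and then walks it roughly the way the script does. The describe_layout helper is hypothetical and only for illustration; the folder names are the ones from this thread.

```python
import os
import tempfile

def describe_layout(manga_folder):
    """Return (series, volume) pairs for every volume folder found under an _ocr folder."""
    found = []
    for dirpath, dirnames, _filenames in os.walk(manga_folder):
        if '_ocr' in dirnames:
            # The folder that contains _ocr is treated as the series folder.
            series = os.path.basename(dirpath)
            ocr_folder = os.path.join(dirpath, '_ocr')
            for entry in sorted(os.listdir(ocr_folder)):
                if os.path.isdir(os.path.join(ocr_folder, entry)):
                    found.append((series, entry))
    return found

with tempfile.TemporaryDirectory() as manga_folder:
    # <manga_folder>/One Piece/One Piece v105        (images)
    # <manga_folder>/One Piece/_ocr/One Piece v105   (OCR JSON files)
    os.makedirs(os.path.join(manga_folder, 'One Piece', 'One Piece v105'))
    os.makedirs(os.path.join(manga_folder, 'One Piece', '_ocr', 'One Piece v105'))
    layout = describe_layout(manga_folder)

print(layout)
```

If the “_ocr” folder sits somewhere the walk does not associate with a series folder, the volume is either missed or filed under the wrong name, which would explain an empty results.html.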
It’s almost working! It’s just a matter of having the files where they will be seen.
There may still be an issue of needing to create the “Searchable Text” file before it will work, but we can cross that bridge later.
I’ve seen this error when I used Python 2.7, but I’m surprised to see it with Python 3.11. I’m on Python 3.10 and don’t get this error, so I’ll need to look into it when I have a chance. (I may be low on time for the next week, however.)
Imagine reading more than 1 volume of a manga. Couldn’t be me.
I’ve updated the code in my prior post to specify the character set (encoding='utf-8') when opening files for reading and writing.
If you update the code from that prior post above, that may resolve the charmap issue.
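For anyone curious why the error appears on one machine and not another: when open() is called without an encoding, Python uses the system locale's encoding, which per the traceback was cp1251 on that Windows setup. The sketch below reproduces the failure with a made-up page file; the sample text is chosen only because its UTF-8 bytes include 0x98, which cp1251 does not define.

```python
import json
import os
import tempfile

# Hypothetical page data shaped like the 'blocks'/'lines' JSON the script reads.
page = {"blocks": [{"lines": ["じゃあ"]}]}

with tempfile.TemporaryDirectory() as folder:
    json_path = os.path.join(folder, "page.json")

    # Write the file as UTF-8 with the Japanese text stored as-is.
    with open(json_path, "w", encoding="utf-8") as file:
        json.dump(page, file, ensure_ascii=False)

    # Reading it back as cp1251 fails, like in the traceback above.
    try:
        with open(json_path, "r", encoding="cp1251") as file:
            json.load(file)
        legacy_read_failed = False
    except UnicodeDecodeError:
        legacy_read_failed = True

    # Reading it back as UTF-8 works regardless of the system locale.
    with open(json_path, "r", encoding="utf-8") as file:
        loaded = json.load(file)

print(legacy_read_failed, loaded == page)
```

So the difference is the operating system's default encoding rather than Python 3.10 versus 3.11.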
Hey, I’m too stupid. Can somebody make a gui for this?=)
P.S. Does it really work with migaku? I got yotsuba html file(with manga ofc) from themoeway website, that used mokuro(mb, I’m not sure). It hardly works with migaku.
P.P.S. If somebody shared it with selectable text, does it mean, that u guys can share your manga too?=))
Yotsuba 1Vol is just 62mb. Mb somebody could make a downloadable library?
Sharing manga, or even just the text of manga, would be piracy, which is against the rules of this forum. It’s a bit of wasted effort, but everyone has to de-DRM and OCR their own copies.
When you say “hardly”, does that mean it worked a little bit but not as fully as expected?
Or that it did not work at all?
Migaku doesn’t parse hidden content, so you need to hover your mouse over a textbox to review the OCR’d text then press the tilde (~) key on your keyboard for it to parse:
Image:

Mouse-over then pressing ~ key:

I know someone on the Migaku Discord wrote a script that allows Migaku to parse the whole manga volume (which could also be good for generating statistics), but I don’t know who or if they have a copy online somewhere.
OMG=)
Ty, I was actually trying to just click shift here and there. Sometimes something appeared, and it was so sloppy, that’s why I called it “hardly”.
I’ve observed the same. I’ll shift-click and nothing happens, but maybe after a couple of seconds it pops up the window.
Something else to know is that sometimes after a certain page, pressing tilde no longer works, although it may start working again on a later page. I haven’t looked into why this happens, and I haven’t tried making a test case to submit alongside a bug report to Mokuro’s GitHub page.
Overall, it’s better than nothing!
Yeah, my problem was I didn’t parse the page-)
P.S. I tried the “Mokuro Manga.ipynb” notebook on Google Colaboratory and it worked fine. So people with no intelligence like me can use mokuro.





