Mokuro: Read Japanese manga with selectable text inside a browser

Manga Text Search

Update: I recommend using the Python implementation rather than this Ruby version.

This one is not currently available, but I can look into making it usable if there is interest. Right now it is tied to my directory structure and spans multiple programming languages.

The first part is a Ruby script that reads the JSON files Mokuro created in a volume folder.

This is the main function:
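require 'json'

# Write one line per Mokuro text block: page file name, a tab, then the block's text.
def extract_text_with_image_names(manga_path, json_files)

    output_file = File.join(manga_path, "#{File.basename(manga_path)} Searchable Text.txt")
    # Skip volumes whose text file already exists.
    return if File.file?(output_file)

    File.open(output_file, "w+") do |f|
        json_files.each do |file|
            # The JSON file shares its base name with the page image.
            image_file_name = File.basename(file, File.extname(file))
            # File.read closes the file once it has been read.
            parsed = JSON.parse(File.read(file), symbolize_names: true)
            parsed[:blocks].each do |block|
                # Join the block's OCR lines into a single string.
                f.puts("#{image_file_name}\t#{block[:lines].join}")
            end
        end
    end

end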

This generates a single text file for the volume. Each line holds the file name (without extension) of a page, a tab, and the text of one block that Mokuro extracted from that page.
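To make the format concrete, here is a minimal sketch of how one page's JSON maps to lines in that file. The sample JSON is made up and trimmed to the only fields the function reads (`blocks` and `lines`); real Mokuro files are assumed to carry more fields. `page_to_lines` is a hypothetical helper for illustration, not part of the original script.

```ruby
require 'json'

# Made-up stand-in for one Mokuro page file, reduced to the fields used here.
page_json = '{"blocks": [{"lines": ["何だ", "今のは"]}, {"lines": ["えっ"]}]}'

# Hypothetical helper: return one "page<TAB>text" string per text block.
def page_to_lines(image_file_name, json_text)
  parsed = JSON.parse(json_text, symbolize_names: true)
  parsed[:blocks].map { |block| "#{image_file_name}\t#{block[:lines].join}" }
end

# Prints one line per block, each prefixed with the page name and a tab.
puts page_to_lines('0012', page_json)
```

Because each block's lines are joined before writing, a search pattern can match text that wraps across lines inside a speech bubble.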

The file is saved into the same folder as the JSON files, but I also have a separate process that copies it to a central location where all these text files are stored.
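That copy step isn't shown in this post. A minimal sketch of what it could look like, assuming the volume folder sits directly inside a series folder and that the central location groups files by series (which is what the search script's path handling suggests). `copy_search_file` and its parameters are hypothetical names.

```ruby
require 'fileutils'

# Hypothetical helper: copy a volume's "Searchable Text" file into
# <central_folder>/<series>/, creating the series folder if needed.
# Returns the path of the copy, or nil if the volume has no text file yet.
def copy_search_file(volume_path, central_folder)
  volume = File.basename(volume_path)
  # Assumption: the parent of the volume folder is the series folder.
  series = File.basename(File.dirname(volume_path))
  source = File.join(volume_path, "#{volume} Searchable Text.txt")
  return nil unless File.file?(source)

  target_dir = File.join(central_folder, series)
  FileUtils.mkdir_p(target_dir)
  FileUtils.cp(source, target_dir)
  File.join(target_dir, File.basename(source))
end
```

Calling it once per volume folder after the extraction step would keep the central folder in sync with newly processed volumes.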

The manga images live in a fixed location, organized as series folders with volume subfolders, and all the text files are copied under a single base folder. With that in place, I use a second Ruby script to search through the text files and generate an HTML file with my results:

search.rb
require 'naturally'

# This needs to be manually changed to whatever I want to search for.
search = /何.{0,3}今の/

def output_match(match, manga_folder, search)
  image_filename, line_text = match.split("\t")
  image_file = "#{manga_folder}/#{image_filename}.jpg"
  image_file = "#{manga_folder}/#{image_filename}.jpeg" unless File.file?(image_file)
  unless File.file?(image_file)
    puts "Cannot find image file: #{image_file}"
    puts 'Maybe its extension is not .jpg or .jpeg?'
    return ''
  end
  line_text_with_html = line_text.gsub(/(#{search})/, '<strong>\1</strong>').chomp
  "<li tabindex='0' onfocus='showImage(this, \"#{image_file}\")'>#{image_filename}: #{line_text_with_html}</li>\n"
end

base_folder = '/home/chris/Documents/OCR/OCR Process/Outputs/OCR'

output = ''

output += '<link rel="stylesheet" href="styles.css">'
output += '<script src="script.js"></script>'

output += '<div id="parent">'
output += '<div id="matches">'

current_series = ''

files_to_check = Dir.glob("#{base_folder}/**/*.txt")
sorted_files_to_check = Naturally.sort(files_to_check)

sorted_files_to_check.each do |file_to_check|
  matches = []
  IO.foreach(file_to_check) do |line|
    matches.append(line) if line =~ search
  end
  next if matches.empty?

  # puts file_to_check
  series, volume = file_to_check.sub("#{base_folder}/", '').sub(' Searchable Text.txt', '').split('/')

  manga_folder = "/home/chris/Books/Comics/Japanese/#{series}/#{volume}"
  unless File.directory?(manga_folder)
    puts "Cannot find manga folder: #{manga_folder}"
    puts 'Implement checking other known locations.'
    next
  end

  if series != current_series
    output += "<h2>#{series}</h2>"
    current_series = series
  end
  output += "<h3>#{volume}</h3>"

  output += "<ul>\n"
  matches.each do |match|
    output += output_match(match, manga_folder, search)
  end
  output += "</ul>\n"
end
output += '</div>'

output += '<div id="page">'
output += '<a id="link" target="_blank"><img id="image" /></a>'
output += '</div>'

output += '</div>'

File.write('results.html', output)
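
The `output_match` function above only tries `.jpg` and `.jpeg`, and its own message hints that other extensions may turn up. One way to generalize the lookup is sketched below; `find_image_file` and the list of extensions are illustrative choices, not part of the original script.

```ruby
# Hypothetical helper: return the first existing image path for a page,
# trying each extension in order, or nil if none of them exists.
def find_image_file(manga_folder, image_filename, extensions = %w[.jpg .jpeg .png .webp])
  extensions
    .map { |ext| File.join(manga_folder, "#{image_filename}#{ext}") }
    .find { |path| File.file?(path) }
end
```

With a helper like this, `output_match` could return an empty string whenever the result is `nil` instead of checking each extension by hand.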

Two files accompany the generated results.html:

script.js
var lastCopiedText = '';

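// Show the page image for the focused match and copy its path to the clipboard.
function showImage(element, imagePath) {
	console.log(`ShowImage: ${imagePath}`);
	document.getElementById("link").href = imagePath;
	document.getElementById("image").src = imagePath;
	// Avoid rewriting the clipboard when the same page is focused again.
	if (lastCopiedText !== imagePath) {
		navigator.clipboard.writeText(imagePath);
		lastCopiedText = imagePath;
	}
}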

styles.css
a {text-decoration: none; color: black;}

h2 {
position: sticky;
background-color: white;
top: 0;
}

#parent {
display: flex;
justify-content: center;
height: 99%;
}

#matches {
width: 600px;
height: 99%;
overflow: auto;
}

#page {
width: 600px;
height: 99%;
}

ul {
list-style: none;
padding-left: 0;
}
li {
border: solid thin beige;
padding: 1px 2px;
cursor: pointer;
}
li:hover { background: lightblue; }
li:focus { background: pink; }

#image {width: inherit; position: sticky; max-height: 100%;}

strong {color: #ff003c; font-weight: normal;}

With those in place, I change the search term in the Ruby file, re-run it, and open the generated HTML file in a web browser.
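
Editing the script for each search works, but the pattern could also come from the command line. A minimal sketch, assuming the pattern is passed as the first argument; `build_search` is a hypothetical helper, not something from the original script.

```ruby
# Hypothetical helper: build the search expression from a command-line
# argument instead of editing the script for every search.
def build_search(pattern)
  raise ArgumentError, 'usage: ruby search.rb PATTERN' if pattern.nil? || pattern.empty?

  Regexp.new(pattern)
end

# Falls back to the example pattern when no argument is given.
search = build_search(ARGV.first || '何.{0,3}今の')
puts "Searching for #{search.inspect}"
```

The script could then be run as `ruby search.rb '何.{0,3}今の'`, with the rest of search.rb using `search` exactly as before.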

This gives a page where I can easily browse the manga pages (from my collection) that contain what I’m looking for: matches are listed on the left, and focusing one shows its page on the right.

(For anyone wondering how I can pull examples of any random vocabulary or grammar from several manga at a moment’s notice in book clubs…this is it.)
