Physical books to e-book

MosbyNeko · December 19, 2019, 6:50pm

For 1dollarscan, I believe you have to pay extra for OCR unless you subscribe to one of their monthly membership plans, which include most of the bells and whistles at no additional cost (including expedited scanning, if I remember right). The few times I’ve used them, it came out less expensive and faster to sign up for the smaller membership plan, scan a few books, and then cancel that same month. (I’d normally feel a little weird doing this, but they explicitly suggested it on their membership page.)

I doubt they’re doing any manual editing at that price, but just to be clear, the output is always a PDF that includes scanned images of each page. If you include the OCR option, they embed the OCR data in the PDF so that you can search it, select text, etc. So even if there are mistakes in the OCR data, it doesn’t really affect readability, since you’re looking at an image of the original page, not the OCR’d text directly, if that makes sense?

I think they have a separate service where you can have them convert the OCR’d output to other text formats, but I’m not sure what the cost is. I’d assume anything involving manual editing would be a lot more expensive, though.

YukiPhoenix · December 19, 2019, 7:08pm

Yeah, that makes sense. I’ve done the same to PDFs with Acrobat Pro. Thanks for the info.

Yeah I would expect it would be much more expensive, but thought I’d at least ask.

The normal service does sound very tempting, at least for the books whose physical copies I don’t have much attachment to.

Ncastaneda · December 23, 2019, 4:55pm

Well, the book is here… I mean, it was… then I tore it apart… and then I started the scanning process.

I even realized my scanner has a document feeder (!!), though sadly, with the book being so small and the sheets so thin, I ended up using the flatbed anyway.

Now for the PDF OCR: Acrobat has given mixed results. The OCR itself is OK for the most part. There are still some mistakes when kanji strokes are too close to each other, but the main issue comes when replacing the image-based characters with actual fonts, which is where my hopes were. An utter mess it turned out to be! Japanese text with lines in different orientations (like the book title as a horizontal header and the paragraphs in vertical), together with furigana, proved to be the doom of my aspirations of making the PDF into an e-book.

@koichi After watching the video from the Japanese scanning service, I was totally thrilled, since they mention doing the whole physical-book-to-e-book service, even adjusted to specific devices… sadly, luck wasn’t with me on this one. The author is on the list of authors whose books they aren’t allowed to scan.

So, PDF on a Kindle is as far as I’ll get with this one (unless someone knows of a better tool to turn an OCR’d PDF into an e-book).
I’m scanning the book by chapters, one each day. It’s not as difficult as I thought, though I probably won’t be repeating the experience.

koichi · December 23, 2019, 8:21pm

I wonder if you could send it to their other company, 1dollarscan, which is outside Japan’s jurisdiction. Though (assuming you’re in Japan) it’d be costly to send to the US just to get it scanned.

Or, how about an enterprising high school student looking for some UFO catcher money…?

Ncastaneda · January 8, 2020, 6:02pm

OK. A bit of an update with the project.

In the end I’m doing all the steps myself, and it hasn’t turned out to be such a terrible thing. For one, scanners are now much faster than I remember. Using the flatbed is OK, at least for scanning the book at a pace of 2 chapters every now and then (about 30-35 pages), which takes 10-15 minutes at 600 dpi.

With the scanned PDF, I go through Adobe’s OCR, which is also much faster than I recalled OCR being.
OCR is not perfect, though. It’s very accurate, but for some reason it will recognize characters I thought would surely fail given the number of strokes and how mushy they looked in the book (like 鬱), and then consistently fail on others that seemed easier, like the 「興」 from 興奮. It also failed to recognize 「壜」, which was the author’s choice for “bottle” (instead of 瓶). Given that the protagonist is an alcoholic, it appears quite often.

For the most part I can’t really complain about character recognition (except maybe for 「く」 and the “<” sign getting mixed up a couple of times). Honestly, it has in no case become the bottleneck I thought it could be in the process.
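In case it helps anyone doing the same cleanup: the 「く」/“<” mix-up is the kind of error that can be patched in the copied text before proofreading. Here is a minimal Python sketch, assuming the OCR output has already been pasted into a plain string. The function name `fix_misread_ku` and the character ranges are only an illustration, not something Acrobat provides.

```python
import re

# Rough definition of "Japanese character": hiragana, katakana and the
# common kanji block. This range is an assumption and can be widened.
JP = r"[\u3040-\u30ff\u4e00-\u9fff]"

# A "<" (half-width or full-width) is treated as a misread く only when it
# has a Japanese character on both sides.
_MISREAD_KU = re.compile(rf"(?<={JP})[<＜](?={JP})")


def fix_misread_ku(text: str) -> str:
    """Replace '<' with 'く' where it sits between two Japanese characters."""
    return _MISREAD_KU.sub("く", text)


if __name__ == "__main__":
    print(fix_misread_ku("早<帰る"))  # the "<" between 早 and 帰 is replaced
    print(fix_misread_ku("a < b"))    # left untouched
```

Because it only touches a “<” that sits between two Japanese characters, it will miss a 「く」 at the very start or end of a line, so the text still needs a read-through afterwards.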

Also, Acrobat has a feature that can replace all the images of characters with fonts, producing a clean document that clearly separates text from actual images. Sadly, characters with different orientations (book title as a horizontal header with the rest in vertical Japanese format), Japanese commas, and especially furigana play tricks on this feature, rendering it useless for Japanese.

仕方がないな。。。

The next step has been using this handy website

Here I copy/paste the OCR’d lines (skipping the furigana and checking for characters that weren’t recognized correctly) into a notepad and save them in .txt format, which is what then goes here

Where you can turn it into an EPUB file with the correct right-to-left, vertical-text e-book layout.
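For anyone curious what that layout amounts to inside the file: in EPUB 3 it comes down to `page-progression-direction="rtl"` on the spine plus `writing-mode: vertical-rl` in the stylesheet. The sketch below builds a bare-bones EPUB that way using only Python’s standard library. It is not how the website above works (I don’t know its internals); the function name, file names, identifier, and date are placeholders chosen for illustration.

```python
import io
import zipfile
from html import escape

CONTAINER = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

# The spine attribute makes pages turn right to left.
OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="id">{book_id}</dc:identifier>
    <dc:title>{title}</dc:title>
    <dc:language>ja</dc:language>
    <meta property="dcterms:modified">2020-01-08T00:00:00Z</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="css" href="style.css" media-type="text/css"/>
    <item id="text" href="text.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine page-progression-direction="rtl">
    <itemref idref="text"/>
  </spine>
</package>"""

# The stylesheet makes the text itself run vertically, columns right to left.
CSS = """html {
  -epub-writing-mode: vertical-rl;
  -webkit-writing-mode: vertical-rl;
  writing-mode: vertical-rl;
}"""

PAGE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xml:lang="ja">
<head><title>{title}</title><link rel="stylesheet" type="text/css" href="style.css"/></head>
<body>
{body}
</body>
</html>"""


def build_vertical_epub(title: str, paragraphs: list[str],
                        book_id: str = "urn:uuid:00000000-0000-0000-0000-000000000000") -> bytes:
    """Return the bytes of a one-chapter EPUB 3 with vertical, right-to-left text."""
    t = escape(title)
    text_body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    nav_body = f'<nav epub:type="toc"><ol><li><a href="text.xhtml">{t}</a></li></ol></nav>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as z:
        # The mimetype entry has to come first and be stored uncompressed.
        z.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        z.writestr("META-INF/container.xml", CONTAINER)
        z.writestr("OEBPS/content.opf", OPF.format(book_id=book_id, title=t))
        z.writestr("OEBPS/style.css", CSS)
        z.writestr("OEBPS/nav.xhtml", PAGE.format(title=t, body=nav_body))
        z.writestr("OEBPS/text.xhtml", PAGE.format(title=t, body=text_body))
    return buf.getvalue()


if __name__ == "__main__":
    with open("book.epub", "wb") as f:
        f.write(build_vertical_epub("テスト", ["吾輩は猫である。", "名前はまだ無い。"]))
```

Each line of the cleaned .txt would become one entry in `paragraphs`. Furigana, chapter breaks, and a cover are left out here, and how a given reader renders the result is something to check rather than assume.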

EPUBs can easily be turned into Kindle format with the Kindle Previewer app (just open and export as mobi).

And that’s it.

So, probably the best-case scenario for digitizing physical books into Kindle format is to live in Japan and use the service Koichi mentioned (though it’s the expensive version that provides e-books as the output).

If you’re not in Japan, or you’re the “DIY project” type, this route works too; now that the key steps are clear enough, I can see myself repeating the process with some of the books on my wishlist that don’t currently have an e-book version.

konekush · January 9, 2020, 8:46pm

Can anybody tell me what this means? Do they accept the book (and I’m assuming the others in the series), or is it one of the no-nos?

The publisher:

Should I bother battling with the Japanese, or should I just give up?

Ncastaneda · January 9, 2020, 9:44pm

It’s ok.

This is what you get when there’s a formal prohibition from the author

konekush · January 9, 2020, 9:57pm

On the one hand, yay.

On the other, I need to figure out how to navigate this thing now.

Thank you!!

konekush · January 10, 2020, 3:21pm

Ok, I can’t figure it out. @koichi would you be willing to let me know what the process is and what I should be clicking/filling in? Is there an issue with me being from outside of Japan? Is there a support email address and/or an English support channel?

koichi · January 10, 2020, 4:10pm

What’re you having trouble with?

eefi · January 10, 2020, 4:44pm

It’s one of those things I’m still trying to figure out for myself. I’ve gone through the steps of converting ebooks into .txt so I can use Yomichan in the browser. So for any book my most preferred format is actually txt.
Then I also found that I like working through textbooks and JLPT prep books much more on my tablet than printing them out. I also always make Anki cards from the grammar example sentences. I’ve now written a page where I can paste a screenshot and it’ll give me the text OCR’d (through the Google Vision API), with some more input fields and a button to feed it directly into Anki.
The same thing could theoretically be used to OCR whole pages. I’m currently testing that, but Google Vision throws away a lot of furigana when the text is vertical, so that’s what I’m trying to solve right now…

konekush · January 10, 2020, 4:46pm

I managed to pay for 8 volumes! I’m stuck here:

From what I understand, I need to print out this page and put it in the cardboard box along with the books, and write the transaction number on the box, which I can’t do if I’m ordering directly from Amazon to them. I also found this in their FAQ, which says only premium members can send to them directly from Amazon because otherwise books won’t reach there on the correct day.

I also think that with under 10 volumes, I can send the books at any time? Does that negate the ban on sending directly from Amazon for non-premium members?

Long babbling made short: What do I need to write in the Amazon address for it to reach them, without having to insert a print (maybe post it as a gift from Amazon and put the transaction number and my user ID in the greeting card?) or write the transaction number on the box itself, and when should I send it?

[confused pistachio.gif]

Edit: I don’t mind paying extra for having them receive it directly from Amazon, if needed.

Edit 2+3: This is what I put in the Amazon address in the meantime, based on the example on 1dollarscan but with what amazon.co.jp would accept in an address. The struck-out part is my transaction number, followed by the user number.