Let's Help Digitize Newspapers (Hoji Shinbun Japanese Diaspora Initiative)

What to Know

Stanford University has an online archive of overseas Japanese newspapers called Hoji Shinbun. The archive has 1,130,286 pages in total, but only 97 of those pages have corrected transcriptions.

The more pages are known to be correctly transcribed, the better the quality of the OCR gets for the remaining pages, so it would be great to get more of the transcriptions checked over.

There is material for learners at every level who might want to participate. Absolute beginners could start with the English sections or English papers (like H. C. & S. Breeze), with checking article title transcriptions, or with correcting kana mistakes (often に is mis-transcribed as !:, た as だ, etc.). Intermediate learners could start with newspapers that have furigana, like the Hawaii Times or Hawai Sutā. Those looking for a particular challenge could try their hand at transcribing the sometimes hand-written newspapers of the 1800s.

Personally, I have been finding it an enjoyable and meaningful way to expand my vocabulary and get some experience reading older texts these past couple of days.

Some of the newspapers like The Rafu Shimpo are still around, but there are also newspapers like Asahi which share names with currently-active newspapers yet are not the same paper.

Thread's Purpose

  • Look at this cool image/kanji/info I found
  • I can’t read this word, looking for help
  • Discussing the archive (note: I’m not affiliated with it in any way so I can’t answer questions beyond what’s on the site)
  • Sharing your stats (rank, lines corrected)

1800s-1900s Japanese vs Modern Japanese

The first thing you might notice looking over the archive’s issues is that the kanji look a bit… funny. 変 may look like 變 or 広い like 廣い. Those “funny” kanji are (often) 旧字体(きゅうじたい), Japan’s version of traditional characters. You get used to them with repeated exposure. 新字体(しんじたい), the kanji we use now, were made standard in 1946, but it took time for usage to catch up. This may have been doubly true of the diaspora, with the Hawaii Times still using 旧字体 in the 1980s. (This is a complete guess on my part, but it probably wasn’t the most convenient of things for a 1950s local newspaper to import large amounts of new type from across the world.) I would recommend using a reference such as this one. I’ve also been noting down some words that I couldn’t read for my own reference since this morning.

Some of my notes so far:
應ぜず(おうぜず)
叮嚀(ていねい)
盡く(つく)
解釋(かいしゃく)
廣吿(こうこく)
言辭(げんじ)
欺僞(さぎ)
總て(すべて)
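
For anyone who wants to look these words up in a modern dictionary, here is a minimal sketch of an old-form to new-form converter. It assumes Python, and the table only contains character pairs that appear in this thread, so it is nowhere near complete. The names `OLD_TO_NEW` and `to_shinjitai` are made up for this example. It is meant for lookups only, not for changing what you type into the archive.

```python
# Old-form -> new-form pairs taken from the examples in this thread.
# This is NOT a complete table; extend it as you meet new characters.
OLD_TO_NEW = {
    "變": "変", "廣": "広", "應": "応", "盡": "尽", "釋": "釈",
    "辭": "辞", "僞": "偽", "總": "総", "藝": "芸", "團": "団",
    "譯": "訳", "飜": "翻", "吿": "告",
}

# str.maketrans accepts a dict of single-character keys.
_TABLE = str.maketrans(OLD_TO_NEW)


def to_shinjitai(text: str) -> str:
    """Replace every known old-form character; leave everything else as-is."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    for word in ["廣吿", "解釋", "飜譯", "藝能團体"]:
        print(word, "->", to_shinjitai(word))
```

Characters that are not in the table pass through unchanged, so a word like 叮嚀 (a different spelling of 丁寧 rather than an old form) comes back as it went in.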

Other odd kanji you might come across are 略字(りゃくじ) like 㐰(信、個) or somewhat obscure numerical kanji like 廿・卄 (20).

The kana usage might seem a little different, too. The furigana for 港 might be “かう” rather than “こう”, or 伝える might be written as 伝へる. This is called 歴史的仮名づかい(れきしてきかなづかい). It’s not necessary to know for checking transcriptions, but it can be useful for reading comprehension. There is a list on this webpage.

You may also encounter some classical grammar. Where you’d expect 帰らない you might find 帰らず, or 痛ましむ in place of 痛ませる. If you want to understand those a little bit better, I recommend the トライイット course on 古文. There are also a good many textbooks aimed at high schoolers taking university examinations that might come in handy. Just keep in mind that, especially with older articles, the grammar might not be something you want to reference in your own writing or need to learn in order to read modern Japanese media.

Stats as of April 7th (@ me to update)

Current text correctors: 152
Known WK users: 1

Total lines corrected: 563,149
WK lines corrected: 737

Ranking positions of WK correctors: 30 [GearAid]


Full List
(note: only newspapers accessible from outside of Stanford are listed)

WIP List of Newspapers with Legible Furigana:
Hawaii Times (recommended)
Asahi (1915)
Dōhō
Heimin
Shinsekai Asahi Shinbun
Sōkō Bunko
Sōkō no Shiori (note: full furigana on part of the text)
Sōkō Shinbun
Yuta Nippō (note: partial furigana on all text)
Hawai Asahi
Hawai Bun’en
Hawai Mainichi
Hawai Nichi Nichi Shinbun
Hawai Sandei Nyūsu
Hawai Sutā (note: only some issues)
Jitsugyō no Hawai
Kawai Shinpō

Not-so-legible:
Ōshū Nippō
Rokkī Nippon (does have one or two less-blurry issues)
Shinsekai
Shinsekai Nichinichi Shinbun
Taihoku Nippō
Gyōshō
Hawai Shokumin Shinbun
Hawai Hiro Shinpō

WIP List of Newspapers with English Sections:
Nichibei Shinbun
Nihonjin
Shikago Shinpō
Hawai Mainichi
Hawai Shōgyō
Hawai Sutā
Hawaii Times
Jitsugyō no Hawai
Kawai Shinpō
Note to self: currently on Hawaii: Kazan

Something seems to have gone wrong with your Asahi link, just fyi. The others seem OK.


Thanks! I messed up the link formatting


This one has three different types of (what I think are) emphasis markings. They really mess up the OCR lol

Screenshot 2024-04-07 at 5.09.02 PM
Screenshot 2024-04-07 at 5.09.13 PM
Screenshot 2024-04-07 at 5.09.21 PM

There is also one section with furigana but I’m debating whether to list it since it’s so short


Today’s word: 提灯記事(ちょうちんきじ)“puff piece (in a newspaper, etc.); flatteringly exaggerated article”

Random screenshots of hard to read characters

From here
Screenshot 2024-04-08 at 11.14.23 AM
This is probably a か?
Screenshot 2024-04-08 at 11.24.22 AM
This one seems a bit impossible to figure out though lol
Most likely 書いてある

Screenshot 2024-04-08 at 11.48.46 AM
The first one here is an absolute mystery to me. Preceded by 日本文化

I’ve been leaving some kanji as :question: so that I know to come back to them and look them up properly. I don’t think it affects the OCR until the whole box is marked correct? Could be wrong


It was actually the easiest for me, as I recognized the word.
藝(芸)能團(団)体

And I wonder if it isn’t also 團体 on your 1st image (with bad inking on the horizontal strokes)


Thanks! It is 團体, yeah

飜譯(翻訳)
Screenshot 2024-04-10 at 1.28.31 PM

The space between 飜 and 譯 doesn’t seem right, as if the top of 譯 had been ripped off.
It sure doesn’t help