Any source to parse Japanese texts into ready-to-use vocabs?

TL;DR: I want to find a working app or script that outputs Japanese vocab from input texts. A typical text has about 1000~2000 words.
(Optional: it would be really appreciated if it could return the meaning of each vocab word as well.)

Is there any app or easy-to-use tool/script out there that turns Japanese texts into a vocabulary list? Or even just rough CSV/Excel data with one column for the vocab and another for the definition (optional)?

I’ve tried Yochimu but it doesn’t seem to be working now. I also tried to run the script from this blog (but failed because I couldn’t install the Nagisa module on Windows).

Text parsers on GitHub sound good, but I couldn’t get them working the way I want (due to my technical limitations, I guess).

Edit: I’m trying to use MeCab (a bit outdated but still good). MeCab is working now, but its output isn’t easy to use as is (many unnecessary vocab and grammar terms have to be dealt with…).
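A minimal sketch of that clean-up step, assuming MeCab was run with the IPADIC dictionary and its default output format (one `surface<TAB>feature,feature,...` line per token, with the dictionary form in the seventh feature field). The sample text is hand-written in that format rather than real MeCab output, and `extract_vocab`, `to_csv`, `KEEP_POS` and `SKIP_SUBTYPES` are names made up for this example. It keeps nouns, verbs and adjectives, drops particles and other grammar pieces, and writes a vocab column plus an empty definition column.

```python
import csv
import io

# Parts of speech to keep, using IPADIC's tag names (assumption: the
# dictionary is IPADIC; UniDic uses different tags and column order).
KEEP_POS = {"名詞", "動詞", "形容詞"}
# Subcategories that are mostly grammar glue or noise, not vocabulary.
SKIP_SUBTYPES = {"非自立", "接尾", "数", "代名詞"}


def extract_vocab(mecab_output):
    """Return unique dictionary forms, in order of first appearance."""
    seen = []
    for line in mecab_output.splitlines():
        if line == "EOS" or "\t" not in line:
            continue  # sentence terminator or an unexpected line
        surface, feature = line.split("\t", 1)
        fields = feature.split(",")
        if len(fields) < 7:
            continue  # unknown words may carry fewer fields
        pos, subtype, lemma = fields[0], fields[1], fields[6]
        if pos not in KEEP_POS or subtype in SKIP_SUBTYPES:
            continue
        word = surface if lemma == "*" else lemma
        if word not in seen:
            seen.append(word)
    return seen


def to_csv(words):
    """Build CSV text with a vocab column and an empty definition column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["vocab", "definition"])
    for word in words:
        writer.writerow([word, ""])
    return buffer.getvalue()


# Hand-written sample in the IPADIC output format (not real MeCab output).
SAMPLE = (
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ\n"
    "魚\t名詞,一般,*,*,*,*,魚,サカナ,サカナ\n"
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ\n"
    "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベル,タベル\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "。\t記号,句点,*,*,*,*,。,。,。\n"
    "EOS\n"
)
```

Calling `to_csv(extract_vocab(SAMPLE))` gives text that can be saved to a `.csv` file and opened in Excel; the definition column is left blank because looking up meanings needs a separate dictionary source.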

Any idea or tip is really appreciated. Thank you!

Disclaimer: I don’t know how, but up to now this thread reads like a marketing post for Kitsun. Please note that I’m not endorsed by Kitsun in any way; the discussion just happened to lean towards Kitsun. In my opinion the app is worth a try, and I’m actually thinking about subscribing.

1 Like

Kitsun’s reader can parse it for you in a very straightforward way, but it won’t make the list that you want. What it does is that it allows you to upload any text, parses it for you, and as you go on reading it you can generate cards from the words you don’t know.

4 Likes

Thanks! Kitsun is getting better and better. Maybe I should suggest my idea to @neicul (and I’d be happy to pay for it ever after xD)

The idea is to have a list of the hard vocab in a text to learn beforehand, and then read. It’s similar to the idea of the author of this blog post: Extracting vocabulary words from a Japanese text | The blog of a right leg

it would be nice that when I want to read a text, I would be given a list of words appearing in that text, so that I could study it beforehand.
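A rough sketch of that pre-study idea, assuming the text has already been turned into a list of dictionary-form words by some tokenizer and that you keep your own set of words you already know. `prestudy_list` is a hypothetical helper name, not part of any tool mentioned in this thread.

```python
from collections import Counter


def prestudy_list(words, known_words, limit=None):
    """Rank the words you don't know yet by how often they appear."""
    counts = Counter(word for word in words if word not in known_words)
    # most_common() sorts by count, highest first.
    return counts.most_common(limit)


# Hypothetical tokenized text and known-word set, for illustration only.
WORDS = ["猫", "魚", "食べる", "魚", "走る", "魚", "食べる", "猫"]
KNOWN = {"猫"}
```

Here `prestudy_list(WORDS, KNOWN)` ranks 魚 first, then 食べる, then 走る, so the most frequent unknown words can be studied before reading.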

3 Likes

There’s jReadability: https://jreadability.net/sys/en

It generates a vocab list with frequency info as well, but I don’t know how reliably it parses Japanese text (haven’t used it myself).

1 Like

Thanks for the tag :smiley:

I’m in the middle of implementing a global vocabulary table in the Kitsun reader. In that table you’ll be able to sort/filter on frequency, JLPT level and some other things for all the vocabulary of a text/book. Pretty much exactly the functionality you are describing, for whichever text or book you throw in there :smiley:

If you know how to work with the dev console in the browser, you can already view all unique vocab and see their stats:


(You can right click it, save it as a variable and then use that variable to sort/loop through the list or whatever you might want to do with it, not sure how technical you are ^^)

Regarding MeCab, it’s nice to use as a basis, but it requires a lot of extra parsing afterwards, with grammar rules based on the token types, to make sure certain words get seen as one word rather than 3-4 words. To give a simple example, 忘れています could parse as 忘れ - て - います, and it’s about way more than just conjugations.
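A minimal sketch of one such rule, not Kitsun’s actual implementation: it glues auxiliary pieces back onto the independent verb in front of them. The `Token` type, the helper names and the hand-written token list are assumptions for illustration; the part-of-speech tags follow IPADIC naming, under which 忘れています typically comes out as four tokens.

```python
from typing import NamedTuple


class Token(NamedTuple):
    surface: str  # text as it appears in the sentence
    pos: str      # main part of speech
    subtype: str  # part-of-speech subcategory
    lemma: str    # dictionary form


def is_verb_tail(token):
    """True for pieces that usually belong to the preceding verb."""
    return (
        token.pos == "助動詞"
        or (token.pos == "動詞" and token.subtype == "非自立")
        or (
            token.pos == "助詞"
            and token.subtype == "接続助詞"
            and token.surface in ("て", "で")
        )
    )


def merge_verb_phrases(tokens):
    """Return (surface, lemma) pairs with verb tails merged into the verb.

    A merged phrase keeps the dictionary form of its head verb.
    """
    merged = []
    i = 0
    while i < len(tokens):
        head = tokens[i]
        surface = head.surface
        i += 1
        if head.pos == "動詞" and head.subtype == "自立":
            # Absorb everything that still belongs to this verb.
            while i < len(tokens) and is_verb_tail(tokens[i]):
                surface += tokens[i].surface
                i += 1
        merged.append((surface, head.lemma))
    return merged


# Hand-written tokens in IPADIC style (not real MeCab output).
TOKENS = [
    Token("宿題", "名詞", "一般", "宿題"),
    Token("を", "助詞", "格助詞", "を"),
    Token("忘れ", "動詞", "自立", "忘れる"),
    Token("て", "助詞", "接続助詞", "て"),
    Token("い", "動詞", "非自立", "いる"),
    Token("ます", "助動詞", "*", "ます"),
]
```

With these tokens, `merge_verb_phrases(TOKENS)` keeps 宿題 and を as they are and returns 忘れています as a single entry with the dictionary form 忘れる. A real pipeline would need many more rules than this one (compound nouns, suffixes, set expressions and so on).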

3 Likes

That’s crazy! Well, so maybe I’ll put forward my ultimate suggestion (≧ω≦)

You can see a demo on this site: https://www.japanese.io/vision (requires an account)
IMO there are many users who like to read manga (and learn Japanese just to read manga), and many manga sources come in the form of images.

So, if Kitsun could take manga images as input, the app could OCR them. Basically it’s a reading assistance tool, but for the text inside images instead of plain text (you can see how it works at the link above).

PS: Anyway, I just came across this review of Japanese resources, and maybe it will give the Kitsun team some motivation to take the reading assistance to the next level:

Although Kitsun.io is positioned as a user-friendly replacement for Anki, it’s actually the Japanese “Reading Assistance Tool” that captured my attention. Copy Japanese text from any source (or type content manually), and Kitsun.io will process it, adding furigana and granting you the power to instantly generate flashcards with the click of your mouse button. This tool is an excellent way to work through the first three free chapters of Iwata-san. As of 2021, Kitsun.io has joined Wanikani and Bunpro to form the trinity of Japanese learning apps that I use daily.

2 Likes

Really nice!! The JLPT level isn’t there yet, I think, but it labels the vocab with levels such as elementary or lower advanced, which is cool. So far it’s the best easy-to-use resource that comes close to what I intended.

1 Like

I think OCR would also be cool to implement into the reading tool eventually, but is very difficult to do well. The ones I’ve tried so far often don’t recognise the text well enough due to various factors. But in general it would be really cool to add this yeah! :grinning_face_with_smiling_eyes:

I wasn’t aware of this article, that’s super awesome! Thanks for sharing! :smiley:

1 Like

Sure, I’m glad I could be of a little help.

Anyway, for the OCR thing, one possible reference is GitHub - sakarika/kanjitomo-ocr: Java library for identifying Japanese characters from images, which is the OCR code of KanjiTomo published by its author. Another version in Rust is here

2 Likes

I wrote a small script based on the idea in this post here:
