Hi everyone !
I recently released a website to help people read Japanese text: https://jpaste.me/. It didn’t attract much attention (I didn’t do much to popularize it either), so I decided to release all my data as a GraphQL API: https://rapidapi.com/rlemaigre/api/japanese-text-analysis/details. If you log in, you can try out the example queries.
GraphQL is awesome. You may learn about it here : https://graphql.org/.
With this API you can get word translations, kanji readings and meanings (including statistics about readings), kanji images (calligraphy and stroke order) and morphological analysis of text. It can break a Japanese text down into chunks of characters (tokens) and give you the dictionary form (lemma) associated with each chunk.
It is free.
I hope you enjoy it and don’t hesitate to share !
May I suggest using UNIDIC instead of IPADIC?
Could you please elaborate ? I do not know Japanese myself…
Morphological analyzers like Kuromoji use pre-made dictionaries to get the readings of kanji.
IPADIC is the default one because it’s much smaller. However, this means it doesn’t contain a lot of common vocabulary. Even ひとり is not there.
In contrast, UNIDIC is much larger and contains more common vocabulary. You would probably need to follow the links here, download the dictionary file, and change the Maven settings in your Kuromoji setup.
But Google Cloud Functions can be written in Java as well (I did it in JS), so I may use the original Kuromoji with unidic…might be faster than the JS port as well.
Ok it uses unidic now. I can’t judge by myself if it is any better because I don’t know enough Japanese.
The website actually uses the GraphQL API now. Anything you see in your browser is extracted from a single endpoint that is free for anyone and can be used to build any similar tool. Playground is here : https://jpaste.me/graphiql or on RapidAPI. Here are a few simple queries :
- Ten most frequent kanji with their meanings, readings (with statistics) and the three most frequent words : query
- Ten most frequent words with frequencies, translations, example sentences and a calligraphy image (svg in base64) for each character : query
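For anyone who wants to call the endpoint from a script instead of the playground, here is a minimal sketch of how a GraphQL query is usually sent over HTTP. The endpoint URL below is a placeholder (only the playground address is given above), and the query is the standard GraphQL introspection query rather than anything specific to this schema, so treat both as assumptions.

```python
import json
import urllib.request
from typing import Optional

# Placeholder endpoint (an assumption): replace it with the real GraphQL URL.
ENDPOINT = "https://example.invalid/graphql"

# Standard GraphQL introspection query. It lists the root query fields
# without assuming anything about this particular schema.
INTROSPECTION_QUERY = "{ __schema { queryType { fields { name } } } }"


def build_graphql_request(
    endpoint: str, query: str, variables: Optional[dict] = None
) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST request carrying a GraphQL query."""
    payload = {"query": query}
    if variables:
        payload["variables"] = variables
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_graphql_request(ENDPOINT, INTROSPECTION_QUERY)
# To actually send it: json.loads(urllib.request.urlopen(request).read())
```

If introspection is enabled on the server, the response should list the available root fields, which is a reasonable starting point before writing real queries.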
Everything is located on a much better server than before (I quit Google Cloud and rented a small VPS in Germany) so it’s faster (at least from here in Europe).
I hope anyone can make good use of any of this.
Now I’d like to build a Chrome extension that does basically the same as the website but on any web page. Such extensions already exist, but maybe some people will prefer mine…we’ll see
The coverage is definitely better now! 一人 and 牛丼 are now working!
I prefer the look of your popup over existing ones like Yomichan, Tenten, and Rikaichamp. Rikaichamp is the ugliest looking of them all, and images are still wonky in Yomichan. I like that Japanese.io has example sentences, but Japanese.io requires you to log in and use their platform. I prefer Anki and Yomichan instead.
Just a side note, if you’re deciding to make one: for the parsing you want to use a method called “forward scanning”. This is so you can have multiple entries when you scan a word, so when someone scans 牛丼, they can also see the definitions of 牛 and 丼.
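To make the idea concrete, here is a rough sketch of forward scanning as I understand it: starting from the character under the cursor, try every substring that begins there, longest first, and keep each one that is a dictionary entry. The toy dictionary, the function name `forward_scan` and the `max_len` limit are made up for illustration and are not taken from any existing extension.

```python
# Toy dictionary; the entries and glosses are illustrative only.
DICTIONARY = {
    "牛": "cow",
    "丼": "bowl of rice with toppings",
    "牛丼": "beef bowl",
}


def forward_scan(text: str, start: int, dictionary: dict, max_len: int = 8) -> list:
    """Return every dictionary entry that begins at `start`, longest first."""
    matches = []
    longest = min(max_len, len(text) - start)
    for length in range(longest, 0, -1):
        candidate = text[start:start + length]
        if candidate in dictionary:
            matches.append((candidate, dictionary[candidate]))
    return matches


sentence = "牛丼を食べる"
at_first_char = forward_scan(sentence, 0, DICTIONARY)   # entries starting at 牛
at_second_char = forward_scan(sentence, 1, DICTIONARY)  # entries starting at 丼
```

In this sketch, scanning from 牛 yields both 牛丼 and 牛, while 丼 is found by scanning from the next character, so a popup that wants all three would combine the results of both positions.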
Forward parsing requires a deinflector. You can check the deinflector implementations in Yomichan and Rikaichamp for reference. It involves encoding the inflection rules and patterns of Japanese so that all possible dictionary entries can be extracted from one phrase. Here is an example of a deinflect file. Another example used in Rikaichan.
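As a very simplified illustration of what a deinflector does (this is not how Yomichan or Rikaichamp actually implement it), the sketch below applies a handful of suffix-rewriting rules to produce candidate dictionary forms and keeps only the candidates that exist in the dictionary. The rule table, the function names and the one-entry dictionary are all invented for the example; real rule files are far larger and also track word classes to rule out impossible forms.

```python
# Tiny, illustrative rule table: (inflected ending, dictionary-form ending).
RULES = [
    ("った", "る"),    # 始まった -> 始まる
    ("った", "う"),    # 買った -> 買う
    ("った", "つ"),    # 待った -> 待つ
    ("ました", "る"),  # 食べました -> 食べる
    ("ない", "る"),    # 食べない -> 食べる
    ("た", "る"),      # 食べた -> 食べる
]


def deinflect(word: str) -> list:
    """Return the word itself plus every candidate base form the rules produce."""
    candidates = [word]
    for ending, replacement in RULES:
        if word.endswith(ending) and len(word) > len(ending):
            candidate = word[: -len(ending)] + replacement
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates


def lookup(word: str, dictionary: dict) -> list:
    """Keep only the candidates that are actual dictionary entries."""
    return [(c, dictionary[c]) for c in deinflect(word) if c in dictionary]


TOY_DICTIONARY = {"始まる": "to begin"}  # illustrative entry only
found = lookup("始まった", TOY_DICTIONARY)
```

Most generated candidates are not real words (for example 始まう), which is why the dictionary check at the end matters: the rules over-generate and the dictionary filters.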
Oof, not very phone friendly.
I just tried a random sentence (literally the next one in the book I was reading), and only a few things got turned to blue. Also, why does the furigana of 始まった include まっ (and why is the つ in katakana? )
Thank you for your comment. I’ll definitely look into forward parsing
I don’t intend to make this usable on phones for now.
Can you paste me the sentence above as text, rather than image, so I can try it out myself and look into what is happening ? Thank you
If you just made it possible to scroll around, it would be fine
Been testing it again today. Could you overwrite mistakes for common words like 私(わたし) and 女(おんな) ?
Speaking of which, the bugs I reported 2 months ago (failed to parse 明け, well 明け切る technically, katakana ツ as furigana for the hiragana つ) are still here
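On the katakana furigana, my guess (and it is only a guess) is that the analyzer's dictionary returns readings in katakana and one of them slips through without being converted. If that is the cause, a conversion step like the sketch below would cover it; the function name is mine, and it relies only on the fact that the katakana letters sit exactly 0x60 code points above their hiragana counterparts in Unicode.

```python
def katakana_to_hiragana(text: str) -> str:
    """Convert katakana letters to hiragana, leaving everything else untouched."""
    result = []
    for ch in text:
        code = ord(ch)
        # Katakana ァ (U+30A1) .. ヶ (U+30F6) map onto
        # hiragana ぁ (U+3041) .. ゖ (U+3096) by subtracting 0x60.
        if 0x30A1 <= code <= 0x30F6:
            result.append(chr(code - 0x60))
        else:
            # Kanji, hiragana, the long-vowel mark ー, punctuation, etc.
            result.append(ch)
    return "".join(result)


reading = katakana_to_hiragana("ハジマッタ")
```

This only addresses the script of the reading; trimming the furigana so it covers just the kanji (はじ over 始) is a separate step.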
Ah, and the dictionary doesn’t know words like 箱船 and パーティー. I can look them up if I want to, but that kinda defeats the purpose of the tool
Sorry I’ve been lazy lately. I’ll post in this thread whenever the bugs are fixed (if they can actually be fixed).