I recently released a website to help people read Japanese text (https://jpaste.me/). It didn’t attract much attention (I didn’t do much to publicize it either), so I decided to release all my data as a GraphQL API: https://rapidapi.com/rlemaigre/api/japanese-text-analysis/details. If you log in you can try out the example queries.
With this API you can get word translations, kanji readings and meanings (including statistics about readings), kanji images (calligraphy and stroke order) and morphological analysis of text. It can break a Japanese text down into chunks of characters (tokens) and give you the dictionary form (lemma) associated with each chunk.
Morphological analyzers like Kuromoji use pre-made dictionaries to get the readings of kanji.
IPADIC is the default one because it’s much smaller. However, this means it doesn’t contain a lot of common vocabulary. Even ひとり is not there.
In contrast, UNIDIC is much larger and contains more common vocabulary. You would probably need to follow the links here, download the dictionary file and change the Maven settings of your Kuromoji project.
For the JavaScript port, iirc only the IPADIC version is supported. If you are using Node, you might want to try out the JS port of MeCab.
There are no JavaScript ports of MeCab, unfortunately. What you are referring to is just a wrapper…you need to have the MeCab binaries installed for it to work. As far as I know, this makes it unsuitable for deployment on Google Cloud, which is what I use right now.
But Google Cloud Functions can be written in Java as well (I did it in JS), so I may use the original Kuromoji with UNIDIC…might be faster than the JS port as well.
OK, it uses UNIDIC now. I can’t judge for myself whether it is any better because I don’t know enough Japanese.
The website actually uses the GraphQL API now. Everything you see in your browser comes from a single endpoint that is free for anyone to use and can be used to build similar tools. The playground is here: https://jpaste.me/graphiql (or on RapidAPI). Here are a few simple queries:
Ten most frequent kanji with their meanings, readings (with statistics) and the three most frequent words: query
Ten most frequent words with frequencies, translations, example sentences and a calligraphy image (SVG in base64) for each character: query
Everything is located on a much better server than before (I quit Google Cloud and rented a small VPS in Germany) so it’s faster (at least from here in Europe).
I hope someone can make good use of this.
Now I’d like to build a Chrome extension that does basically the same thing as the website, but on any web page. Such extensions already exist, but maybe some people will prefer mine…we’ll see.
I prefer the look of your popup over existing ones like Yomichan, Tenten, and Rikaichamp. Rikaichamp is the ugliest of them all, and images are still wonky in Yomichan. I like that Japanese.io has example sentences, but it requires you to log in and use their platform. I prefer Anki and Yomichan instead.
Just a side note, if you decide to make one: for the parsing you want to use a method called “forward scanning”. This lets you return multiple entries when you scan a word, so when someone scans 牛丼, they can also see the definitions of 牛 and 丼.
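To make the forward scanning idea concrete, here is a minimal sketch in Python (not how any of the extensions above actually implement it). The dictionary contents, `MAX_LOOKUP_LENGTH`, and both function names are made up for illustration. Plain prefix scanning from the cursor finds 牛丼 and 牛; to also surface 丼, this sketch repeats the scan at each later character of the longest match, which is just one possible way to get the behaviour described.

```python
# Toy dictionary; the entries and glosses are illustrative only.
DICTIONARY = {
    "牛丼": "beef bowl",
    "牛": "cow",
    "丼": "rice bowl",
}

MAX_LOOKUP_LENGTH = 8  # hypothetical cap on how far ahead of the cursor to look


def scan_forward(text, start, dictionary=DICTIONARY, max_len=MAX_LOOKUP_LENGTH):
    """Return (word, gloss) for every prefix of text[start:] found in the
    dictionary, longest match first."""
    window = text[start:start + max_len]
    matches = []
    for end in range(len(window), 0, -1):  # try the longest prefix first
        candidate = window[:end]
        if candidate in dictionary:
            matches.append((candidate, dictionary[candidate]))
    return matches


def lookup_with_components(text, start, dictionary=DICTIONARY):
    """Prefix matches at the cursor, plus matches that start at later
    characters inside the longest match."""
    results = scan_forward(text, start, dictionary)
    if not results:
        return []
    longest = results[0][0]
    for offset in range(1, len(longest)):
        for match in scan_forward(longest, offset, dictionary):
            if match not in results:
                results.append(match)
    return results


for word, gloss in lookup_with_components("牛丼を食べた", 0):
    print(word, gloss)
```

A real extension would run the deinflection step on each candidate before the dictionary lookup, otherwise conjugated verbs and adjectives never match.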
Forward scanning requires a deinflector. You can check the deinflector implementations in Yomichan and Rikaichamp for reference. The idea is to encode Japanese inflection rules and patterns so that, from one phrase, you can derive every candidate form that might be an entry in the dictionary. Here is an example of a deinflect file. Another example used in Rikaichan.
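A rough sketch of what a rule-based deinflector does, in Python. The rule table here is a tiny made-up subset and is not the file format Yomichan or Rikaichamp use; `RULES`, `KNOWN_WORDS`, `deinflect` and `lookup` are all hypothetical names. Each rule rewrites an inflected ending into a possible dictionary-form ending, rules are applied repeatedly so that chains like 食べなかった → 食べない → 食べる work, and the dictionary is what filters out the nonsense candidates.

```python
# (inflected ending, replacement ending, reason) - illustrative subset only
RULES = [
    ("った", "る", "past"),      # 始まった -> 始まる
    ("った", "う", "past"),      # 買った -> 買う
    ("った", "つ", "past"),      # 待った -> 待つ
    ("かった", "い", "past"),    # 食べなかった -> 食べない
    ("た", "る", "past"),        # 食べた -> 食べる
    ("ない", "る", "negative"),  # 食べない -> 食べる
    ("ます", "る", "polite"),    # 食べます -> 食べる
]

# Toy dictionary of known dictionary forms.
KNOWN_WORDS = {"始まる", "食べる", "買う", "待つ"}


def deinflect(word):
    """Return every (candidate, reasons) pair reachable by applying RULES
    zero or more times. Most candidates are not real words."""
    candidates = [(word, [])]
    seen = {word}
    i = 0
    while i < len(candidates):
        current, reasons = candidates[i]
        i += 1
        for inflected, base, reason in RULES:
            # Require a non-empty stem in front of the ending.
            if current.endswith(inflected) and len(current) > len(inflected):
                rewritten = current[:-len(inflected)] + base
                if rewritten not in seen:
                    seen.add(rewritten)
                    candidates.append((rewritten, reasons + [reason]))
    return candidates


def lookup(word, dictionary=KNOWN_WORDS):
    """Keep only the candidates that are actual dictionary entries."""
    return [(c, reasons) for c, reasons in deinflect(word) if c in dictionary]


print(lookup("始まった"))
print(lookup("食べなかった"))
```

Real deinflectors also track the word class each rule expects and produces, so that a godan-only rule is not accepted for an ichidan verb. This sketch skips that and relies purely on the dictionary check.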
I just tried a random sentence (literally the next one in the book I was reading), and only a few things got turned to blue. Also, why does the furigana of 始まった include まっ (and why is the つ in katakana?)
Speaking of which, the bugs I reported 2 months ago (failure to parse 明け, well, 明け切る technically, and katakana ツ as furigana for the hiragana つ) are still here.
Ah, and the dictionary doesn’t know words like 箱船 and パーティー. I can look them up if I want to, but that kinda defeats the purpose of the tool.
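On the furigana issues (まっ showing up above 始まった, and katakana ツ above つ): I don't know how the site builds its furigana, but assuming the analyzer hands back one katakana reading for the whole token, one way to deal with both is to convert the reading to hiragana and strip the trailing kana that the surface form already spells out. A sketch in Python, with made-up helper names:

```python
def katakana_to_hiragana(text):
    """Shift katakana (ァ..ヶ) down to the matching hiragana (ぁ..ゖ).
    Other characters, including the long-vowel mark ー, are left alone."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )


def is_kana(ch):
    return "ぁ" <= ch <= "ゖ" or "ァ" <= ch <= "ヶ" or ch == "ー"


def split_furigana(surface, reading):
    """Split a token into (kanji part, reading of the kanji part, okurigana)
    by removing the trailing kana shared by the surface form and the reading."""
    reading = katakana_to_hiragana(reading)
    surface_hira = katakana_to_hiragana(surface)
    n = 0
    while (
        n < len(surface)
        and n < len(reading)
        and is_kana(surface[-1 - n])
        and surface_hira[-1 - n] == reading[-1 - n]
    ):
        n += 1
    if n == 0:
        return surface, reading, ""
    return surface[:-n], reading[:-n], surface[-n:]


print(split_furigana("始まった", "ハジマッタ"))
```

With this, only 始 would get furigana (はじ), and a token that is already all kana ends up with an empty reading, so nothing is drawn above it. It only handles kana at the end of a token; words with kana between kanji (like 明け切る) would need a proper alignment step.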