New GraphQL API with dictionary, kanji dictionary, SVG images and morphological analyser

Hi everyone!

I recently released a website to help people read Japanese text: https://jpaste.me/. It didn’t attract much attention (I didn’t do much to popularize it either), so I decided to release all my data as a GraphQL API: https://rapidapi.com/rlemaigre/api/japanese-text-analysis/details. If you log in, you can try out the example queries.

GraphQL is awesome. You can learn about it here: https://graphql.org/.

With this API you can get word translations, kanji readings and meanings (including statistics about readings), kanji images (calligraphy and stroke order) and morphological analysis of text. It can break down a Japanese text into chunks of characters (tokens) and give you the dictionary form (lemma) associated with each chunk.
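For anyone who has not used GraphQL before, here is a minimal sketch of what a request for tokens and lemmas might look like. The field names (`analysis`, `tokens`, `surface`, `lemma`) are placeholders I made up for illustration, not the API's actual schema, so check the playground for the real ones. The only part that follows the usual GraphQL-over-HTTP convention is the JSON body with `query` and `variables` keys.

```python
import json

# Hypothetical query: "analysis", "tokens", "surface" and "lemma" are
# placeholder names, not the API's documented schema.
QUERY = """
query Analyse($text: String!) {
  analysis(text: $text) {
    tokens {
      surface
      lemma
    }
  }
}
"""


def build_request_body(text: str) -> str:
    """Serialise a GraphQL request as the JSON body of an HTTP POST."""
    payload = {"query": QUERY, "variables": {"text": text}}
    # ensure_ascii=False keeps the Japanese text readable in the body.
    return json.dumps(payload, ensure_ascii=False)


body = build_request_body("日本語を読む")
```

The resulting string would be sent as the body of a POST request with a `Content-Type: application/json` header.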

It is free.

I hope you enjoy it, and don’t hesitate to share! :slight_smile:

15 Likes

May I suggest using UNIDIC instead of IPADIC?

4 Likes

Could you please elaborate? I do not know Japanese myself…

The morphological analyser I use is Kuromoji (a JavaScript port of it, actually) and I just used the default settings.

Morphological analyzers like Kuromoji use pre-made dictionaries to get the readings of kanji.

IPADIC is the default one because it’s much smaller. However, this means it doesn’t contain a lot of common vocabulary. Even ひとり is not there.

In contrast, UNIDIC is much larger and contains more common vocabulary. You would probably need to follow the links here, download the dictionary file, and change the Maven settings in your Kuromoji project.

For the JavaScript port, iirc only the IPADIC version is supported. If you are using Node, you might want to try out the JS port of MeCab.

3 Likes

There are no JavaScript ports of MeCab, unfortunately. What you are referring to is just a wrapper: you need to have the MeCab binaries installed for it to work. As far as I know, this makes it unsuitable for deployment on Google Cloud, which is what I use right now.

But Google Cloud Functions can be written in Java as well (I did it in JS), so I may use the original Kuromoji with UNIDIC, which might be faster than the JS port as well.

1 Like

OK, it uses UNIDIC now. I can’t judge for myself whether it is any better because I don’t know enough Japanese.

The website actually uses the GraphQL API now. Anything you see in your browser is extracted from a single endpoint that is free for anyone and can be used to build any similar tool. The playground is here: https://jpaste.me/graphiql, or on RapidAPI. Here are a few simple queries:

  • Ten most frequent kanji with their meanings, readings (with statistics) and the three most frequent words: query
  • Ten most frequent words with frequencies, translations, example sentences and a calligraphy image (SVG in base64) for each character: query

Everything is located on a much better server than before (I quit Google Cloud and rented a small VPS in Germany) so it’s faster (at least from here in Europe).

I hope anyone can make good use of any of this.

Now I’d like to build a Chrome extension that does basically the same thing as the website but on any web page. Such extensions already exist, but maybe some people will prefer mine…we’ll see :slight_smile:

4 Likes

The coverage is definitely better now! 一人 and 牛丼 are now working!

I prefer the look of your popup over existing ones like Yomichan, Tenten, and Rikaichamp. Rikaichamp is the ugliest of them all, and images are still wonky in Yomichan. I like that Japanese.io has example sentences, but Japanese.io requires you to log in and use their platform. I prefer Anki and Yomichan instead.

Just a side note, if you decide to make one: for the parsing you will want to use a method called “forward scanning”. This is so you can have multiple entries when you scan a word, so when someone scans 牛丼, they can also see the definitions of 牛 and 丼.

Forward scanning requires a deinflector. You can check the deinflector implementations in Yomichan and Rikaichamp for reference. A deinflector encodes the inflection rules and patterns of Japanese so that all possible dictionary entries can be extracted from one phrase. Here is an example of a deinflect file, and another example used in Rikaichan.
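To make the idea concrete, here is a rough sketch of forward scanning with a toy deinflector. The dictionary and the three rules are invented for illustration; this is not how Yomichan or Rikaichamp are actually implemented, and a real deinflector has many more rules plus checks on word type.

```python
# Toy dictionary and rules, invented for illustration only.
DICTIONARY = {
    "牛": "cow",
    "丼": "rice bowl",
    "牛丼": "beef bowl",
    "始まる": "to begin",
}

# (inflected ending, dictionary-form ending). Godan verbs ending in
# る, う or つ all form their past tense with った.
DEINFLECTION_RULES = [
    ("った", "る"),
    ("った", "う"),
    ("った", "つ"),
]


def deinflect(word):
    """Return the word itself plus every candidate dictionary form."""
    candidates = [word]
    for ending, replacement in DEINFLECTION_RULES:
        if word.endswith(ending) and len(word) > len(ending):
            candidates.append(word[: -len(ending)] + replacement)
    return candidates


def scan_forward(text, start, max_length=8):
    """Look up every chunk of text beginning at `start`, longest first.

    Returns a list of (matched text, dictionary form, definition).
    """
    matches = []
    end = min(len(text), start + max_length)
    for stop in range(end, start, -1):
        chunk = text[start:stop]
        for candidate in deinflect(chunk):
            if candidate in DICTIONARY:
                matches.append((chunk, candidate, DICTIONARY[candidate]))
    return matches
```

With this toy data, `scan_forward("牛丼を食べた", 0)` finds both 牛丼 and 牛, and scanning from index 1 finds 丼. `scan_forward("始まった", 0)` finds 始まる through the first rule.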

1 Like

Oof, not very phone friendly.

I just tried a random sentence (literally the next one in the book I was reading), and only a few things turned blue. Also, why does the furigana of 始まった include まっ (and why is the つ in katakana? :sweat_smile: )?

3 Likes

Thank you for your comment. I’ll definitely look into forward parsing :slight_smile:

1 Like

I don’t intend to make this usable on phones for now.

Can you paste the sentence above as text, rather than as an image, so I can try it out myself and look into what is happening? Thank you :slight_smile:

まだ夜が明け切らぬうちに箱船のパーティーは始まった。

1 Like

If you just made it possible to scroll around, it would be fine :sweat_smile:

1 Like

Been testing it again today. Could you override the mistakes for common words like 私(わたし) and 女(おんな)?

1 Like

Speaking of which, the bugs I reported 2 months ago (failed to parse 明け, well 明け切る technically, katakana ツ as furigana for the hiragana つ) are still here :sweat_smile:
Ah, and the dictionary doesn’t know words like 箱船 and パーティー. I can look them up if I want to, but that kinda defeats the purpose of the tool :sweat_smile:
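In case it helps with the furigana issue: below is a small sketch of one way to post-process a token, assuming the analyser hands back the surface form and a katakana reading (I believe that is what Kuromoji's dictionaries typically do, but I have not checked how the site handles it). The function names are made up. It converts the reading to hiragana and strips the trailing kana that the surface and reading share, so the furigana only covers the kanji part. It only handles trailing okurigana, so something like 明け切ら would still need more work.

```python
def katakana_to_hiragana(text):
    """Shift katakana (ァ..ヶ) down to the matching hiragana code points."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )


def furigana_for(surface, reading):
    """Split a token into (stem, furigana for the stem, trailing kana).

    Assumes `reading` is the kana reading of the whole surface form.
    """
    reading = katakana_to_hiragana(reading)
    surface_hira = katakana_to_hiragana(surface)
    # Count the trailing characters shared by surface and reading.
    shared = 0
    limit = min(len(surface), len(reading))
    while shared < limit and surface_hira[-1 - shared] == reading[-1 - shared]:
        shared += 1
    if shared == 0:
        return surface, reading, ""
    return surface[:-shared], reading[:-shared], surface[-shared:]
```

For 始まった with the reading ハジマッタ, this gives the stem 始 with furigana はじ and the trailing kana まった. A kana-only token such as うち ends up with an empty stem and no furigana.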

2 Likes

Sorry, I’ve been lazy lately. I’ll post in this thread whenever the bugs are fixed (if they can actually be fixed).

1 Like