Reader API

A reader library for languages with a few irregularities.

PEG-based deconjugator and grammar recognition tool. Todo list:

  • E-J (offline), J-J online word lookup
  • furigana and a part-of-speech tag for each word, provided by text segmentation libraries
  • applied conjugation rules; other reader tools display this in a pop-up dictionary
  • URL links to online lower-intermediate grammar resources
  • support *.txt files

Installation

  • download the reader.jar file from the shared Mega folder, then double-click it or run it from the command line:
    • java -jar reader.jar
  • requires no special permissions; the program should have limited functionality without an internet connection
  • open the browser and go to http://localhost:4567/

Download link (proof of concept)
Please do not reupload the content to public repositories.

How is it different from MeCab?

  • the tokenizer can be used on client or server side, that’s up to you
  • improved deconjugator
  • import new words / phrases / proper names from a json file, you don’t have to rely on a fixed word list
  • direct web links to up to 300 grammar points, one click away from the word list

FAQ

How does it differ from other reader tools?

  • It started as a library, now it has a graphical interface. Anyone can develop their own UI or spaced repetition algorithm.

Is it for everyone?

  • The reader is for beginner and lower-intermediate language learners. Fluent speakers and translators will not get any use out of it.

Why now?

  • I had more time during the pandemic. Competing products have had delay after delay.
    I’m a slow learner, and I had to use different apps to keep my language library in one place, but so far none of them were connected.
    I cannot spend enough time on developing a UI, since releasing a web/phone app is an unrealistic goal.

Disclaimer:

  • This software helps you look up words and grammar from textbook examples. It can help you find information faster in your daily studies. The algorithm cannot help you read native materials, blogs, or visual novels with zero prior knowledge.

Sorry, but I’m a little confused… xD Don’t existing tools (like yomichan and Kitsun’s reader) already do something like this? Even Jisho parses sentences.


Afaik they don’t. They can recognize words and conjugated verbs, and that’s it.
This API verifies the sentences against grammar rules, then returns all the information about the sentence. These sentences can later be categorized by grammar point and word difficulty, so you can find related sentences in lower-intermediate resources.
Sorry for the confusing parts XD


I like the idea and I’d be really interested in seeing how you’d detect the grammar correctly.

I previously gave it a few shots to implement a grammar detection feature, but it becomes really difficult to recognise some of the more advanced grammar points. Like you’d have to have a list of all possible inflections and conjugation combinations for a grammar point and some are split up through one or more sentences (like they have words in between). Those cases dashed my hopes rather quickly.

I think that if you have a lot of knowledge about grammar and a lot of time, you could probably make it work though!

Regarding the above comments: Kitsun does break down verb conjugations in detail


Regarding the above comments: Kitsun does break down verb conjugations in detail

It does, kudos for the nice UI, but why stop there :slight_smile: We can do it!

Like you’d have to have a list of all possible inflections and conjugation combinations for a grammar point and some are split up through one or more sentences (like they have words in between).

That’s how it works. I don’t have to write the code for the grammar detection library; there is a “grammar formula” which generates the source code. The generated algorithm tags the words and grammar points in the sentence, and then I can convert these into human-friendly API output.

Those cases dashed my hopes rather quickly.

It took me a few days to figure out how to optimize the lookup to make it work with unlimited vocab. That was my first obstacle.

I think that if you have … a lot of time

I don’t need to code, that’s how I save time. It’s faster to write a formal grammar for computers than to write code.

Thank you for the feedback.


I was thinking that too. Sentences in language can literally be defined by a formal grammar. Granted, the rules can be quite complex, but it still seems simpler to define it as a ruleset than using more complicated programming methods.


Good luck writing all possible variations of informal speech, though. :frowning:

No one said it had to be perfect. :stuck_out_tongue:


Colloquial Japanese is not fully supported, unless you can help me out with related articles and other resources. I need all the help I can get. :upside_down_face:


Not fully supported is fine with the proper disclaimer, as far as I am concerned.
I do not know of any specific resource for informal language. It’s more something that I learned through experience.
To improve the API on that, I would just test it on something heavy with slang (e.g. a 男性向け light novel), then fix the problems iteratively :woman_shrugging:


What am I supposed to look at? I don’t see anything in the original post other than an empty details section. :confused:

I thought you had already seen the opening post. I took it down because I didn’t expect further interest in the project. Here is an update: there is a front-end tool in development that could possibly recognize grammar in textbook example sentences (this would be the first goal). There will also be a Reader API, which would provide a back end for any closed-source project or language learning site.
The software finds and provides links to the grammar points found in the sentence.


I will need beta testers later. The tool can help lower-intermediate learners understand the grammar in any simple sentence taken from a textbook. It should recognize not only the conjugation forms, but the grammar patterns as well. I am compiling lower-intermediate Japanese grammar into a ruleset which can be used by a computer algorithm.
There are only 200 sentences in the test system; I need to parse 3000 examples before I can release the source code to everyone.

If you have suggestions or any past experience with similar tools, I’m looking forward to hearing about it. Meanwhile, I’m trying to overcome the challenges by myself :slightly_smiling_face:


I have close to 5700 texts on Yomi.ai that could use this. :slight_smile:
Will keep an eye on this, and please keep me posted.

I need a bug tracking page to support those titles. I don’t know how many of you have time to fill out forms, but without categorization, software bugs remain undetected. It would probably be better (months later) if you could integrate a similar modal window on your site and provide a link to the CSV export.
image

It would take a week to generate an automated report for 5700 texts, and it would have to be redone every time I update the algorithm. The test cases only include sentences from grammar books, which helps me decide (without delay) whether there is a collision between the old rules and a new one.

Progress update:

The algorithm doesn’t assign weight (score points) to words, because it’s not based on probabilities. I should prioritize hiragana-only words (ちょっと) over particles and other suffixes.

In this example the grammar would be:
Noun + Particle + (?= Adverb or Adjective or Verb ), where the (?= ) expression is a positive lookahead. I will write a function to cover these cases, unless someone has a better idea.
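
To make the lookahead idea concrete, here is a minimal Python sketch. It is not the project's code: the token format, the tag names and the function `match_noun_particle` are all assumptions made for illustration. The point is only that the third token is checked but not consumed.

```python
from typing import List, Optional, Tuple

# (surface form, part-of-speech tag); the tag names are assumptions
Token = Tuple[str, str]

# Tags accepted by the lookahead part of the pattern
LOOKAHEAD_TAGS = {"adverb", "adjective", "verb"}


def match_noun_particle(tokens: List[Token], i: int) -> Optional[int]:
    """Match Noun + Particle at index i, only if the following token is an
    adverb, adjective or verb. The following token is inspected but not
    consumed. Returns the index just after the particle, or None."""
    if i + 2 >= len(tokens):
        return None  # not enough tokens left for Noun, Particle and lookahead
    noun, particle, following = tokens[i], tokens[i + 1], tokens[i + 2]
    if noun[1] != "noun" or particle[1] != "particle":
        return None
    if following[1] not in LOOKAHEAD_TAGS:
        return None  # lookahead failed, so the whole pattern fails
    return i + 2


sentence = [("水", "noun"), ("を", "particle"), ("ちょっと", "adverb"), ("飲む", "verb")]
other = [("水", "noun"), ("と", "particle"), ("お茶", "noun")]
print(match_noun_particle(sentence, 0))  # index after the particle, or None
print(match_noun_particle(other, 0))
```

Because the returned index points at the lookahead token itself, the next pattern can still start matching at ちょっと.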


August update

I’ve been struggling with the grammar definitions for months. It was doable, but the requirements were missing:

  • A Japanese text corpus; it would take a year to build even a simple one.
  • Correct grammar definitions and example sentences to verify their correctness.

This struggle helped me understand the Japanese language better than ever before. I decided to simplify the program, and with the knowledge I’ve gained over the past months I rewrote the scripts in a few days.

It should cover the grammar from the JFZ 1-5 textbooks, although none of the grammar points are linked to the books’ chapters.

The software:

  • has a few hundred example sentences. With the improved algorithm I can add a thousand sentences in days rather than weeks.
  • the adjective and verb deconjugator has a 100 LOC ruleset (inspired by Japanese Computational Lexicon: A Computational Dictionary Of Japanese Verb Forms)
  • the compressed size is 200KB (excluding the dictionary files)
  • the algorithm behind the deconjugator and pattern recognition is four times faster than it was **
  • grammar recognition is a multistage process: the first layer provides hyperlinks to the recognized patterns, while the other layers provide additional information at a performance cost (see the description under the screenshot)
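
For readers unfamiliar with rule-based deconjugation, the following Python sketch shows the general suffix-rewrite technique. It is an illustration only, not the project's ruleset: the four rules, the labels and the `deconjugate` function are assumptions, and a real ruleset covers far more forms.

```python
from typing import List, Set, Tuple

# (inflected ending, dictionary-form ending, label)
# An illustrative subset, not the rules used by the reader.
RULES: List[Tuple[str, str, str]] = [
    ("ました", "る", "polite past (ichidan verb)"),
    ("ない", "る", "negative (ichidan verb)"),
    ("くない", "い", "negative (i-adjective)"),
    ("かった", "い", "past (i-adjective)"),
]


def deconjugate(word: str, dictionary: Set[str]) -> List[Tuple[str, str]]:
    """Return (dictionary form, applied rule) pairs for an inflected word.
    A candidate is kept only if it exists in the dictionary, which filters
    out rewrites that merely look plausible."""
    results = []
    for ending, base_ending, label in RULES:
        if word.endswith(ending) and len(word) > len(ending):
            candidate = word[: -len(ending)] + base_ending
            if candidate in dictionary:
                results.append((candidate, label))
    return results


dictionary = {"食べる", "高い"}
for word in ("食べました", "高くない", "高かった"):
    print(word, deconjugate(word, dictionary))
```

The dictionary check matters here: 高くない also ends in ない, but the ichidan rule would produce 高くる, which is not a word and is therefore discarded.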

The grammar file is not a dataset, but a generated computer algorithm, a chain of boolean expressions. The algorithm scans the text to detect known words.
** Finding the inflected form and the position of a word (or confirming its absence) is a resource-intensive task, so every failed attempt is stored in short-term memory. Unmatched patterns are evaluated only once at each character position; the algorithm won’t repeat that pointless work.
Memory reset occurs after each sentence.
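
The failure memory described above can be sketched in Python roughly as follows. This is a guess at the general technique (caching failed rule/position pairs, similar to packrat parsing), not the generated algorithm itself; `FailureMemo`, `literal` and the rule names are hypothetical.

```python
from typing import Callable, Optional, Set, Tuple

# A rule takes (text, position) and returns the end position or None
Rule = Callable[[str, int], Optional[int]]


def literal(word: str) -> Rule:
    """Build a rule that matches a fixed word at the given position."""
    def rule(text: str, pos: int) -> Optional[int]:
        return pos + len(word) if text.startswith(word, pos) else None
    return rule


class FailureMemo:
    """Remembers which (rule name, position) pairs already failed
    in the current sentence, so they are not evaluated again."""

    def __init__(self) -> None:
        self._failed: Set[Tuple[str, int]] = set()
        self.evaluations = 0  # counts real rule evaluations, for illustration

    def try_rule(self, name: str, rule: Rule, text: str, pos: int) -> Optional[int]:
        key = (name, pos)
        if key in self._failed:
            return None  # known failure, skip the expensive work
        self.evaluations += 1
        end = rule(text, pos)
        if end is None:
            self._failed.add(key)
        return end

    def reset(self) -> None:
        """Call after each sentence; positions mean nothing in the next one."""
        self._failed.clear()


memo = FailureMemo()
neko = literal("ねこ")
memo.try_rule("neko", neko, "いぬがいる", 0)  # evaluated, fails
memo.try_rule("neko", neko, "いぬがいる", 0)  # answered from memory
print(memo.evaluations)
memo.reset()
print(memo.try_rule("neko", neko, "ねこがいる", 0))
```

The cache key does not include the sentence text, which is why the reset after each sentence is required: without it, a failure at position 0 in one sentence would wrongly block the same rule at position 0 in the next.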

Before (left), after (right):

The previous version highlighted the Noun + ni + Ageru pattern. The simplified script cannot do this anymore.
image

The previous version recognized the Noun + ya + Noun pattern, while the simplified version cannot process these particles.

With enough time, I could bring back these features.

Did you mean to post this in the top post? It is empty now. It makes it difficult to understand what this thread is about.

Learn new words and grammar from context. This is how to use the “demo.html” file:

  1. Copy files from the download link
  2. Execute reader.exe (includes a proxy server to access online dictionaries)
  3. Find a novel on estar.jp
  4. Replace the domain name with localhost, for example: http://127.0.0.1:8000/novels/25637723/viewer?page=4
  5. Include this in your own webapp with better UI

Alternative method using the “index.html” file, which doesn’t need the executable files:

  1. Copy files from the download link
  2. Open command line, change the current directory
  3. npm install -g local-web-server
  4. ws
  5. Browse http://127.0.0.1:8000/


I’ll upload the reader.jar file later. Uploaded, see the OP above.
Here are the screenshots; they show a Wikipedia article.

Vocabulary list for each sentence (common particles are excluded):

Monolingual dictionary:

English translation:

Kanji info, sorted by reading:

Grammar reference:
image


One more thing. Here is the plan for the next few weeks:

  • parse short text files from any source. Done
  • the reader library has web links for 300 grammar rules, I plan to add another 200 rules. Done

Long term plans:

  • Find the same word or grammar point in your text files. If you download and keep only the articles you’re familiar with, you can refer to them later, and the software finds those references for you.
  • Include more test cases.
  • Don’t break the multi-platform support.

Update (a week later):

Added a thousand sentences for the test cases.

I rewrote part of the deconjugator and the text segmentation; it now supports custom vocabulary.
The reader tool can open and read Unicode plain text files in your file system; there is a popup dialog to choose a text file.

I have another idea, but I have to write tests before I publish the update. Suppose you have two files in your selected directory:
novel.txt
novel.json

The software could search for a *.json file. It contains rare / unrecognized words from the novel:

[ { "word": "そうすれば", "pos": [ "conj" ], "reading": "" } ]

Edit the *.json file (or download it from somewhere); after a page reload the software should recognize these new words.
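
As a rough illustration of the idea, the lookup of a companion file could work like the Python sketch below. The function name `load_custom_vocabulary` and the "same name, .json extension" convention are assumptions based on the novel.txt / novel.json example; the reader itself may do this differently.

```python
import json
from pathlib import Path
from typing import Dict


def load_custom_vocabulary(text_path: Path) -> Dict[str, dict]:
    """Look for a .json file next to the text file (novel.txt -> novel.json)
    and return its entries keyed by word. Returns an empty dict when there
    is no companion file."""
    json_path = text_path.with_suffix(".json")
    if not json_path.exists():
        return {}
    with json_path.open(encoding="utf-8") as handle:
        entries = json.load(handle)
    # Skip entries without a "word" field instead of failing on them
    return {entry["word"]: entry for entry in entries if entry.get("word")}
```

Calling `load_custom_vocabulary(Path("novel.txt"))` would then return the entry for そうすれば from the example above, ready to be merged into the word list before segmentation.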

