It started as a library; now it has a graphical interface. Anyone can develop their own UI or spaced repetition algorithm on top of it.
Is it for everyone?
The reader is for beginner and lower intermediate language learners. Fluent speakers and translators won't get much use out of it.
Why now?
I had more time during the pandemic. Competing products have delay after delay.
I'm a slow learner, and I had to use different apps to keep my language library in one place. But so far, none of them were connected. I can't spend enough time on developing a UI, since releasing a web/phone app is an unrealistic goal.
Disclaimer:
This software helps you look up words and grammar from textbook examples. It can help you find the information faster in your daily studies. The algorithm cannot help you read native materials, blogs, or visual novels with zero knowledge.
Sorry, but I'm a little confused… xD Don't existing tools (like yomichan and Kitsun's reader) already do something like this? Even Jisho parses sentences.
Afaik they don't. They can recognize words, conjugated verbs, and that's it.
This API verifies the sentences by grammar rules, then it returns all the information about the sentence. These sentences can later be categorized by grammar points and word difficulty, so you can find related sentences in lower intermediate resources.
Sorry for the confusing parts XD
I like the idea and I'd be really interested in seeing how you'd detect the grammar correctly.
I previously gave it a few shots to implement a grammar detection feature, but it became really difficult to recognise some of the more advanced grammar points. Like you'd have to have a list of all possible inflections and conjugation combinations for a grammar point, and some are split up across one or more sentences (like they have words in between). Those cases dashed my hopes rather quickly.
I think that if you have a lot of knowledge about grammar and a lot of time, you could probably make it work though!
Regarding the above comments: Kitsun does break down verb conjugations in detail
It does, kudos for the nice UI, but why stop there? We can do it!
Like you'd have to have a list of all possible inflections and conjugation combinations for a grammar point and some are split up through one or more sentences (like they have words in between).
That's how it works. I don't have to write the code for the grammar detection library; there is a "grammar formula", which generates the source code. The generated algorithm tags the words and grammar points in the sentence, then I can convert these to a human-friendly API output.
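The project's actual formula syntax isn't shown in the thread, but the "rule compiled into code" idea can be sketched like this (rule format, tag names, and example tokens are all invented for illustration):

```python
# Hypothetical sketch: a declarative "grammar formula" compiled into a matcher
# function, instead of hand-writing detection code for every pattern.

def compile_rule(sequence):
    """Turn a list of required POS tags into a matcher function."""
    def matcher(tagged_tokens, start):
        # tagged_tokens: list of (word, pos) pairs
        if start + len(sequence) > len(tagged_tokens):
            return None
        for offset, expected_pos in enumerate(sequence):
            if tagged_tokens[start + offset][1] != expected_pos:
                return None
        return start + len(sequence)  # position just after the match
    return matcher

# "Formula" for a simple pattern: Noun + Particle + Verb
noun_particle_verb = compile_rule(["noun", "particle", "verb"])

tokens = [("本", "noun"), ("を", "particle"), ("読む", "verb")]
print(noun_particle_verb(tokens, 0))  # → 3 (matched; next position)
```

The point is that adding a new grammar point only means adding a new formula, not new code paths.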
Those cases dashed my hopes rather quickly.
It took me a few days to figure out how to optimize the lookup to make it work with unlimited vocab. That was my first obstacle.
I think that if you have … a lot of time
I don't need to code; that's how I save time. It's faster to write a formal grammar for computers than to write code.
I was thinking that too. Sentences in a language can literally be defined by a formal grammar. Granted, the rules can be quite complex, but it still seems simpler to define them as a ruleset than to use more complicated programming methods.
"Not fully supported" is fine with the proper disclaimer, as far as I'm concerned.
I do not know of any specific resource for informal language. It's more something that I learned through experience.
To improve the API on that, I would just test it on something heavy with slang (e.g. a 男性向け light novel), then fix the problems iteratively.
I thought you had already seen the opening post; I took it down because I didn't expect further interest in the project. Here is an update. There is a front-end tool in development that could possibly recognize grammar in textbook example sentences (this would be the first goal). There will also be a Reader API, which would provide a back end for any closed-source project / language learning site.
The software finds and provides links to the grammar points found in the sentence.
I need beta testers later. It can help lower intermediate learners to understand the grammar in any simple sentence taken from a textbook. It should recognize not only the conjugation forms, but the grammar patterns as well. I compile the lower intermediate Japanese grammar into a ruleset, which can be used by a computer algorithm.
There are only 200 sentences in the test system, I need to parse 3000 examples before I can release the source code to everyone.
If you have suggestions, or any past experience with similar tools, I'm looking forward to hearing about it. Meanwhile, I'm trying to overcome the challenges by myself.
I need a bug-tracking page to support those titles. I don't know how many of you have time to fill out forms, but without categorization, software bugs remain undetected. It would probably be better (months later) if you could integrate a similar modal window on your site and provide a link to the CSV export.
It would take a week to generate an automated report for the 5700 texts, and it has to be done every time I update the algorithm. The test cases only include sentences from grammar books, which helps me decide (without delay) whether a new rule collides with the old ones.
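The regression idea described above can be sketched as follows. The `parse()` stand-in, the rule names, and the stored expectations are invented here; the real project parses with its generated ruleset instead of substring checks:

```python
# Sketch: after changing the ruleset, re-run every known sentence and flag
# any whose analysis differs from the stored expectation (a "collision").

def parse(sentence, rules):
    """Toy stand-in for the real parser: names of rules that match."""
    return {name for name, pattern in rules.items() if pattern in sentence}

old_rules = {"te-form": "て"}
new_rules = {"te-form": "て", "te-iru": "ている"}  # newly added rule

test_cases = {
    "パンを食べている": {"te-form"},  # expected result under the old rules
    "手紙を書いて": {"te-form"},
}

collisions = {
    s: parse(s, new_rules)
    for s, expected in test_cases.items()
    if parse(s, new_rules) != expected
}
print(collisions)  # sentences whose analysis changed under the new rule
```

Here the new ている rule changes the analysis of the first sentence, so it is reported immediately instead of surfacing a week later in a full report.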
Progress update:
The algorithm doesn't assign weights (score points) to words, because it's not based on probabilities. I should prioritize hiragana-only words (ちょっと) over particles and other suffixes.
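One probability-free way to get that prioritization is plain longest-match-first segmentation, so ちょっと wins over reading its final と as a particle. This is only a sketch of that idea; the tiny dictionary and the tie-breaking order are assumptions, not the project's actual method:

```python
# Greedy longest-match segmentation: at each position, prefer the longest
# known word; particles only win when no longer word fits.

WORDS = {"ちょっと", "待って"}          # hiragana-only content words
PARTICLES = {"と", "って", "を", "は"}  # particles / suffixes

def segment(text):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones, then a single char.
        for length in range(len(text) - i, 0, -1):
            chunk = text[i:i + length]
            if chunk in WORDS or chunk in PARTICLES:
                tokens.append(chunk)
                i += length
                break
        else:
            tokens.append(text[i])  # unknown character, emit as-is
            i += 1
    return tokens

print(segment("ちょっと待って"))  # → ['ちょっと', '待って']
```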
In this example the grammar would be:
Noun + Particle + (?= Adverb or Adjective or Verb ), where the (?= ) expression is a positive lookahead. I will write a function to cover these cases, unless someone has a better idea.
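For what it's worth, the rule above can run as a real regex lookahead if each token is reduced to a one-letter POS code. The encoding scheme and the example tokens here are invented for illustration:

```python
# Encode POS tags as single letters so the pattern works as a plain regex
# with a genuine positive lookahead.
import re

POS_CODE = {"noun": "N", "particle": "P", "adverb": "D",
            "adjective": "A", "verb": "V"}

def encode(tagged):
    return "".join(POS_CODE[pos] for _, pos in tagged)

# Noun + Particle + (?= Adverb or Adjective or Verb )
RULE = re.compile(r"NP(?=[DAV])")

tokens = [("猫", "noun"), ("が", "particle"), ("走る", "verb")]
match = RULE.search(encode(tokens))
print(bool(match))  # → True: the noun-particle pair is followed by a verb
```

The lookahead consumes nothing, so the following verb stays available for the next rule to match.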
I've been struggling with the grammar definitions for months. It was doable, but the requirements were missing:
A Japanese text corpus; it would take a year to build even a simple one.
Correct grammar definitions and example sentences to verify their correctness.
This struggle helped me understand the Japanese language more than ever before. I decided to simplify the program, and with the knowledge I've gained over the past months I rewrote the scripts in a few days.
It should cover the grammar from the JFZ 1-5 textbooks, though none of the grammar points are linked to the books' chapters.
The software:
has a few hundred example sentences. With the improved algorithm I can add a thousand sentences in days, rather than weeks.
the adjective and verb deconjugator has a ~100 LOC ruleset (inspired by the Japanese Computational Lexicon: A Computational Dictionary of Japanese Verb Forms)
compressed size is 200KB (excluding the dictionary files)
the algorithm behind the deconjugator and pattern recognition is four times faster than it was **
grammar recognition is a multistage process: the first layer provides hyperlinks to the recognized patterns; further layers provide additional information at a performance penalty (see the description under the screenshot)
The grammar file is not a dataset but a generated computer algorithm: a chain of boolean expressions. The algorithm scans the text to detect known words.
** Finding the inflected form and position of occurrence (or the absence of a word) is a resource-intensive task, so every failed attempt is stored in a short-term memory. Unmatched patterns are evaluated only once at each character position; the algorithm won't repeat these pointless tasks.
Memory reset occurs after each sentence.
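The failed-attempt cache described above amounts to memoizing (pattern, position) pairs and clearing the memo at sentence boundaries. A minimal sketch, assuming a stand-in `try_match()` for the real, expensive matcher:

```python
# Short-term memo for failed matches: each (pattern, position) pair is only
# evaluated once per sentence, then the memo is reset.

failed = set()  # (pattern, position) pairs that already failed
calls = 0       # counts how often the expensive matcher actually runs

def try_match(pattern, sentence, pos):
    global calls
    calls += 1  # stand-in for the resource-intensive lookup
    return sentence.startswith(pattern, pos)

def match_once(pattern, sentence, pos):
    key = (pattern, pos)
    if key in failed:          # already failed here: skip the real work
        return False
    if try_match(pattern, sentence, pos):
        return True
    failed.add(key)
    return False

sentence = "今日はちょっと寒い"
match_once("ている", sentence, 0)  # expensive attempt; fails and is cached
match_once("ている", sentence, 0)  # answered from the cache
print(calls)    # → 1: the second attempt never reached the matcher
failed.clear()  # "memory reset" after each sentence
```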
One more thing. Here is the plan for the next few weeks:
parse short text files from any source. Done
the reader library has web links for 300 grammar rules; I plan to add another 200. Done
Long term plans:
Find the same word or grammar in your text files. If you download and keep only those articles you're familiar with, you can refer to them later; the software finds those references for you.
Include more test cases.
Don't break the multi-platform support.
Update (a week later):
Added a thousand sentences for the test cases.
I rewrote part of the deconjugator and the text segmentation; it now supports custom vocabulary.
The reader tool can open and read Unicode plain-text files on your file system; there is a popup dialog to choose a text file.
I have another idea, but I have to write tests before I publish the update. Suppose you have two files in your selected directory:
novel.txt
novel.json
The software could search for a *.json file. It contains rare / unrecognized words from the novel:
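The lookup side of that idea is simple: derive the sidecar path from the text file's path and load it if it exists. The JSON schema in the comment is invented, not the project's actual format:

```python
# Sketch: given novel.txt, look for a novel.json sidecar with extra vocabulary.
import json
import os

def load_sidecar(text_path):
    """Load the *.json file sitting next to a text file, if any."""
    base, _ = os.path.splitext(text_path)
    sidecar = base + ".json"
    if os.path.exists(sidecar):
        with open(sidecar, encoding="utf-8") as f:
            return json.load(f)  # e.g. {"駄弁る": "to chatter"} (invented schema)
    return {}  # no sidecar: fall back to the built-in dictionary alone
```

The deconjugator's custom-vocabulary support mentioned above would then merge these entries into the lookup before parsing the novel.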