Lapis app: a segmentation engine that understands grammar, the best dictionary lookups, SRS designed for language learning

Hey, thanks!

While WK and Lapis have nothing in common (Lapis doesn’t solve the Kanji problem), Lapis and Bunpro do overlap (grammar reference and SRS). So I don’t think I’m comfortable advertising it there, out of respect for Bunpro :sweat_smile:

Also, we chose to talk about it only in the WK community right now to limit the number of users as it’s still in beta. After we get a lot more confident with it we’ll be widening that area.

4 Likes

I made something similar a while back, so I’m glad someone has built a more mature version.


I had a hard time with the available open-source deconjugation libraries, so I had to make my own, but yours provides more detailed information. The grammar detail for “に対して” is impressive, but for other example sentences, such as “アメリカに留学していただけのことはある。”, both the grammar lookup and the parser return incomplete results.
[image]
Since there is no way to know every word, I wish I could look up words and segment text inside J-J definitions without interrupting the workflow; for me that’s a deal-breaker before I use this webpage.

I cannot import books on the site, but with the SRS tags, that’s ok for now. Well done.

2 Likes

Looks great!

I haven’t tried it out myself yet, but I do have a question. Since each sentence is segmented into words/grammar, does Lapis track which words and grammar you know? If so, does it have the capability of selecting/suggesting “i+1” sentences from a database (user created or not) for study?

If it doesn’t yet, such a feature would, for me personally, take Lapis to a level far beyond any other software I’m currently aware of.

4 Likes

There’s an upcoming major update to the engine that will fix most of those issues. (It’s harder for some sentences because we detect all kinds of complex grammar patterns, so this leads to a lot of complexity on the engine layer.)

Because we want rich grammar detection and not just word segments, for the example sentence you gave it’s a matter of registering the だけのことはある grammar point, after which it’ll be detected correctly. We plan to focus on registering new grammar right after we finish the segmentation engine update.
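For the curious, registering a grammar point can be pictured as adding a token pattern the engine scans for. This is only a toy sketch under my own assumptions, not Lapis’s actual implementation:

```python
def find_grammar(tokens, patterns):
    """Scan a token list for registered grammar patterns.

    `patterns` maps a grammar point's name to its token sequence.
    Longer patterns are tried first so they win over shorter overlaps.
    Returns (name, start_index) pairs.
    """
    hits = []
    for name, pat in sorted(patterns.items(), key=lambda kv: -len(kv[1])):
        for i in range(len(tokens) - len(pat) + 1):
            if tokens[i:i + len(pat)] == pat:
                hits.append((name, i))
    return hits
```

With だけのことはある registered as `["だけ", "の", "こと", "は", "ある"]`, the tokens of the example sentence would match right after 留学していた.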

You can, if you select text and right click:
[image]

It’ll show you the result without leaving your current page. Does this solve what you have in mind?

There are thoughts around this subject (i+1 taking into account words and grammar the user knows). It’s definitely something we want to look into if the project lives on.

2 Likes

Neat! Thanks for the reply!

For when that discussion comes to the forefront, the use case I’m personally envisioning for such a feature is something along the lines of:

I provide Lapis with a transcript of an episode I just watched. Using all the words and grammar in sentences already in the SRS system above a certain level (and/or manually marked as “learned”), Lapis adds any sentences that have new words/grammar to some “To learn” queue. This queue has options for being sorted chronologically, by the +1 “unlocking” the most other i+1 sentences, etc.
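In case it helps the discussion, here’s a rough sketch of that queue in Python (all names are hypothetical; `known` would come from the SRS, and the segmenter would supply each sentence’s word/grammar set):

```python
from collections import Counter

def i_plus_one_queue(sentences, known):
    """Return sentences containing exactly one unknown item (word or grammar).

    `sentences` is a list of (sentence, items) pairs, where `items` is the set
    of words/grammar the segmenter found in the sentence; `known` is the set
    already learned (above an SRS threshold and/or marked manually).
    """
    queue = []
    for sentence, items in sentences:
        unknown = items - known
        if len(unknown) == 1:
            queue.append((sentence, unknown.pop()))
    # Put first the +1 items that "unlock" the most other i+1 sentences.
    counts = Counter(item for _, item in queue)
    queue.sort(key=lambda pair: -counts[pair[1]])
    return queue
```

Sorting chronologically would just mean skipping the final sort and keeping transcript order.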

Hopefully that makes sense, and thank you for your hard work!

4 Likes

Thanks, this really helps!

I had a very similar workflow in mind too.

2 Likes

Hey everyone! We have a few updates.


The segmentation overhaul is now live; you’ll find that accuracy has improved significantly. We’re very confident in the current state of the engine, but we’re planning further improvements and a few new things. We run accuracy comparisons against MeCab (a leading open-source morphological analyzer), and we already get richer and more accurate results in many cases.

We also added some basic detection of names, to be improved upon later:


We have a more succinct representation of the grammar matching forms now:

Before:

After:


A grammar module is now available. You can use it to browse the registered constructs and grammar the engine recognizes.

You can also open this from anywhere in the app without leaving the current page if you hit Ctrl+G.
Not all grammar docs are written yet; we’ll be filling them in over time.


This is a major step, but we still have more to do. The next step is registering more and more grammar (which will implicitly improve accuracy) and filling in the grammar documentation.

Give it a try!

12 Likes

Good work!

By the way, is there any way to send feedback through the app or will there be one? Thanks!

1 Like

Have you compared MeCab with Juman++? I’ve found better results with the latter, but it also comes with a large database (over 2 GB), as it uses things like Wikipedia page titles. I know Juman++ also has a way to “train” it. I don’t know if MeCab has the same, or if that feature would be of use in the backend of your project.

I’ve been waiting for this overhaul to go live before I dive into trying it out further. Looks like I get to start playing with it soon.

3 Likes

Later, yes: there will be a way to submit feedback. For now, feel free to post any kind of feedback in this thread.
We’ll also open some Discord channels for the different modules soon (link in the OP), but for now this thread is the preferred place.

1 Like

Yeah, basically all of these use models trained on a particular dictionary. A default installation will usually get you better results with Juman++ than MeCab, since MeCab ships with the outdated (though still most popular) ipadic by default. UniDic is the most up-to-date, but it’s also huge. I haven’t played with Juman++ much, but I suspect its results are comparable to MeCab+UniDic (except maybe for wiki/name detection, which I’m not very interested in).

Actually, the Lapis engine uses a trained model (MeCab-like, with ipadic) in a very small step before the engine gets to the actual work: it produces candidate word splits, which guide the engine toward the most probable boundaries. We wouldn’t gain much from a better alternative, though, as that’s not where the actual work happens.
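The “most probable boundaries” idea can be illustrated with a toy lattice search over per-word costs, which is roughly what MeCab-style analyzers do. The dictionary and costs below are made up for the example and have nothing to do with Lapis’s actual model:

```python
def segment(text, costs):
    """Minimal Viterbi-style word splitter: pick the split with lowest total cost.

    `costs` maps known words to a cost (lower = more likely).
    Unknown single characters get a high fallback cost of 10.0.
    """
    n = len(text)
    # best[i] = (total cost, word list) for the best split of text[:i]
    best = [(0.0, [])] + [(float("inf"), [])] * n
    for i in range(n):
        cost_i, words_i = best[i]
        for j in range(i + 1, n + 1):
            word = text[i:j]
            c = costs.get(word, 10.0 if j == i + 1 else None)
            if c is None:
                continue  # unknown multi-character span: skip
            if cost_i + c < best[j][0]:
                best[j] = (cost_i + c, words_i + [word])
    return best[n][1]
```

With `{"東京": 1.0, "東": 3.0, "京都": 1.0, "都": 2.0, "に": 0.5, "行く": 1.0}`, the string 東京都に行く splits as 東京/都/に/行く, because that path has the lowest total cost.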

Great!

1 Like

I’ll have to look into this. I was unaware of it!

I have another case of a strange parsing issue in Lapis:

普段は意識にぶら下がることのない確かな重み。

Lapis seems to believe ことの is a name here, which isn’t the case:

[image]

EDIT: It also seems to happen with other particle chains ending in の:

日も明るい内からの酔っ払いである

Hm. This is because I try to match names by their kana reading. That won’t do; I’ll probably have to limit it to Katakana-only names. Thanks, will push a fix soon.
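That restriction could look something like this (a hypothetical helper for illustration, not Lapis’s actual code): only tokens made entirely of Katakana qualify as name candidates, so Hiragana particle chains like ことの are excluded.

```python
import re

# Katakana letters (U+30A1–U+30FA) plus the prolonged sound mark ー (U+30FC).
KATAKANA = re.compile(r"[\u30A1-\u30FA\u30FC]+")

def is_name_candidate(token):
    """True only for all-Katakana tokens, ruling out Hiragana chains."""
    return KATAKANA.fullmatch(token) is not None
```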

1 Like

Should be fixed now. Also fixed a regression with matching some forms (like a na-adj + な・で) that I missed (can be seen in your first example).

1 Like

I keep getting a message when I log in telling me my time zone is incorrect. Lapis provides me with a list of UTC+11 time zones to choose from, none of which are mine. I’m UTC+10 normally (I selected Sydney from the list in settings) but we are currently on summer time.

Thanks for the app! I’m still finding my way around the app, but I like what I see.

This is because the offset reported by your browser is UTC+11, usually because your computer is set to it. Can you confirm whether the timezone on your computer is correctly set to UTC+10? The timezone set on your computer should match the one in Lapis, which is why we keep prompting.

My iPad date/time settings just says “Sydney” and doesn’t give UTC, but my current TZ is definitely UTC+11 because of summer time.

This might be a problem with the source we use for current time zones and daylight saving rules; I’ll double-check. The reason we require this is that a mismatch will likely produce wrong daily schedules (the schedule for the day has to know when your day starts and ends). I’ll confirm once I’ve checked.
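The Sydney case can be reproduced with a standard timezone database (Python’s `zoneinfo` here, just to illustrate): the same zone yields UTC+11 during daylight saving and UTC+10 otherwise, so an app should compare against the zone’s offset *at the current date* rather than its base offset to avoid false mismatches.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

sydney = ZoneInfo("Australia/Sydney")

# January is summer time in Sydney: the zone's offset is UTC+11.
summer = datetime(2024, 1, 15, tzinfo=sydney).utcoffset()

# July is standard time: the same zone's offset is UTC+10.
winter = datetime(2024, 7, 15, tzinfo=sydney).utcoffset()

print(summer, winter)
```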

1 Like