That’s a good point actually - it was mostly trained on Wikipedia and Japanese websites which tend to use more formal language, so it doesn’t recognise colloquial speech as well… I could mine a whole bunch of Japanese tweets and add those to the training data as examples of casual language, but then would those tweets just have terrible grammar? Dilemma
Yeah, probably have to choose how to address that eventually. Something like ら抜き言葉 (leaving the ら out of something like 見られる and just saying 見れる when the meaning is potential form) is totally normal, but it will not be accepted on a conjugation test as correct.
Two models could be an idea, with users selecting whether they want to check more formal or more casual grammar. That could also give better performance on each type of text. An ideal model would figure out the formality from the rest of the text and judge it appropriately, but I might be limited on data and computational power to do that… hmm. Good food for thought.
I am going to guess there are more issues than this that will need to be confronted, as the Japanese Wikipedia article on grammar checkers seems to suggest it is an English thing and there are no notable Japanese grammar checkers.
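To make the formal/casual idea concrete, here is a minimal sketch of a mode switch, using ら抜き言葉 as the example. Everything in it is hypothetical: the `RA_NUKI` table, the `flag_ra_nuki` function, and the mode names are invented for illustration and are not how the site works. A real checker would be a trained model rather than a lookup.

```python
from typing import List, Tuple

# Hypothetical lookup of a few ら抜き forms and their standard counterparts.
# Dictionary forms only: conjugations such as 見れない are not handled, and
# matching on bare stems would wrongly catch valid forms like 見れば.
RA_NUKI = {
    "見れる": "見られる",
    "食べれる": "食べられる",
    "来れる": "来られる",
    "起きれる": "起きられる",
}


def flag_ra_nuki(text: str, mode: str = "formal") -> List[Tuple[str, str]]:
    """Return (found form, standard form) pairs that the chosen mode rejects."""
    if mode not in ("formal", "casual"):
        raise ValueError("mode must be 'formal' or 'casual'")
    if mode == "casual":
        # Casual mode treats ら抜き as normal usage and flags nothing.
        return []
    return [(casual, standard) for casual, standard in RA_NUKI.items() if casual in text]


print(flag_ra_nuki("この映画はネットで見れる", mode="formal"))
print(flag_ra_nuki("この映画はネットで見れる", mode="casual"))
```

The same sentence is flagged in formal mode and accepted in casual mode, which is the behaviour a user-selectable pair of models would aim for.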
Anyway, I fed it some text. And I see the dotted red line, but I also have a single solid red line under one character and I’m not sure what that is.
Also does 飛ばす here mean “skip”? I’m not an expert in computer terminology, but I feel like there’s a better word to be used there.
Yeah, grammar checking is still very much an unsolved problem in Japanese. Though note that what my site is doing is just detecting errors, which is a lot easier than also suggesting the correct grammar like you might get in MS Word if you right click. There are some research groups, like the one at Nara Institute of Science and Technology, that have been working on this for a while. Their research papers actually helped me figure out this model, but the field still has a lot of scope for improvement.
Anyway - solid red line means an error. Dotted line means the line wasn’t checked for some reason - it could be too short, too long, etc. If you mouse over or tap the dotted underlined text, it will tell you the problem.
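The solid/dotted rule can be sketched roughly as below. This is only an illustration: `underline_for` and `MIN_TOKENS` are invented names and values, the 48-token ceiling is the capacity mentioned elsewhere in this thread, and the real site marks errors on individual characters rather than whole lines.

```python
from typing import Optional, Tuple

MIN_TOKENS = 3   # hypothetical lower bound, not taken from the site
MAX_TOKENS = 48  # model capacity quoted elsewhere in the thread


def underline_for(n_tokens: int, model_flags_error: bool) -> Tuple[str, Optional[str]]:
    """Return (underline style, hover message) for one line of input."""
    if n_tokens < MIN_TOKENS:
        return "dotted", "not checked: too short"
    if n_tokens > MAX_TOKENS:
        return "dotted", "not checked: too long"
    if model_flags_error:
        return "solid", None  # the model looked at the line and found an error
    return "none", None       # checked, nothing flagged


print(underline_for(60, model_flags_error=False))
print(underline_for(20, model_flags_error=True))
```

Note that an unchecked line gets the dotted style regardless of what the model would have said, so dotted never means "error".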
I think 飛ばす is the best word for the job! A couple of my Japanese native friends thought it was too. But open to suggestions if anyone knows the nuances on this!
I would suggest picking more standard colors for things. Basically red for bad, green for good, some other color for neutral. Right now there’s purple, pink, blue, and green, so it’s not super easy to follow.
Oh I didn’t know about 小説家になろう! What a great resource! Thanks for suggesting it. And with regards to ため息 - it actually does work with 溜息, so it could be a formality issue again or just the data it’s been exposed to.
I am trying to think of what my computer uses for Ignore / Skip, but I can’t for the life of me think of something I can do in two seconds to make a confirmation box pop up with that specific option in it. I have a strong feeling it is just something like スキップ or some other katakana though.
That makes more sense about the dotted line. I didn’t realize it wasn’t checking it because it was too long; I was under the impression that that is what it thought was an “error.” But now that I know you’re not suggesting things like that, it makes more sense. Also, now I read the whole sentence it gives me. What’s interesting is where it seems to cut off, because it’s only highlighting about half of the sentence, then saying it’s too long. I do agree with the above that maybe just changing the color would help. I’m already conditioned to read a red squiggly line, or something that looks similar, as a mistake.
But now I understand most of what it is giving me. It seems to not like 伸ばし which is unsurprising. But in this sentence 令和三年四月一日から新テキストの内容を実施する。 it doesn’t like を実施する and I’m kinda at a loss as to why it thinks so.
Hmm, well if a message pops up with the Japanese for Skip on there, then please do let me know and I’ll update the site accordingly.
Yeah, it’s a bit confusing where it cuts it off. Essentially, Japanese AI models work by splitting sentences into little chunks called ‘tokens’, which can be of varying length, and my model has a maximum capacity of 48 tokens. It’s possible to make models with much more capacity, but training them requires beefy computers and graphics cards, and I’m just doing this in my bedroom. I also really wanted squiggly lines, but they are unfortunately unreliable on Japanese text; maybe a few Chrome/Safari updates from now they’ll improve.
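To show why a cut-off can land partway through a sentence, here is a small sketch of a 48-token limit applied to a list of tokens. It assumes the highlighted part is simply whatever fits in the first 48 tokens, which is a guess rather than something confirmed here. `split_at_limit` is a made-up helper, and the one-character-per-token split is a stand-in for a real tokenizer, whose tokens vary in length.

```python
from typing import List, Tuple

MAX_TOKENS = 48  # capacity quoted in the post above


def split_at_limit(tokens: List[str], max_tokens: int = MAX_TOKENS) -> Tuple[str, str]:
    """Return (text that fits in the model, text that overflows)."""
    fits = "".join(tokens[:max_tokens])
    overflow = "".join(tokens[max_tokens:])
    return fits, overflow


# Toy tokenization: one token per character. A real tokenizer would produce
# chunks of varying length, so the cut-off point in characters would differ.
sentence = "あ" * 60
fits, overflow = split_at_limit(list(sentence))
print(len(fits), len(overflow))  # 48 12
```

Because the limit counts tokens rather than characters, where the cut falls in the visible text depends on how the sentence happens to be tokenized.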
For that sentence, I can’t see anything that’s wrong either. I’ll just say it’s still early days for the model, and hopefully with more training, errors like that will become less and less frequent. Thanks for trying it out and for your feedback.
Just to say, I’ve been using this with my writing practice and it’s helping me a lot. Until now it’s been difficult to identify and self-correct errors, so this really helps my self-review. Thanks for creating and sharing.