Dependency Parsing for Japanese


#1

Hello all,

this is my first post, よろしくね。

I have been experimenting with different dependency parsers for Japanese recently, and I was wondering if anyone else is interested. Basically, you (programmatically) split a sentence into its smallest parts and then figure out how the parts relate to each other. I think this should help with learning how to parse/read sentences yourself, and make glossing easier.

There are several tools available for this; at the moment I’m checking out Jumanpp and KNP. Example output looks like this:

You read from top to bottom, and things on the left modify things on the right, with the predicate at the very end. Multiple things in parallel can modify a verb.
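
To make the reading direction concrete, here is a minimal Python sketch of the idea. The sentence, its split into units, and the head indices are all written by hand for illustration; this is not actual Jumanpp/KNP output, and the names `Unit`, `modifiers_of` and `root_index` are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    text: str  # surface form of the unit
    head: int  # index of the unit this one modifies; -1 marks the predicate


# Hand-segmented example (an assumption, not parser output):
# 太郎は 本を 読んだ ("Taro read a book")
sentence = [
    Unit("太郎は", 2),   # modifies the verb
    Unit("本を", 2),     # also modifies the verb, in parallel
    Unit("読んだ", -1),  # predicate at the very end
]


def modifiers_of(units, index):
    """Return the texts of all units that directly modify units[index]."""
    return [u.text for u in units if u.head == index]


def root_index(units):
    """Return the index of the unit that modifies nothing (the predicate)."""
    return next(i for i, u in enumerate(units) if u.head == -1)


root = root_index(sentence)
print(sentence[root].text, "<-", modifiers_of(sentence, root))
```

Every head index points to a unit further right, which is the "left modifies right" rule, and two units sharing the same head is what "multiple things in parallel" looks like in data.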

Has anyone tried to use something like this in the past?


Also, I’m currently building a program that turns KNP output into HTML.

Recent examples:
Above sentence again (NHK news easy)
A few sentences from a normal NHK news story today

In the second example, you can click the sentence headers (white boxes) to jump to another sentence. Hover over a unit to show a gloss, click a unit to keep the gloss visible after the mouse leaves it, and use the form buttons to toggle some additional internal data. The colors show the grammatical function.

Obviously the usability is not great, and you can’t enter your own sentences in these examples, but maybe it’s helpful. Also, I mainly use Chrome; Firefox and Safari seem to work, but no guarantees. Finally, Jumanpp and KNP are around 90-95% correct according to their developers, so there is a fair chance that a given analysis is not entirely correct, and the same goes for the glosses.

Is anyone interested in a tool like this, or would anyone find it helpful for learning Japanese?


#2

Tremendous. Yes, this helps a lot to refresh words or grammar points. As you said, not 100% correct, so users should take it with a grain of salt. But it’s a pretty small grain of salt.


#3

MeCab is pretty much the de facto standard morphological analyzer. In fact, if I remember correctly, I think maybe KNP uses it on the back end.

That being said, I’m actually building my own. At the moment, I’m restructuring a dictionary to optimize morphological searches.
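
One common way to structure a dictionary for this kind of lookup is a trie with a common-prefix search, which returns every dictionary entry that starts at a given position in the text. The sketch below is generic and uses a toy word list; it is not the design described in this post, and `Trie` and `common_prefixes` are illustrative names only.

```python
class Trie:
    """A plain character trie; each node marks whether a word ends there."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def common_prefixes(self, text, start=0):
        """Return every dictionary word that begins at text[start]."""
        node, found = self, []
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break  # no dictionary word continues with this character
            if node.is_word:
                found.append(text[start:end + 1])
        return found


trie = Trie()
for w in ["東", "東京", "東京都", "京都", "都"]:  # toy dictionary
    trie.insert(w)

print(trie.common_prefixes("東京都に住む"))     # candidates starting at position 0
print(trie.common_prefixes("東京都に住む", 1))  # candidates starting at position 1
```

A single left-to-right walk collects all candidate words at a position, so the analyzer does not need one dictionary lookup per possible substring.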


#4

I think you are remembering CaboCha, which uses MeCab by default. I was also looking at CaboCha; it sometimes gives better results than KNP, but it has fewer features (parallel structures, for example, are missing). It would be no problem to integrate it into my frontend though. You can directly compare the outputs at this web service:
http://langrid.org/playground/dependency-parser.html

In the end I chose Jumanpp because it seems more actively developed; the latest version of MeCab is from 2013. However, I am also looking forward to better support for Japanese in SyntaxNet/TensorFlow. Modern AI techniques should improve parsing accuracy even further.

You are building your own? Very interesting! What will the new features be?


#5

@plantron It’s nice to hear that you find it useful. I’m considering putting the interactive version on the web after polishing it a bit further. Automatic translation is also moving forward; for example, Google Translate has already gotten much better on Japanese input, but you still need a truckload of salt when using it. It also hides other possible translations.

I’m thinking of these tools as something halfway between just using Rikaichan et al., where you try to figure out the correct word boundaries yourself, and a fully automatic translation: an “assisted self-translation” or something.


#6

It’s not so much for the purpose of new features, though I expect some new things as a byproduct. I’m doing it for a few reasons:

  • Reinventing the wheel gives you a better understanding of how things work. (And it tends to encourage new ideas.)
  • I’m building some language learning tools that need grammar analysis, and the existing morphology tools aren’t well suited to what I want to do. I need access to the analyzer’s internal decisions, and some ability to work interactively.

#7

I understand what you mean; it’s part of the reason I started to play around with dependency parsers. And judging from the number of existing morphological analyzers, lots of people have felt the same way.

Do you use machine learning for the analysis? What do you think about using machine learning frameworks in that case?


#8

I haven’t gotten far enough to need machine learning yet, but I’ll consider the utility of it as I encounter specific problems.

I’m not entirely sure where to draw the line in defining machine learning. Generally, if you understand a problem well enough, you have already learned the information, and can apply an algorithmic model accordingly. If you don’t understand a problem yet, machine learning can identify statistically significant features, which can be integrated into your model with or without human intervention (e.g. supervised vs unsupervised learning). I suppose the difficulty is in determining at what level of abstraction you call it machine learning versus just an algorithm.

Since I’m aiming for accuracy, I prefer curated knowledge over unsupervised machine learning. But I’m happy to let a machine identify features, then feed them to a human to interpret, filter, and integrate into the model (i.e. human-supervised learning). But as I said, I haven’t gotten to that point yet. :)
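
As a rough illustration of the "machine identifies, human filters" workflow, here is a small Python sketch. It only counts adjacent token pairs in a hand-tokenized toy corpus and lists the frequent ones as candidates for a person to review; the corpus, the threshold, and the name `candidate_patterns` are assumptions made up for the example, not part of any existing tool.

```python
from collections import Counter


def candidate_patterns(tokenized_sentences, min_count=2):
    """Count adjacent token pairs and return the frequent ones for human review."""
    counts = Counter()
    for tokens in tokenized_sentences:
        counts.update(zip(tokens, tokens[1:]))
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


# Toy corpus, tokenized by hand (an assumption, not analyzer output).
corpus = [
    ["本", "を", "読ん", "で", "いる"],
    ["映画", "を", "見", "て", "いる"],
    ["手紙", "を", "書い", "て", "いる"],
]

# The machine proposes; a human decides which pairs become rules in the model.
for pair, n in candidate_patterns(corpus):
    print(pair, n)
```

The point of the design is that nothing enters the model automatically: the counts only rank what a person looks at first.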


#9

I feel like posting xkcd comics is going to start becoming my thing…


#10

There’s a lot of truth in that comic :)
Supervised learning is like a game of whack-a-mole, where you use negative feedback to beat down any false answers. The key is to make the system smart enough to whack its own moles.


#11

If I understand you correctly, you imply that if you understand, for example, “the rules of Japanese” perfectly, then you can parse/translate everything with complete accuracy using a finite number of (predefined) rules. This would mean that we just haven’t understood the rules correctly yet.

On the other hand, most text is produced by humans who are only intuitively aware of the rules, and who know only a subset of them anyway. Human language is not a formal language, so you will have to react to input you have never seen before.

I’m not arguing that machine learning is never overused, or never tossed at problems you’re just too lazy to find a proper solution for. But for language processing, image recognition, etc., the results achieved show that it’s quite accurate, even though you don’t know exactly what’s going on inside.


#12

I’m not really arguing for more or less use of machine learning. I’m mostly just pondering when something is considered machine learning. And my conclusion, my personal definition, is that it’s a linear scale based on how much of the rule-derivation is done by humans versus the computer.


I think the appropriate criterion for choosing when and what to implement via machine learning is:

  • When the balance of 「cost of implementation」 and 「quality of result」 is better when using machine learning for a given portion of the system.

The balance often tips in favor of human-derived algorithms when the rules are well-known and can be codified algorithmically, and that’s where I currently am in my project (at a very early stage).

When I reach portions of the project where I don’t understand the rules well, or at least how to derive them algorithmically, I will consider machine learning. My first instinct is to analyze the problem until I understand it well enough to implement it algorithmically. But that’s not always possible, practical, or efficient.

