How can you automatically extract all the vocabulary from a paragraph?

I believe it is possible, but how?

The flip side would be extracting all the grammar points (ranked by frequency).

Could you elaborate a bit? I’m not entirely sure what you want to do. Do you want to extract vocab out of a webpage, or just any paragraph?

I have found an answer for both, though (but not for extracting grammar):

http://nihongo.monash.edu/cgi-bin/wwwjdic?9T

So do you want to write your own software that parses a sentence or paragraph and spits out the words and their parts of speech/grammar, or are you looking for a tool that does this? I believe it was discussed in this thread, no?

That’s what a morphemizer (morphological analyzer) does. Look up MeCab, RakutenMA, and friends. I believe there are web services that can do that for you as well.

If it’s short enough, then Jisho.org can pick out some of the grammar points. It is still not exactly great.

And actually, I am working on a plugin for Chrome that would separate Japanese sentences into words and phrases. That is quite a tall order, though! It’s not as easy as it sounds. There are lots of rules, and several of them have enough exceptions that those cases have to be programmed in specifically!

I’ll let y’all know if I get something working!

Do you want a visual output for yourself or something in a good machine readable format for further use?

Yes, more preferably machine readable, to be more easily thrown into Anki.

MeCab is a pretty good analyzer that does the trick. It comes with a command-line interface you can use to get what you want. You will get something like this:

% mecab
すもももももももものうち
すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
うち    名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

This is basically a list of the words, each followed by its grammatical information (part of speech, base form, reading). But be aware that the results will sometimes be flawed: it trips over some input, and sometimes it splits things too finely.
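If you want that output in a form a script can work with, here is a minimal Python sketch that parses it. It assumes the IPADIC-style layout shown above (the surface form, whitespace, then comma-separated features with the part of speech first and the base form seventh); other dictionaries may order the fields differently. The names Token and parse_mecab_output are made up for this example and are not part of MeCab.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    surface: str  # the word as it appears in the text
    pos: str      # top-level part of speech, e.g. 名詞 (noun)
    base: str     # dictionary (base) form


def parse_mecab_output(text: str) -> List[Token]:
    """Parse MeCab's default text output (IPADIC-style layout assumed)."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "EOS":
            continue  # skip blank lines and end-of-sentence markers
        parts = line.split(None, 1)  # surface, then the feature string
        if len(parts) != 2:
            continue  # not a token line (e.g. the echoed input sentence)
        surface, feature_str = parts
        features = feature_str.split(",")
        # Assumption: field 7 is the base form; fall back to the surface form.
        has_base = len(features) > 6 and features[6] != "*"
        base = features[6] if has_base else surface
        tokens.append(Token(surface, features[0], base))
    return tokens


SAMPLE = """すもももももももものうち
すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も\t助詞,係助詞,*,*,*,*,も,モ,モ
もも\t名詞,一般,*,*,*,*,もも,モモ,モモ
の\t助詞,連体化,*,*,*,*,の,ノ,ノ
EOS
"""

if __name__ == "__main__":
    for token in parse_mecab_output(SAMPLE):
        print(token.surface, token.pos, token.base)
```

Filtering on pos then gives you just the nouns or verbs for vocabulary, and counting the particles and auxiliaries would be a rough starting point for the grammar side.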

Someone on the forum introduced me to ichi.moe, and it has been a very helpful tool for me. It will break the sentence down for you and define each word.

I believe NLTK (http://www.nltk.org/) has some POS tagging libraries for the Japanese language. You might be looking for something more user-friendly, though…

However, if you can Python (or are willing to learn, which is not that difficult), it would be relatively easy to create a simple program that takes text as input and generates a CSV file for use in Anki.
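To give an idea of how small such a program can be, here is a rough sketch of the counting and CSV part. It assumes you already have (base form, part of speech) pairs from some analyzer. The column layout (word, part of speech, count) and the set of parts of speech kept are my own choices for illustration, and the names vocab_rows and write_anki_csv are hypothetical; adjust the columns to whatever fields your Anki note type expects.

```python
import csv
import io
from collections import Counter

# Assumption: treat nouns, verbs, adjectives, and adverbs as "vocabulary".
CONTENT_POS = {"名詞", "動詞", "形容詞", "副詞"}


def vocab_rows(tokens):
    """Count content words and return (word, pos, count), most frequent first."""
    counts = Counter(
        (base, pos) for base, pos in tokens if pos in CONTENT_POS
    )
    return [(base, pos, n) for (base, pos), n in counts.most_common()]


def write_anki_csv(rows, out):
    """Write one row per word to a file-like object, comma-separated."""
    writer = csv.writer(out)
    for row in rows:
        writer.writerow(row)


if __name__ == "__main__":
    tokens = [
        ("すもも", "名詞"), ("も", "助詞"), ("もも", "名詞"), ("も", "助詞"),
        ("もも", "名詞"), ("の", "助詞"), ("うち", "名詞"),
    ]
    buffer = io.StringIO()
    write_anki_csv(vocab_rows(tokens), buffer)
    print(buffer.getvalue())
```

To write a real file instead of printing, open it with open("vocab.csv", "w", newline="", encoding="utf-8") and pass that to write_anki_csv.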

Thanks. I can Python a bit, using basics from C++ plus some googling, but I haven’t read how to Python yet. I have to check out a textbook.