Japanese-to-Japanese data sets, available for commercial use?


#1

Hi,

While studying kanji, and Japanese in general, I’ve been tinkering with code to create a tool that will help me learn. What I have in mind isn’t, as far as I can tell, similar to anything that I could find on the Internet so far. At the moment, it mostly involves automatic analysis of big data sets, to extract from them the information and patterns I need.

As sources for the project, I've been using Kanjium, KanjiVG, JMdict and Tatoeba so far. I chose them because their data sets are released under Creative Commons or similar licenses; specifically, they may be used as a source in a commercial project. With the amount of work I am planning to put into this, it would be nice if it could someday pay the bills.

But Wanikani forum residents often describe these resources as flawed: tainted with incorrect information and with sentences no Japanese speaker would utter. Japanese-to-Japanese resources are held up as the shining counterexample, presumably because they are put together and vetted by professionals. In general, in the later stages of learning, Japanese-to-Japanese resources are preferred, in order to cut ties with the language we are learning from and to learn the real dependencies and connections between words and characters in Japanese. Therefore I would prefer to use Japanese-to-Japanese resources, like weblio, kanjipedia and presumably many others.

However, my Japanese skills aren't yet good enough to figure out whether any of these resources offers a data set that I could use freely in a commercial project. There may also be better resources around which, with my limited ability to read Japanese, I am missing entirely.

So, a question (which might be useful to other people as well, if they want to find such sources): does anybody know of Japanese-to-Japanese data sets, available under a license which allows using them as a source in commercial applications or websites?

Or data sets available for licensing, for a fee low enough that a commercial application remains feasible? And without requiring a visit in person to the institution in question, or the ability to communicate with them in fluent Japanese?


#2

Not sure what you are looking for, but maybe the sources listed in the Kodansha Kanji Learner's Course Grade Reader Set Vol. 1 contain something you find useful.

Sources
The author wishes to thank the organizations listed below for licensing their copyrighted materials and/or helping to disseminate public domain materials. Title to copyright in all materials not in the public domain remains with the organizations listed below. All licenses listed below extend also to the reader, under the same conditions provided at each license’s linked webpage.

  1. Aozora Bunko [青空文庫]: All cited materials are in the public domain. To access, visit aozora.gr.jp.
  2. Embassy of the United States in Japan: All cited materials are in the public domain. To access, visit japan.usembassy.gov.
  3. Kurohashi/Kawahara Lab at Kyoto University [京都大学黒橋・河原研究室]: All cited materials are used under the Creative Commons Attribution 3.0 Unported license (creativecommons.org/licenses/by/3.0). To access, visit nlp.ist.i.kyoto-u.ac.jp.
  4. Lexica Global Language Systems, LLC: On behalf of Lexica, the author donates his Japanese translations (covering original works #21, 38, 56, 59, 63, 65, 66, 78, 80, and 94) to the public domain.
  5. Librivox.org: All cited materials are in the public domain. To access, visit librivox.org.
  6. Ministry of Foreign Affairs, Japan: All cited materials are in the public domain. To access, visit http://mofa.go.jp/region/n-america/us/q&a/ref/2.html.
  7. Ministry of Justice, Japan: All cited materials are in the public domain. To access, visit http://japaneselawtranslation.go.jp/index/terms_of_use/?re=02.
  8. National Institute of Information and Communications Technology (NICT), Japan [情報通信研究機構]: Basic English Sentence Data [英語基本文データ] used under the Creative Commons Attribution 3.0 Unported License (creativecommons.org/licenses/by/3.0). To access, visit http://nlp.ist.i.kyoto-u.ac.jp. Japanese Wordnet used under public license granted by NICT. To access, visit http://nlpwww.nict.go.jp/wn-ja/index.en.html. All other cited materials used under the Creative Commons Attribution 1.0 License. To access, visit nict.go.jp.
  9. Princeton University: Wordnet 3.0 used under public license granted by Princeton University. To access, visit http://wordnet.princeton.edu/wordnet.
  10. Project Gutenberg: All cited materials are in the public domain. To access, visit gutenberg.org.
  11. Saylor.org: All cited materials are in the public domain. To access, visit saylor.org.
  12. Sugita Genpaku Project [プロジェクト杉田玄白]: Free public license granted for all cited materials. To access, visit genpaku.org/sugitalist01.html.
  13. Tatoeba.org: All cited materials are used under the Creative Commons Attribution 2.0 license (creativecommons.org/licenses/by/2.0/). To access, visit tatoeba.org.
  14. United Nations: All cited materials are in the public domain. To access, visit: (English text): http://un.org/en/universal-declaration-human-rights (Japanese text): ohchr.org/en/udhr/pages/language.aspx?langID=jpn.
  15. Wikisource.org: The English and Japanese versions of the Treaty of San Francisco, and the English version of the Treaty of Mutual Cooperation and Security between Japan and the United States of America, are used under the Creative Commons Attribution-ShareAlike License (creativecommons.org/licenses/by-sa/3.0/). All other cited materials are in the public domain. To access these materials, visit wikisource.org.
  16. Wordplanet.org: All cited materials are in the public domain. To access, visit wordplanet.org.

#3

Thank you. I meant databases similar to the ones I enumerated, expected to be parsed by algorithms, but for Japanese-to-Japanese content, like encyclopedias of words, sentences or kanji, or even web interfaces to Public Domain/Creative Commons content. The more different sources I have available, the more possibilities for cross-checking them and picking out differences, but I can’t use any source with a license that would prohibit commercial use.
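To make the cross-checking idea concrete, here is a minimal Python sketch. It assumes each source has already been parsed into a word-to-reading mapping and tagged with a license string; the source names, the `COMMERCIAL_OK` set, and both helper functions are hypothetical and only illustrate the idea, and the actual terms of every source would still need to be checked by hand.

```python
from collections import defaultdict

# Hypothetical license tags treated as allowing commercial use.
# The real terms of each source must be verified individually.
COMMERCIAL_OK = {"CC-BY", "CC-BY-SA", "CC0", "public-domain"}


def usable_sources(sources):
    """Keep only the sources whose license tag is in COMMERCIAL_OK."""
    return {
        name: src
        for name, src in sources.items()
        if src["license"] in COMMERCIAL_OK
    }


def find_disagreements(sources):
    """Return {word: {reading: [source names]}} for every word that
    is listed with more than one distinct reading across sources."""
    seen = defaultdict(lambda: defaultdict(list))
    for name, src in sources.items():
        for word, reading in src["entries"].items():
            seen[word][reading].append(name)
    return {
        word: dict(readings)
        for word, readings in seen.items()
        if len(readings) > 1
    }


if __name__ == "__main__":
    # Toy data for illustration only, not taken from any real data set.
    toy = {
        "source_a": {"license": "CC-BY-SA",
                     "entries": {"明日": "あした", "今日": "きょう"}},
        "source_b": {"license": "CC-BY-NC",
                     "entries": {"明日": "あす"}},
        "source_c": {"license": "CC0",
                     "entries": {"明日": "あす", "今日": "きょう"}},
    }
    print(find_disagreements(usable_sources(toy)))
```

With more sources, the same comparison could be run on definitions or example sentences instead of readings; anything flagged would then be a candidate for manual review rather than an automatic correction.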


#4

https://www.ninjal.ac.jp/english/database/type/corpora/

Check these out.

I would also suggest just getting in touch with someone in academia who does something close to what you want to do (someone who works in natural language processing, corpus linguistics, or computational linguistics for Japanese) and asking them. Any researcher in those fields will be familiar with the data sets available.


#5

Thank you. These resources might get me where I want to go :)