Any known way to pull data #common from Tangorin.com for selected Kanji, into Spreadsheet format?


#1

It would preferably be in a format like this one https://community.wanikani.com/t/JLPT-Vocabulary-vs-Wanikani/15693, so that I can drill it.

I am doing this for this project:-


There seem to be many more N1 Kanji beyond Core 10K, but the counts of 40, 112 and 73 are definitely doable.

If only I were good at scripting…

///Edit (26 Mar 2017):
I have decided to remove Jisho.org from my preferred sources, so it is only Tangorin.com at the moment. Anyway, I might come up with a vocabulary list first.


#2

Any known way to pull data #common from Jisho.org for selected Kanji, into Spreadsheet format?

Can you clarify which data and which kanji?


#3

I now prefer http://tangorin.com, and I am not 100% sure that it uses the same database.

Kanji lists:

List A:
冶繭但頒肢侯遵謄采弐朕詔壱丙儒旺嗣抄嫡畝虞痘爵墾塑吏附宵逐褐楼勅硝逓翁薫厘孔斤薪
List B1:
樺脩橘巴渥惟禎苑惣圭祐倭肇漱楠笹晃鷹耀浩匡晋尭朋喬於榛嵯鮎絢蕉巽啄槙彬椿磯怜淳寅
List B2:
毅巳彦弘鴻李伍亘辰佑鳳綜悌柚穣碧邑秦皓卯彪舜允偲黎伽朔汐丑凱甫惇禄皐稀桐琢翠欽慧
List B3:
馨芹孟魁暉毬稜琉槻峻巌洲亨桂玲茅欣郁洸紘稔鵬敦蔦芙宏萩嶺黛酉旭蘭
List C1:
眸亥麿銑鞠茉燿脹詢蕗倖嵩滉伶玖莞錘捺凜裟碩勺頌菫赳彗晟迪袈捷熙柾昂奎丞絃茄胤紬叡
List C2:
椋洵菖勁誼蓉亦燎瑚恕耶梢凪衿匁澪梧琳燦晨綸晏昴爾笙侑椰崚侃紗竣柊瑶

Minimalistic spreadsheet result:

Vocab; Reading; Meaning; Alternative Meaning; Other forms (common only)

Actually, what I care about most at this time is ‘Reading’.
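
For reference, a minimal Python sketch of that layout, assuming a semicolon-delimited file. The `rows_to_csv` helper is hypothetical and the sample row is hand-written for illustration, not pulled from Tangorin or Jisho.

```python
import csv
import io

# Column layout taken from the post above.
COLUMNS = ["Vocab", "Reading", "Meaning", "Alternative Meaning", "Other forms (common only)"]

def rows_to_csv(rows):
    """Serialize dict rows into semicolon-delimited text with COLUMNS as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter=";", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Columns missing from a row are written as empty cells.
        writer.writerow({col: row.get(col, "") for col in COLUMNS})
    return buffer.getvalue()

# Hand-written sample row (illustrative only).
sample = [{"Vocab": "遵守", "Reading": "じゅんしゅ", "Meaning": "observance"}]
print(rows_to_csv(sample))
```

Since only ‘Reading’ matters for now, the other columns can simply stay empty until they are needed.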


#4

I have a rough idea of how to do this. Let me give it a shot and I’ll get back to you. Output is going to be a csv file of some form.


#5

I’m extracting data from Jisho right now, using the same script I used in the topic you mentioned. We’ll see if I succeed :slight_smile:


#6

And here you go @polv, that’s the best I could come up with. There’s some missing data (I have 19850 and Jisho has 20670 for #common) and I don’t know why, but for what you need I think it’s sufficient.

Cheers

PS: If that’s something anyone is interested in, here’s the script I used https://gist.github.com/WydD/957d1149fa503d9853cff969a332d9c2


#7

Thanks, but… higher Kanji, especially N1, are often hidden in “Other forms”.

Also, Jisho sometimes appears to show fewer common words than Tangorin, for some reason.



Lastly, Tangorin also tags which “Other forms” are common, that is, which alternative Kanji form is common and which reading is the most common.

This will help me decide, for each Kanji, whether the On or Kun readings should be emphasized, and which of the Kun readings should be emphasized first. I feel that this part has to be done manually, though, and with great care too.

Sorry for the slow explanation regarding the difference between Jisho and Tangorin.

Anyway, your old spreadsheet helped me a lot (and I added additional tags – WaniKani Kanji level; rather than WaniKani vocab level). I threw it into Anki for SRS, 10 kanji levels at a time.


#8

I’ve updated the spreadsheet and I’ve added the “other forms” that you can find on the second sheet :slight_smile:

I hope it fits your needs.


#9

Thanks, but some Kanji don’t even have a common word in Jisho.

For example, 冶, 頒, 侯 and 采, among the first 10 Kanji.

I might look into Python to find a way to optimize the code to fit my needs, so, thanks.
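
A check like this is easy to script. Below is a rough sketch, assuming the extracted vocabulary is available as a plain list of strings. The function name `kanji_without_words` is hypothetical, and the word list is a made-up placeholder rather than real Jisho output.

```python
def kanji_without_words(kanji_list, vocab):
    """Return the kanji (in order) that appear in none of the vocab words."""
    seen = set("".join(vocab))  # every character used by any word
    return [k for k in kanji_list if k not in seen]

# First 10 Kanji of List A.
list_a_head = "冶繭但頒肢侯遵謄采弐"

# Placeholder vocabulary standing in for an extracted "common" word list.
placeholder_words = ["繭", "但し", "肢体", "遵守", "謄本", "弐"]

print(kanji_without_words(list_a_head, placeholder_words))
```

Running it against the real spreadsheet column instead of the placeholder list would show exactly which Kanji still need another source.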


#10

Well, I think you should just browse JMdict then. “Commonness” is described by tags (news1, ichi1, etc.)

Search through jmdict: http://www.edrdg.org/jmdictdb/cgi-bin/srchform.py
JMDict documentation: http://www.edrdg.org/jmdict/edict_doc.html (section you want to look for is “Word Priority Marking”)

After that you just need to download JMdict in its raw form and parse the content. If you know a bit of scripting, it’s easily done.
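
As a rough illustration of that parsing step, here is a minimal Python sketch. It assumes the XML distribution of JMdict, where each `entry` holds `k_ele`/`keb`/`ke_pri`, `r_ele`/`reb` and `sense`/`gloss` elements. Which priority tags count as “common” is an assumption here (`COMMON_TAGS`), the `common_rows` helper is hypothetical, and the sample XML is hand-written with illustrative tags, not copied from the dictionary.

```python
import xml.etree.ElementTree as ET

# Assumption: these priority tags are treated as "common". Check the
# "Word Priority Marking" section of the documentation and adjust.
COMMON_TAGS = {"news1", "ichi1", "spec1", "spec2", "gai1"}

def common_rows(root, wanted_kanji):
    """Yield (written form, readings, glosses) for common forms containing a wanted kanji."""
    wanted = set(wanted_kanji)
    for entry in root.iter("entry"):
        # Simplification: reading restrictions (re_restr) are ignored.
        readings = [r.findtext("reb") or "" for r in entry.findall("r_ele")]
        glosses = [g.text for g in entry.iter("gloss") if g.text]
        for k_ele in entry.findall("k_ele"):
            keb = k_ele.findtext("keb") or ""
            tags = {p.text for p in k_ele.findall("ke_pri")}
            if tags & COMMON_TAGS and wanted & set(keb):
                yield keb, "、".join(readings), "; ".join(glosses)

# Hand-written sample in the JMdict XML shape (priority tags are illustrative).
SAMPLE = """<JMdict>
<entry>
<k_ele><keb>遵守</keb><ke_pri>news1</ke_pri></k_ele>
<r_ele><reb>じゅんしゅ</reb></r_ele>
<sense><gloss>observance</gloss></sense>
</entry>
<entry>
<k_ele><keb>冶金</keb></k_ele>
<r_ele><reb>やきん</reb></r_ele>
<sense><gloss>metallurgy</gloss></sense>
</entry>
</JMdict>"""

for row in common_rows(ET.fromstring(SAMPLE), "冶遵"):
    print(row)

# For the real file (assumed here to be saved locally as "JMdict_e"):
# root = ET.parse("JMdict_e").getroot()
```

Because the priority tag sits on each written form (`ke_pri`) rather than on the whole entry, the same loop also covers the “Other forms” case: every `k_ele` is checked separately.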