Sentence creator


#1

So I’ve been fiddling around with an idea and came up with a first, very rough prototype. The idea is to fetch all the vocabulary you have learned so far on WK, and then filter a huge sentence corpus down to just the sentences that contain only those Kanji. It still produces some wrong results where two Kanji you know form a compound you don’t know yet, but I’ll try to resolve that later.

For everyone interested, here’s the Haskell version of it: http://privatepaste.com/e522758393

When it’s matured enough, I’ll put binaries up here.
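
The filtering step described above can be sketched in a few lines of Haskell. This is a hypothetical reconstruction, not the code from the paste: it treats any character in the CJK Unified Ideographs block as kanji and keeps a sentence only when every kanji in it is already known. It shares the limitation mentioned above, since two known kanji can still form an unknown compound.

```haskell
import qualified Data.Set as Set
import Data.Char (ord)

-- A character counts as kanji if it falls in the main CJK Unified
-- Ideographs block (U+4E00–U+9FFF); kana and punctuation are ignored.
isKanji :: Char -> Bool
isKanji c = ord c >= 0x4E00 && ord c <= 0x9FFF

-- A sentence is "readable" when every kanji in it is already known.
readable :: Set.Set Char -> String -> Bool
readable known = all (`Set.member` known) . filter isKanji

-- Keep only the readable sentences from a corpus.
filterCorpus :: Set.Set Char -> [String] -> [String]
filterCorpus known = filter (readable known)
```

For example, `filterCorpus (Set.fromList "犬") ["犬がいる。", "猫がいる。"]` keeps only the first sentence, because 猫 is not in the known set.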


#2

:smiley: This has potential. Good potential.


#3
rootnode said... So I've been fiddling around with an idea and came up with a first, very rough, prototype. […]

It sounds like a great idea. But where will you find this huge sentence corpus? I think this is the biggest hurdle. The Tanaka/Tatoeba corpus is useless due to machine translations, lack of revision, etc.

#4
Juichiro said... It sounds like a great idea. But where will you find this huge sentence corpus? […]

For now it's the Tatoeba corpus, but without translations, just the Japanese ones. Later on I'll just collect sentences from websites, texts, etc. I just need a huge list of Japanese sentences; Tatoeba is only used for testing up until now.

#5
rootnode said... For now it's the Tatoeba corpus, but without translations, just the Japanese ones. […]

Such a list does not seem to exist. :( The crazy experimental researchers at AJATT used to hype sentence-based learning. They might have decks available to you.

#6
Juichiro said... Such a list does not seem to exist. […]

I could just scrape web content. The program uses Japanese-only sentences, no translations whatsoever, so I could just crawl some blogs, books, etc. and extract the sentences.
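
Extracting sentences from scraped pages could start with something as naive as splitting on Japanese sentence terminators. This is only a sketch under the assumption that HTML has already been stripped; quoting, ellipses, and half-width punctuation are ignored.

```haskell
-- Split raw text into sentences at Japanese terminators (。！？),
-- keeping each terminator attached to its sentence. A real crawler
-- would also need to strip markup and handle quoted speech.
sentences :: String -> [String]
sentences [] = []
sentences txt =
  case break (`elem` "。！？") txt of
    (s, t:rest) -> (s ++ [t]) : sentences rest
    (s, [])     -> [s | not (null s)]
```

So `sentences "犬がいる。猫もいる。"` yields the two sentences separately, and trailing text without a terminator is kept as a final fragment.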

#7
rootnode said... I could just scrape web content. […]

So you could use Alc? Or are you wanting to avoid having to search individual kanji/vocab one by one?

#8
EskimoJo said... So you could use Alc? […]

What is ALC?

#9

http://www.alc.co.jp/

It’s a Japanese search engine that a lot of translators use (from what I’ve heard), and it has a lot of real-world sentences. Pretty useful for checking the difference between words and such.


#10

I really love sentence practice. Learning the readings/meanings is the core, but sentence practice is the last hammer strike that really teaches you how to use vocab. You can start to get a feel for the context of words, so it becomes easier to come up with your own sentences.

>haskell
I like you.


#11
rootnode said... For now it's the Tatoeba corpus, but without translations, just the Japanese ones. […]

What about the sentences from jisho.org? I seem to remember their sentence database being open to the public. I can't recall where I found the link to the actual database info, but somewhere they had the data for the various sentences available (worked on with the Tatoeba team, though I think jisho.org itself is much more reliable).

#12
HannahKingsley said... What about the sentences from jisho.org? […]

Aren't they the same?

Anyway, I'm really looking forward to this! That would be awesome!

#13
Senjougahara said... >haskell
I like you.

I needed some small project to play around a bit more with Haskell, and this was the perfect chance ^^

#14
HannahKingsley said... What about the sentences from jisho.org? […]
Jisho's corpus is a subset of Tatoeba.

#15

This is a great idea! If I were you, I’d ask the forum to “donate” sentences from Japanese blogs, articles, etc. and make it a big community project.


#16
rykun97 said... This is a great idea! […]

That may be an idea for later. First I have to get the algorithm right ^^

#17

Wow. The first Haskell program I’ve seen that does something… normal. Awesome :smiley:


#18
jakobd said... Wow. The first Haskell program I see which does something... normal. Awesome :D
Define "normal" :D I actually use it very often to load/transform/save data that I have to process for work.
Wait... you're from Aachen? Talk about coincidences :D

#19

Ok, I’ve made some improvements. Preprocessing is almost done and the results should be clean.
However, processing the sentences takes a lot of time: for level 16, about 30 minutes. So it’s not usable for on-the-fly filtering at the moment. But as soon as it’s stable enough, I could fire up my cluster, process a sentence pack for every level, and then upload them somewhere.
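
The per-level batch idea could look roughly like this. Everything here is hypothetical: `levelKanji` stands in for whatever kanji each WK level unlocks (the real lists would come from the WK API), and the known set simply accumulates level by level.

```haskell
import qualified Data.Set as Set
import Data.List (inits)

-- Kanji unlocked at each level, in order (illustrative placeholder
-- data only; the real lists would come from the WK API).
levelKanji :: [[Char]]
levelKanji = ["一二", "人日", "本中"]

-- Cumulative known-kanji sets: at level n you know levels 1..n.
cumulative :: [Set.Set Char]
cumulative = map (Set.fromList . concat) (tail (inits levelKanji))

-- One filtered sentence pack per level: each level's pack contains
-- only the sentences readable with that level's cumulative kanji.
packs :: [String] -> [[String]]
packs corpus = [ filter (ok known) corpus | known <- cumulative ]
  where
    ok known = all (`Set.member` known) . filter isKanji
    isKanji c = c >= '\x4E00' && c <= '\x9FFF'
```

With the placeholder data, a sentence like 一人。 would first appear in the level-2 pack, since 人 is not unlocked until then.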


#20

I can’t help but keep seeing your avatar and think that you are ashamed of this undertaking :stuck_out_tongue: