Sentence creator


#1

So I’ve been fiddling around with an idea and came up with a first, very rough prototype. The idea is to fetch all the vocabulary you have learned so far on WK, and then filter a huge sentence corpus down to just the sentences that contain only those Kanji. It still produces some wrong results where two Kanji you know form a compound you don’t know yet, but I’ll try to resolve that later.

For everyone interested, here’s the Haskell version of it: http://privatepaste.com/e522758393

When it’s matured enough, I’ll put binaries up here.
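
The filtering step described above can be sketched in a few lines of Haskell. This is a hypothetical reconstruction, not the code from the paste: it treats any character in the CJK Unified Ideographs block as kanji and keeps a sentence only when every kanji in it is already known. It shares the limitation mentioned above, since two known kanji can still form an unknown compound.

```haskell
import qualified Data.Set as Set
import Data.Char (ord)

-- A character counts as kanji if it falls in the main CJK Unified
-- Ideographs block (U+4E00–U+9FFF); kana and punctuation are ignored.
isKanji :: Char -> Bool
isKanji c = ord c >= 0x4E00 && ord c <= 0x9FFF

-- A sentence is "readable" when every kanji in it is already known.
readable :: Set.Set Char -> String -> Bool
readable known = all (`Set.member` known) . filter isKanji

-- Keep only the readable sentences from a corpus.
filterCorpus :: Set.Set Char -> [String] -> [String]
filterCorpus known = filter (readable known)
```

For example, `filterCorpus (Set.fromList "犬") ["犬がいる。", "猫がいる。"]` keeps only the first sentence, because 猫 is not in the known set.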


#2

:smiley: This has potential. Good potential.


#3
rootnode said... So I've been fiddling around with an idea and came up with a first, very rough, prototype. […]

It sounds like a great idea. But where will you find this huge sentence corpus? I think this is the biggest hurdle. The Tanaka/Tatoeba corpus is useless due to machine translations, lack of revision, etc.

#4
Juichiro said... It sounds like a great idea. But where will you find this huge sentence corpus? […]

For now it's the Tatoeba corpus, but without translations, just the Japanese ones. Later on I'll just collect sentences from websites, texts, etc. I just need a huge list of Japanese sentences; Tatoeba is only used for testing up until now.

#5
rootnode said... For now it's the Tatoeba corpus, but without translations, just the Japanese ones. […]

Such a list does not seem to exist. :( The crazy experimental researchers at AJATT used to hype sentence-based learning. They might have decks available to you.

#6
Juichiro said... Such a list does not seem to exist. […]

I could just scrape web content. The program uses Japanese-only sentences, no translations whatsoever, so I could just crawl some blogs, books, etc. and extract the sentences.
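
Extracting sentences from scraped pages could start with something as naive as splitting on Japanese sentence terminators. This is only a sketch under the assumption that HTML has already been stripped; quoting, ellipses, and half-width punctuation are ignored.

```haskell
-- Split raw text into sentences at Japanese terminators (。！？),
-- keeping each terminator attached to its sentence. A real crawler
-- would also need to strip markup and handle quoted speech.
sentences :: String -> [String]
sentences [] = []
sentences txt =
  case break (`elem` "。！？") txt of
    (s, t:rest) -> (s ++ [t]) : sentences rest
    (s, [])     -> [s | not (null s)]
```

So `sentences "犬がいる。猫もいる。"` yields the two sentences separately, and trailing text without a terminator is kept as a final fragment.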

#7
rootnode said... I could just scrape web content. […]

So you could use Alc? Or are you wanting to avoid having to search individual kanji/vocab one by one?

#8
EskimoJo said... So you could use Alc? […]

What is ALC?

#9

http://www.alc.co.jp/

It’s a Japanese search engine that a lot of translators use (from what I’ve heard), and it has a lot of real-world sentences. Pretty useful for checking the difference between words and such.


#10

I really love sentence practice. Learning the readings/meanings is the core, but sentence practice is the last hammer strike that really teaches you how to use vocab. You can start to get a feel for the context of words, so it becomes easier to come up with your own sentences.

>haskell
I like you.


#11
rootnode said... For now it's the Tatoeba corpus, but without translations, just the Japanese ones. […]

What about the sentences from jisho.org? I seem to remember their sentence database being open to the public. I can't recall where I found the link to the actual database info, but somewhere they had the data for the various sentences available (worked on with the Tatoeba team, though I think jisho.org itself is much more reliable).

#12
HannahKingsley said... What about the sentences from jisho.org? […]

Aren't they the same?

Anyway, I'm really looking forward to this! That would be awesome!

#13
Senjougahara said... >haskell
I like you.

I needed some small project to play around a bit more with Haskell, and this was the perfect chance ^^

#14
HannahKingsley said... What about the sentences from jisho.org? […]
Jisho's corpus is a subset of Tatoeba.

#15

This is a great idea! If I were you, I’d ask the forum to “donate” sentences from Japanese blogs, articles, etc. and make it a big community project.


#16
rykun97 said... This is a great idea! […]

That may be an idea for later. First I have to get the algorithm right ^^

#17

Wow. The first Haskell program I’ve seen that does something… normal. Awesome :smiley:


#18
jakobd said... Wow. The first Haskell program I see which does something... normal. Awesome :D
Define "normal" :D I actually use it very often to load/transform/save data that I have to process for work.
Wait... you're from Aachen? Talk about coincidences :D

#19

Ok, I’ve made some improvements. Preprocessing is almost done and the results should be clean.
However, processing the sentences takes a lot of time: for level 16, about 30 minutes. So it’s not usable for on-the-fly filtering at the moment. But as soon as it’s stable enough, I could fire up my cluster, process a sentence pack for every level, and then upload them somewhere.
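
The per-level batch idea could look roughly like this. Everything here is hypothetical: `levelKanji` stands in for whatever kanji each WK level unlocks (the real lists would come from the WK API), and the known set simply accumulates level by level.

```haskell
import qualified Data.Set as Set
import Data.List (inits)

-- Kanji unlocked at each level, in order (illustrative placeholder
-- data only; the real lists would come from the WK API).
levelKanji :: [[Char]]
levelKanji = ["一二", "人日", "本中"]

-- Cumulative known-kanji sets: at level n you know levels 1..n.
cumulative :: [Set.Set Char]
cumulative = map (Set.fromList . concat) (tail (inits levelKanji))

-- One filtered sentence pack per level: each level's pack contains
-- only the sentences readable with that level's cumulative kanji.
packs :: [String] -> [[String]]
packs corpus = [ filter (ok known) corpus | known <- cumulative ]
  where
    ok known = all (`Set.member` known) . filter isKanji
    isKanji c = c >= '\x4E00' && c <= '\x9FFF'
```

With the placeholder data, a sentence like 一人。 would first appear in the level-2 pack, since 人 is not unlocked until then.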


#20

I can’t help but keep seeing your avatar and think that you are ashamed of this undertaking :stuck_out_tongue: