Hello all devs. This is a request (hopefully to benefit all! :-) )

arimail · February 12, 2015, 7:48am

Hi all developers,

I am not skilled enough to pull this off myself, but I need a script that finds repeated kanji in a text and deletes the duplicates so that only one of each kanji remains.

You see, I am working on extracting dialogue from games (I’m starting with SNES) and writing down which kanji need to be learned, along with the big parts of the dialogue/vocab that need to be learned to be able to play (or at least stumble through) the games in Japanese. A script like the one mentioned above would greatly ease my job.

Is anyone interested? And if the project sounds interesting, please feel free to join me in this endeavour.

-Ari

zdennis · February 12, 2015, 8:09am

Do you want a program where you can paste in a text, press a button, and have it delete one symbol whenever two of the same kind are next to each other?

Or do you want a tool which reads the text and lists every kanji that occurs in it?

ocac · February 12, 2015, 8:27am

So, it sounds like what you’re doing is working through a large body of text, and then cutting it down by removing already added words as you go.
I’ve got next to no script knowledge, but I think I can offer you something useful nonetheless.

I’m working under the assumption that you’ve got the extracted dialogue in some sort of text format (thus accessible to word processors, etc.) and that it’s not riddled with “impurities” from the export process.

0. I think you should abandon the idea of deleting text - it sounds like you are doing it because the process you’re contemplating would require it. It’s a dangerous step (see 1.), and for your actual aim (identifying words and building a deck with them) completely unnecessary.

1. Possibly give up on attempting this with words. It is going to be tough as hell to do accurately. Deleting doubles could lead to awful errors: for example, you could delete a two-kanji compound that is identical to two kanji of a three-kanji compound, and that won’t go well for your aim. The only thing that could make it safe is running the text through a parser first. We have several (MeCab being one of the more common), but they aren’t too easy to use on their own, and all have (different) faults in how they parse.*

2. Going by kanji should be fine, and the aim of deleting should be unproblematic. To assist that:
2A. Also, consider exporting the dialogue to HTML, and use this WK userscript to highlight the kanji you’ve already learned or are set to in WK.
2B. Additionally, use its add kanji feature to one-by-one add the new kanji you come across, and it’ll highlight them differently. (It, or at least in its earlier forms, is an easy script to modify, if that better helps you.) That’ll let you “save” the words from wrongful deletion, and see if the kanji involved were used. Then use Rikaisama to build an Anki deck in Firefox, with sentences and audio. Check out how to set it up for this here.

tl;dr-> Skip to here:
3. Or, if you can put up with the parsing (and why not! a little error is only a little), and wish to automate the kanji end of things as much as possible, simply use cb’s Analyzer, which uses MeCab for words but independently checks kanji. cb makes some great stuff, and this will do exactly what you want as well as could possibly be done, without designing a new parser. For conveniently making a sweet deck once you open the text in Firefox with Rikai-sama (also cb’s, I believe), see link in 2B.

I could be wrong, but I think that’s exactly what you want.

salixh5 · February 12, 2015, 8:33am

pseudocode

let kanji = array
foreach(glyph in text) {
   if (iskanji(glyph) and !kanji.contains(glyph)) {
     kanji.push(glyph)
   }
}
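
A minimal Python sketch of the pseudocode above. The helper name `is_kanji` comes from the pseudocode, but its body is an assumption: it treats "kanji" as the CJK Unified Ideographs block (U+4E00 to U+9FFF) and ignores the rarer extension blocks.

```python
def is_kanji(ch: str) -> bool:
    # Assumption: kanji = CJK Unified Ideographs block, U+4E00..U+9FFF.
    return "\u4e00" <= ch <= "\u9fff"


def unique_kanji(text: str) -> list:
    """Return each kanji in `text` once, in order of first appearance."""
    kanji = []
    for glyph in text:
        # The membership check is what keeps duplicates out of the list.
        if is_kanji(glyph) and glyph not in kanji:
            kanji.append(glyph)
    return kanji


if __name__ == "__main__":
    print(unique_kanji("漢字と漢字、車の字"))  # ['漢', '字', '車']
```

The `glyph not in kanji` check scans the list each time, which is fine for a few thousand distinct kanji.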

zdennis · February 12, 2015, 9:37am

Now I understand: you want to paste in a text and have the script output a list of the kanji which occur in it? That's easy to make.

baerrach · February 12, 2015, 10:47am

jakobd said... pseudocode

let kanji = array
foreach(glyph in text) {
   if (iskanji(glyph) and !kanji.contains(glyph)) {
     kanji.push(glyph)
   }
}

Don't use an array - that would allow duplicates.

You need to use an associative array, map, symbol table, or dictionary

As already mentioned, you would need to parse the text into words to get vocab, but knowing which kanji you need to be able to recognise is still useful.

delacannon · February 12, 2015, 1:31pm

Good idea! In JS it could be done this way http://jsfiddle.net/mzkyyp5u/1/

wolph · February 12, 2015, 2:06pm

baerrach said...
jakobd said... pseudocode

let kanji = array
foreach(glyph in text) {
   if (iskanji(glyph) and !kanji.contains(glyph)) {
     kanji.push(glyph)
   }
}

Don't use an array - that would allow duplicates.

You need to use an associative array, map, symbol table, or dictionary

As already mentioned, you would need to parse the text into words to get vocab, but knowing which kanji you need to be able to recognise is still useful.

Depends on the language. For example if you're going with Ruby to code the script you could just do array.uniq! to remove the duplicates. Although I agree that a hash would probably be the better way to go.
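
For comparison, a rough Python counterpart to removing duplicates from an existing list in one step. This is only a sketch; the function name `dedupe_keep_order` is made up for illustration. It relies on `dict` preserving insertion order (guaranteed since Python 3.7), so the hash-based approach and the "just dedupe the array" approach end up being the same thing here.

```python
def dedupe_keep_order(glyphs):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order, so this both
    # removes duplicates and preserves the original ordering.
    return list(dict.fromkeys(glyphs))


if __name__ == "__main__":
    print(dedupe_keep_order(["漢", "字", "漢", "車", "字"]))  # ['漢', '字', '車']
```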

Are · February 12, 2015, 2:10pm

You may want to try http://forum.koohii.com/viewtopic.php?pid=177549 for the vocab.
Also http://manest.github.io/kanji-extractor/ for kanji extraction.

zdennis · February 12, 2015, 3:50pm

Are said... You may want to try http://forum.koohii.com/viewtopic.php?pid=177549 for the vocab.
Also http://manest.github.io/kanji-extractor/ for kanji extraction.

Well, it seems there's no work for me to do anymore^^

chepe263 · February 12, 2015, 4:07pm

tl;dr

Assuming you have each kanji on a new line:

漢
字
車
so on…

You could paste the text in Excel and delete duplicates

Ethan · February 12, 2015, 5:02pm

baerrach said...
jakobd said... pseudocode

let kanji = array
foreach(glyph in text) {
   if (iskanji(glyph) and !kanji.contains(glyph)) {
     kanji.push(glyph)
   }
}

Don't use an array - that would allow duplicates.

You need to use an associative array, map, symbol table, or dictionary

As already mentioned, you would need to parse the text into words to get vocab, but knowing which kanji you need to be able to recognise is still useful.

Looking at the pseudocode, I thought that any duplicate issue would have been solved at the if branch "not (kanji contains glyph)". Or are you just saying that arrays allow duplicates in general?

salixh5 · February 12, 2015, 5:18pm

Thanks for your ideas to improve my little pseudocode. I can see how one could use a key-value store for this, too, and then wouldn’t have to worry about duplicates so much. I think my idea of just traversing the array to look for a duplicate should work equally well (which, as Ethan said, was implied by the contains method).

Vicfred · February 12, 2015, 5:31pm

Just out of curiosity, how are you extracting the kanji/text from the game?

Subtractem · February 12, 2015, 5:32pm

I think all of the above considerations are legitimate. If you use WK you know how different things become when broken up into vocab words and it would be really difficult for a naive script to determine how many actual words there are just based on raw unique kanji characters.

If you do just want unique characters, it’s hard to beat the “set” datatype: https://docs.python.org/2/library/stdtypes.html#set

It is enforced unique and very, very fast.
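
A short sketch of the set approach in Python. The function name `kanji_set` and the U+4E00 to U+9FFF range check are assumptions made for illustration; the range is the CJK Unified Ideographs block and does not cover the extension blocks.

```python
def kanji_set(text):
    """Collect the distinct kanji in `text` as a set."""
    # Assumption: kanji = CJK Unified Ideographs block, U+4E00..U+9FFF.
    return {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}


if __name__ == "__main__":
    found = kanji_set("漢字と漢字、車の字")
    # Sets are unordered, so sort (by code point) for stable output.
    print(sorted(found))  # ['字', '漢', '車']
```

A set does not remember the order in which kanji first appeared, so sort or otherwise order the result if that matters.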

salixh5 · February 12, 2015, 5:58pm

Subtractem said...If you do just want unique characters, it's hard to beat the "set" datatype: https://docs.python.org/2/library/stdtypes.html#set

I could think of one way to make look-up and insertion O(1), but you need enough memory. You make an array big enough to store every Unicode code point (or at least every code point that is interesting to you in that text, in this case probably every kanji). Then you need a function to map these code points to array offsets, which could be very simple: for example, if kanji code points ran, hypothetically, from 5000 to 20000, you would just calculate x minus 5000. You prefill that array with zeroes. Now you can access the entry for a specific kanji directly by its array offset, and you don't have to iterate over the whole thing. Because we only need to distinguish between 0 and 1 in this case, we could also pack the entries in groups of 8 into a single byte array and use bit masks for manipulation to save some memory. Of course, this kind of optimization is probably not necessary; it was just an idea that struck me.
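
A sketch of that direct-indexing idea in Python, using one byte per code point for simplicity (no bit packing). The post's 5000 to 20000 range was hypothetical; the constants below assume the CJK Unified Ideographs block, U+4E00 to U+9FFF, and all names are illustrative.

```python
KANJI_START = 0x4E00  # assumed first kanji code point
KANJI_END = 0x9FFF    # assumed last kanji code point (inclusive)


def seen_kanji_table(text):
    """Return a zero-filled table with a 1 at the offset of each kanji found."""
    seen = bytearray(KANJI_END - KANJI_START + 1)  # prefilled with zeroes
    for ch in text:
        cp = ord(ch)
        if KANJI_START <= cp <= KANJI_END:
            # The "hash function" is just: code point minus range start.
            seen[cp - KANJI_START] = 1
    return seen


def kanji_from_table(seen):
    """List the kanji whose flag is set, in code point order."""
    return [chr(KANJI_START + i) for i, flag in enumerate(seen) if flag]


if __name__ == "__main__":
    table = seen_kanji_table("漢字と漢字、車")
    print(len(table), kanji_from_table(table))
```

The table costs 20,992 bytes regardless of how many kanji the text contains, and the result comes out in code point order, not in order of appearance.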

Subtractem · February 12, 2015, 9:52pm

jakobd said...
Subtractem said...If you do just want unique characters, it's hard to beat the "set" datatype: https://docs.python.org/2/library/stdtypes.html#set

I could think of one way to make look-up and insertion O(1), but you need enough memory. You make an array big enough to store every unicode codepoint (or at least every codepoint that is interesting to you in that text, in this case probably every kanji), then you need a function to map these codepoints to array offsets (which could be very simple, like for example if kanji codepoints are, hypothetically, from 5000 until 20000, then you just calculate x minus 5000), and you prefill that array with zeroes. now you can just access the entry for a specific kanji by directly calling the array offset, and you don't have to iterate over the whole thing. now, because we only need to distinguish between 0 and 1 in this case, we could group array entries together in groups of 8 and use a single byte-array, and then bit masks for manipulation to save some memory. of course, this kind of optimization is probably not necessary, it was just an idea that struck my mind.

That is much more complicated than it has to be. A HashSet already gives O(1) look-up and insertion on average. Hashes/Sets are very worthwhile data types to know about in computer science. The accepted answer here has great resources referenced: http://stackoverflow.com/questions/4558754/define-what-is-a-hashset

Also, computers have so much memory these days that memory use is far from a consideration for the problem we are currently discussing. I highly doubt this problem requires more than a couple of megabytes of memory at most.

ocac · February 13, 2015, 1:02am

A coding approach to this, unless you want a pet project, sounds needlessly complicated.
As two of us have noted, cb’s Text Analyser does what you want with kanji already; and does better than anything short of creating a revolutionary new parser will for vocabulary.
The Firefox extension I mentioned, also by cb, together with the beautifully crystal-clear WaniKani forum post on setting it up, is probably both the fastest and the richest way to import it into Anki (part of speech, sentence, and audio are included in the flashcard).
If you want to go a step further, look up MorphMan on RTK’s wiki and forums, and you can then take your audio-enriched word-in-sentence-context Anki cards, and adjust them so that you’re presented them in optimum learning order, minimising unfamiliar words in the sentences while building up to understanding all of them.

Like others, I’d be curious about how you have text dumped the SNES game, and what condition the text is in (artefacts, etc.) - and of course, what game is it?
I think someone on WK has a Final Fantasy IV deck up on the Anki site…

salixh5 · February 13, 2015, 1:07am

Subtractem said... That is much more complicated than it has to be. A HashSet is O(1). Hashes/Sets are very worthwhile data types to know about in computer science. The accepted answer here has great resources referenced: http://stackoverflow.com/questions/4558754/define-what-is-a-hashset

Also, computers have so much memory these days that is far from a consideration at this point with the problem that we are currently discussing. I highly doubt this problem requires more than a couple megabytes of memory at most.

I read the accepted answer. It sounds pretty similar to my solution. But it's right that you should probably use built-in language features most of the time instead of reinventing the wheel. Still, "internally managing an array and storing the object using an index which is calculated from the hashcode of the object" is pretty much what I had in mind, just that my example hash function isn't very fancy. (But it didn't need to be.)

meneldal · February 13, 2015, 4:10am

Subtractem said...
jakobd said...
Subtractem said...If you do just want unique characters, it's hard to beat the "set" datatype: https://docs.python.org/2/library/stdtypes.html#set

I could think of one way to make look-up and insertion O(1), but you need enough memory. You make an array big enough to store every unicode codepoint (or at least every codepoint that is interesting to you in that text, in this case probably every kanji), then you need a function to map these codepoints to array offsets (which could be very simple, like for example if kanji codepoints are, hypothetically, from 5000 until 20000, then you just calculate x minus 5000), and you prefill that array with zeroes. now you can just access the entry for a specific kanji by directly calling the array offset, and you don't have to iterate over the whole thing. now, because we only need to distinguish between 0 and 1 in this case, we could group array entries together in groups of 8 and use a single byte-array, and then bit masks for manipulation to save some memory. of course, this kind of optimization is probably not necessary, it was just an idea that struck my mind.

That is much more complicated than it has to be. A HashSet is O(1). Hashes/Sets are very worthwhile data types to know about in computer science. The accepted answer here has great resources referenced: http://stackoverflow.com/questions/4558754/define-what-is-a-hashset

Also, computers have so much memory these days that is far from a consideration at this point with the problem that we are currently discussing. I highly doubt this problem requires more than a couple megabytes of memory at most.

As much as I like using maps and hashtables, and while they would be easier here, a trivial function mapping a code point to a memory location would save both memory and processing power over a hash function.
I'm also a bit intrigued by the use of a general hash function (which will most likely output a value larger than 2 bytes) when a trivial hash function could just take the code point itself (which fits in 2 bytes).

An easy implementation with a C++ vector<bool> would take 65536 bits (8 KiB) to store a flag for every 2-byte code point. Run it on the text converted to UCS-2 to avoid issues, set the flag to true every time you find a code point, and then output all the code points you found (with some filtering if you want to remove some). UTF-8 makes it more of a pain because characters vary in length. (Strictly speaking, UCS-2 is fixed at 2 bytes and it is UTF-16 that encodes some characters as 4-byte surrogate pairs, but you're not breaking anything by treating the two halves separately.)
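
A rough Python analogue of the 65536-bit table, shown as a sketch only and not as the C++ vector<bool> version described. It encodes the text as UTF-16 (little-endian, no BOM), sets one bit per 2-byte code unit, then filters the output to an assumed kanji range of U+4E00 to U+9FFF. Function names are made up for illustration.

```python
def mark_code_units(text):
    """Set one bit for every 16-bit code unit that occurs in `text`."""
    bits = bytearray(65536 // 8)      # 65536 bits = 8192 bytes
    data = text.encode("utf-16-le")   # 2 bytes per code unit, no BOM
    for i in range(0, len(data), 2):
        unit = data[i] | (data[i + 1] << 8)  # little-endian 16-bit value
        bits[unit >> 3] |= 1 << (unit & 7)   # byte index, then bit mask
    return bits


def found_kanji(bits):
    """Output the flagged code units that fall in the assumed kanji range."""
    return [
        chr(u)
        for u in range(0x4E00, 0xA000)
        if bits[u >> 3] & (1 << (u & 7))
    ]


if __name__ == "__main__":
    bits = mark_code_units("漢字と漢字、車")
    print(len(bits), found_kanji(bits))
```

Characters outside the Basic Multilingual Plane show up as two surrogate code units (0xD800 to 0xDFFF), which the kanji filter simply ignores.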

tl;dr: two solutions both being O(1) doesn't mean one can't be 1000 times faster and use less memory.
Complexity is hardly a problem when you're considering relatively small sets anyway.
