Hello all devs. This is a request (hopefully to benefit all! :-) )


#1

Hi all developers,

I am not skilled enough to pull this off. But I’d need a script that finds repeated Kanji in a text and deletes the doubles so that there’s only one of each kanji. 

You see, I am working on extracting dialogue from games (I’m starting with SNES) and writing down which kanji need to be learned, along with the large parts of the dialogue/vocab that need to be learned, to be able to play (or at least stumble through) the games in Japanese. A script like the one described above would greatly ease my job.

Is anyone interested? And if the project sounds interesting, please feel free to join me in this endeavour.

-Ari


#2

Do you want a program where you paste in a text, press a button, and it deletes one symbol whenever two of the same kind are next to each other?

Or do you want a tool which reads the text and lists every kanji that occurs in it?


#3

So, it sounds like what you’re doing is working through a large body of text, and then cutting it down by removing already added words as you go.
I’ve got next to no script knowledge, but I think I can offer you something useful nonetheless.

I’m working under the assumption that you’ve got the extracted dialogue in some sort of text format (thus accessible to word processors, etc.) and that it’s not riddled with “impurities” from the export process.

0. I think you should abandon the idea of deleting text - it sounds like you are doing it because the process you’re contemplating would require it. It’s a dangerous step (see 1.), and for your actual aim (identifying words and building a deck with them) completely unnecessary.

1. Possibly give up on attempting this with words. It is going to be tough as hell to do accurately. Deleting doubles could lead to awful errors: for example, you could delete a two-kanji compound that is identical to two kanji of a three-kanji compound, and that won’t go well for your aim. The only thing that could make it safe is running the text through a parser first. We have several (MeCab being one of the more common), but they aren’t too easy to use on their own, and all have (different) faults in how they parse.*

2. Going by kanji should be fine, and the aim of deleting should be unproblematic. To assist that:
2A. Also, consider exporting the dialogue to HTML, and use this WK userscript to highlight the kanji you’ve already learned or are set to in WK.
2B. Additionally, use its add kanji feature to one-by-one add the new kanji you come across, and it’ll highlight them differently. (It, or at least in its earlier forms, is an easy script to modify, if that better helps you.) That’ll let you “save” the words from wrongful deletion, and see if the kanji involved were used. Then use Rikaisama to build an Anki deck in Firefox, with sentences and audio. Check out how to set it up for this here.

tl;dr-> Skip to here:
3. Or, if you can put up with the parsing (and why not! a little error is only a little), and wish to automate the kanji end of things as much as possible, simply use cb’s Analyzer, which uses MeCab for words but independently checks kanji. cb makes some great stuff, and this will do exactly what you want as well as could possibly be done, without designing a new parser. For conveniently making a sweet deck once you open the text in Firefox with Rikai-sama (also cb’s, I believe), see link in 2B.

I could be wrong, but I think that’s exactly what you want.


#4

pseudocode

let kanji = array
foreach(glyph in text) {
   if (iskanji(glyph) and !kanji.contains(glyph)) {
       kanji.push(glyph)
   }
}
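
For anyone who wants to run it, here is a minimal Python sketch of the pseudocode above. The `is_kanji` range check is an assumption on my part (CJK Unified Ideographs plus Extension A); it is a rough test, not a complete definition of "kanji", and the function names are just illustrative.

```python
from __future__ import annotations


def is_kanji(glyph: str) -> bool:
    """Rough kanji test (assumption): CJK Unified Ideographs + Extension A."""
    code = ord(glyph)
    return 0x4E00 <= code <= 0x9FFF or 0x3400 <= code <= 0x4DBF


def unique_kanji(text: str) -> list[str]:
    """Return each kanji in text once, in order of first appearance."""
    kanji = []
    for glyph in text:
        # The membership test is what keeps duplicates out of the list.
        if is_kanji(glyph) and glyph not in kanji:
            kanji.append(glyph)
    return kanji


if __name__ == "__main__":
    print("".join(unique_kanji("日本語の本を日本で読む")))
```

The list keeps the kanji in the order they first appear in the dialogue, which may be handy if you want to learn them in roughly the order the game shows them.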


#5

Now I understand: you want to paste in a text and have the script output a list of the kanji which occur in it? That's easy to make.


#6
jakobd said... pseudocode

let kanji = array
foreach(glyph in text) {
   if (iskanji(glyph) and !kanji.contains(glyph)) {
       kanji.push(glyph)
   }
}
Don't use an array - that would allow duplicates.

You need to use an associative array, map, symbol table, or dictionary

But as already mentioned you need to parse the text into words, but knowing which kanji to be able to recognise is still useful.
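
A possible sketch of the dictionary approach in Python, using `collections.Counter` (a dict subclass) so that each kanji is a key and its value is how often it occurs. The range check in `is_kanji` is an assumption (basic CJK Unified Ideographs block only), and `kanji_counts` is just an illustrative name.

```python
from collections import Counter


def is_kanji(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as kanji.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def kanji_counts(text):
    """Map each kanji in text to the number of times it occurs."""
    return Counter(ch for ch in text if is_kanji(ch))


if __name__ == "__main__":
    counts = kanji_counts("日本語の本を日本で読む")
    # most_common() lists the kanji from most to least frequent.
    for ch, n in counts.most_common():
        print(ch, n)
```

Keys in a dictionary are unique by construction, and the counts come for free, which could help decide which kanji to learn first.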




#7

Good idea! In JS it could be done this way http://jsfiddle.net/mzkyyp5u/1/ 


#8
baerrach said...
Don't use an array - that would allow duplicates.

You need to use an associative array, map, symbol table, or dictionary

But as already mentioned you need to parse the text into words, but knowing which kanji to be able to recognise is still useful.



 Depends on the language.  For example if you're going with Ruby to code the script you could just do array.uniq! to remove the duplicates.  Although I agree that a hash would probably be the better way to go.

#9

You may want to try http://forum.koohii.com/viewtopic.php?pid=177549 for the vocab.
Also http://manest.github.io/kanji-extractor/ for kanji extraction.


#10
Are said... You may want to try http://forum.koohii.com/viewtopic.php?pid=177549 for the vocab.
Also http://manest.github.io/kanji-extractor/ for kanji extraction.

 Well, it seems there's no work for me to do anymore ^^

#11

tl;dr

Assuming you have each kanji on a new line




so on…

You could paste the text in Excel and delete duplicates


#12
baerrach said...
Don't use an array - that would allow duplicates.

You need to use an associative array, map, symbol table, or dictionary

But as already mentioned you need to parse the text into words, but knowing which kanji to be able to recognise is still useful.



 Looking at the pseudocode, I thought that any duplicate issue would have been solved at the if branch "not (kanji contains glyph)", or are you just saying that arrays allow duplicates in general?

#13

Thanks for your ideas to improve my little pseudocode. I can see how one could use a key-value store for this, too, and then wouldn’t have to worry about duplicates so much. I think my idea of just traversing the array to look for a duplicate should work equally well (which, as Ethan said, was implied by the contains method).


#14

Just out of curiosity, how are you extracting the kanji/text from the game?


#15

I think all of the above considerations are legitimate. If you use WK you know how different things become when broken up into vocab words and it would be really difficult for a naive script to determine how many actual words there are just based on raw unique kanji characters.

If you do just want unique characters, it’s hard to beat the “set” datatype: https://docs.python.org/2/library/stdtypes.html#set

Uniqueness is enforced, and membership tests and insertions are very, very fast.
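
A small sketch of how that could look with Python's built-in `set`. The kanji range check and the `known` set are assumptions made up for the example; in practice `known` would come from whatever you have already studied.

```python
def is_kanji(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as kanji.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def kanji_set(text):
    """Collect the distinct kanji in text; a set never stores duplicates."""
    return {ch for ch in text if is_kanji(ch)}


if __name__ == "__main__":
    in_game = kanji_set("日本語の本")
    known = {"日", "本"}          # hypothetical: kanji already learned
    to_learn = in_game - known    # set difference: what is still missing
    print(sorted(to_learn))
```

Set difference gives the "what do I still need to learn" list directly, without deleting anything from the original text.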


#16
Subtractem said...If you do just want unique characters, it's hard to beat the "set" datatype: https://docs.python.org/2/library/stdtypes.html#set
 I could think of one way to make look-up and insertion O(1), but you need enough memory. You make an array big enough to store every Unicode code point (or at least every code point that is interesting to you in that text, in this case probably every kanji). Then you need a function to map these code points to array offsets, which could be very simple: if kanji code points ran, hypothetically, from 5000 to 20000, you would just calculate x minus 5000. You prefill that array with zeroes. Now you can access the entry for a specific kanji directly by its array offset, and you don't have to iterate over the whole thing.

Because we only need to distinguish between 0 and 1 in this case, we could pack the entries together in groups of 8 into a single byte array, and use bit masks for manipulation to save some memory. Of course, this kind of optimization is probably not necessary; it was just an idea that came to mind.
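
Roughly, the bit-packed array could be sketched like this in Python. The range `LOW`..`HIGH` is an assumption (the basic CJK Unified Ideographs block) standing in for the hypothetical 5000 to 20000, and `KanjiBitmap` is just a name for the illustration.

```python
LOW, HIGH = 0x4E00, 0x9FFF  # assumed kanji range (CJK Unified Ideographs)


class KanjiBitmap:
    """One bit per code point in LOW..HIGH, packed 8 to a byte."""

    def __init__(self):
        size = HIGH - LOW + 1
        self.bits = bytearray((size + 7) // 8)  # prefilled with zeroes

    def add(self, ch):
        code = ord(ch)
        if not LOW <= code <= HIGH:
            return False  # not in the tracked range, ignore it
        offset = code - LOW  # the trivial mapping: x minus LOW
        self.bits[offset >> 3] |= 1 << (offset & 7)
        return True

    def __contains__(self, ch):
        code = ord(ch)
        if not LOW <= code <= HIGH:
            return False
        offset = code - LOW
        return bool(self.bits[offset >> 3] & (1 << (offset & 7)))

    def __iter__(self):
        # Yields the stored kanji in code point order.
        for offset in range(HIGH - LOW + 1):
            if self.bits[offset >> 3] & (1 << (offset & 7)):
                yield chr(LOW + offset)


if __name__ == "__main__":
    seen = KanjiBitmap()
    for ch in "日本語の本":
        seen.add(ch)
    print("".join(seen))
```

Note that the result comes out sorted by code point rather than in order of appearance, since the array position is all that is stored.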

#17
 That is much more complicated than it has to be. A HashSet is O(1). Hashes/Sets are very worthwhile data types to know about in computer science. The accepted answer here has great resources referenced: http://stackoverflow.com/questions/4558754/define-what-is-a-hashset

Also, computers have so much memory these days that is far from a consideration at this point with the problem that we are currently discussing. I highly doubt this problem requires more than a couple megabytes of memory at most.

#18

A coding approach to this, unless you want a pet project, sounds needlessly complicated.
As two of us have noted, cb’s Text Analyser does what you want with kanji already; and does better than anything short of creating a revolutionary new parser will for vocabulary.
The Firefox extension I mentioned, also by cb, and the beautifully crystal clear WaniKani forum post on setting it up, is probably simultaneously the fastest and richest (part of speech, sentence and audio included in the flashcard) way to import it to Anki.
If you want to go a step further, look up MorphMan on RTK’s wiki and forums, and you can then take your audio-enriched word-in-sentence-context Anki cards, and adjust them so that you’re presented them in optimum learning order, minimising unfamiliar words in the sentences while building up to understanding all of them.

Like others, I’d be curious about how you text-dumped the SNES game, and what condition the text is in (artefacts, etc.) - and of course, what game is it? :-)
I think someone on WK has a Final Fantasy IV deck up on the Anki site…


#19
Subtractem said... That is much more complicated than it has to be. A HashSet is O(1). Hashes/Sets are very worthwhile data types to know about in computer science. The accepted answer here has great resources referenced: http://stackoverflow.com/questions/4558754/define-what-is-a-hashset

Also, computers have so much memory these days that is far from a consideration at this point with the problem that we are currently discussing. I highly doubt this problem requires more than a couple megabytes of memory at most.
 I read the accepted answer. It sounds pretty similar to my solution. But it's right that you should probably use built-in language features most of the time instead of reinventing the wheel. Still, "internally managing an array and storing the object using an index which is calculated from the hashcode of the object" is pretty much what I had in mind, just that my example hash function isn't very fancy. (But it didn't need to be.)

#20
Subtractem said...
 That is much more complicated than it has to be. A HashSet is O(1). Hashes/Sets are very worthwhile data types to know about in computer science. The accepted answer here has great resources referenced: http://stackoverflow.com/questions/4558754/define-what-is-a-hashset

Also, computers have so much memory these days that is far from a consideration at this point with the problem that we are currently discussing. I highly doubt this problem requires more than a couple megabytes of memory at most.
 As much as I like using maps and hashtables, in this case, while they would be easier, the trivial function mapping a code point to a memory location would save both memory and processing power over a hash function.
I'm also a bit intrigued by the use of a general hash function (whose output will most likely be wider than 2 bytes) when a trivial hash function could just take the code point itself (which fits in 2 bytes).

An easy implementation with a C++ vector<bool> would take 65536 bits (8 KiB) to cover every 2-byte code point. You run it on the text converted to UCS-2 to avoid issues, set a flag to true every time you find a code point, and then just output all the code points you found (with some filtering if you want to remove some). UTF-8 makes it more of a pain because characters vary in length (technically UTF-16, unlike UCS-2, can have 4-byte characters encoded as surrogate pairs, but you're not breaking anything by splitting them up, since the halves fall in a reserved range that is easy to filter out).

tl;dr: two solutions both being O(1) doesn't mean one can't be 1000 times faster and use less memory.
Complexity is hardly a problem when you're considering relatively small sets anyway.
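
For comparison with the vector<bool> idea above, here is a rough Python equivalent: one flag per 2-byte code unit, filled by walking the text encoded as UTF-16. The function names and the default kanji range (basic CJK Unified Ideographs) are assumptions for the sketch, and a Python list of booleans is of course nowhere near as compact as a real bit vector.

```python
def seen_code_units(text):
    """Flag every 16-bit code unit that occurs in text."""
    seen = [False] * 65536
    data = text.encode("utf-16-le")  # 2 bytes per code unit, no BOM
    for i in range(0, len(data), 2):
        unit = data[i] | (data[i + 1] << 8)  # little-endian byte pair
        seen[unit] = True
    return seen


def found_kanji(seen, low=0x4E00, high=0x9FFF):
    """Filter the flags down to the assumed kanji range."""
    return [chr(unit) for unit in range(low, high + 1) if seen[unit]]


if __name__ == "__main__":
    # The last character lies outside the 2-byte range and is split
    # into two surrogate code units, which the filter simply skips.
    seen = seen_code_units("日本語の本𠮷")
    print("".join(found_kanji(seen)))
```

Characters beyond the 2-byte range end up as two flags in the surrogate area, so they are silently dropped by the filter rather than breaking anything; whether that matters depends on how rare such kanji are in the game text.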