Finding the most frequent kanji in a text file is easy. It’d be just a dozen lines of code in Python.
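Something like this would do it, as a rough sketch. The function names are made up, and the regex is an assumption: it only covers the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so rarer kanji in the extension blocks would be missed.

```python
import re
from collections import Counter

# Assumption: "kanji" means the basic CJK Unified Ideographs block only.
KANJI = re.compile(r"[\u4e00-\u9fff]")

def kanji_frequencies(text):
    """Count how often each kanji appears in a string."""
    return Counter(KANJI.findall(text))

def print_top_kanji(path, n=20):
    """Print the n most frequent kanji in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        counts = kanji_frequencies(f.read())
    for kanji, count in counts.most_common(n):
        print(kanji, count)
```

You’d call `print_top_kanji("yourfile.txt")` with whatever file you want to check. Kana, punctuation and Latin letters are simply ignored because they don’t match the pattern.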
The difficult part is word frequency, because Japanese isn’t written with spaces between words and you’d need a way to split a sentence into words correctly. That’s actually pretty difficult, and there’s a whole field, natural language processing (NLP), that works on it. I don’t know enough about it to suggest a library or something.
I was going to use NLP in my master’s thesis to do sentiment analysis… but a month before the submission deadline I realised it was too much work and not even expected in a business school. So I used some very simple stats instead (and got an A anyway).
I only remember terms like tf-idf matrix, tokenization, word stems, etc. But I don’t know the practical side, nor do I actually understand the theory…
I did a lot of this kinda thing when I was working on bunpo check. I’ve bashed together a quick and dirty Python script you can use to do this. You can download it from my Google Drive here
How to use it
First, you will need to install a dictionary and a parser in order to run the script. I’m gonna assume you’re using a Mac, but if you’re using Windows then the instructions will be slightly different.
Download the Python script linked above and move the script and your subtitle file to the same folder, e.g. on your Desktop.
Let’s say you put them in a folder called “frequency” on your Desktop. Then you will need to run the following in your Terminal. This command makes “frequency” the current folder in the Terminal:
cd ~/Desktop/frequency
Run the script! (Obviously swap yoursubtitlefilename with the actual name of the file.)
python3 freq.py yoursubtitlefilename.srt
Check output.txt for all the words/particles and their frequencies, sorted by frequency.
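For reference, the general shape of a script like this can be sketched as follows. This is not the freq.py linked above, just an illustration: the `tokenize` argument stands in for whatever parser and dictionary you installed, and all the function names here are made up.

```python
import re
from collections import Counter

# An SRT timing line looks like: 00:00:01,000 --> 00:00:03,500
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")

def subtitle_lines(srt_text):
    """Yield only the dialogue lines, skipping cue numbers, timings and blanks.

    Note: a dialogue line made up purely of digits is skipped too.
    """
    for line in srt_text.splitlines():
        line = line.strip()
        if not line or line.isdigit() or TIMING.match(line):
            continue
        yield line

def word_frequencies(srt_text, tokenize):
    """Count tokens across all dialogue lines.

    tokenize is any function that takes a string and returns a list of words
    (hypothetical placeholder for a real Japanese tokenizer).
    """
    counts = Counter()
    for line in subtitle_lines(srt_text):
        counts.update(tokenize(line))
    return counts

def write_report(counts, path="output.txt"):
    """Write one 'word<TAB>count' line per word, most frequent first."""
    with open(path, "w", encoding="utf-8") as f:
        for word, n in counts.most_common():
            f.write(f"{word}\t{n}\n")
```

The tokenizer is passed in as a function so the counting logic doesn’t depend on any particular parser. For a quick test on space-separated text, `str.split` works as the tokenizer.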
Wow. I posted this just before work yesterday, and I’m surprised by the speed and variety of the responses. I did try a couple of suggestions, but it was just too troublesome.
I have just decided to go over the grammar points I want to practice scene by scene, and look over the vocabulary/kanji as and when I encounter it.