How can I find out the most common words in a subtitle text file?

Hey guys!

I’m struggling to find any information on this online.

Basically, I have exported the Japanese subtitles from a Japanese film as a .srt file which I’m viewing through Microsoft Notepad.

I’m hoping to use this file to review some frequently used vocabulary/kanji in the film.

Is there a tool that can automatically tell me what the most common kanji are in a text file? Or will I have to go through the script manually?

Thanks in advance, and apologies if this has already been asked in past posts.

I don’t know of any tool off the top of my head, but if you download Notepad++ instead, I’m sure there’s a plugin that does something like that.

The most common words in a single piece of work aren’t going to be very useful. They’ll mostly end up being particles or extremely basic words.

That’s why most frequency lists are based on a corpus of work.

That being said, you can try this tool:


If there are spaces between the words (I doubt it but you never know…), you can use this online one:

edit: never mind, I just saw alo’s reply, and he links to a much better tool :slight_smile:


Maybe you could add extra spaces between everything with a macro? Then it would definitely work.

Finding the most frequent kanji in a text file is easy. It’d be just a dozen lines of Python (see the sketch below).

The difficult part is word frequency, because you’d need a way to split a sentence into words correctly. That’s actually pretty difficult, and there’s a whole field of natural language processing (NLP) that works on it. I don’t know enough about it to suggest a library or anything.
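
For the kanji-counting part, something like this works (a rough sketch: it assumes the file is UTF-8 and treats anything in the main CJK ideograph range as a kanji; “subs.srt” is just a placeholder filename):

from collections import Counter

# Read the subtitle file; "subs.srt" is a placeholder name.
with open("subs.srt", encoding="utf-8") as f:
    text = f.read()

# Keep only characters in the CJK Unified Ideographs block (common kanji).
kanji = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]

# Print the 50 most common kanji with their counts.
for ch, count in Counter(kanji).most_common(50):
    print(ch, count)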


Well, I wouldn’t say “better” lol

The last update was 6 years ago, so hopefully it still works.

Aye. You’d need to do a lexical analysis to tokenize the actual words and then a parser to figure out the rest. Or something like that.

I don’t really do NLP but I can probably hold a dinner party conversation about it. :wink:
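
To make the tokenization step above concrete, here’s roughly what it looks like with a morphological analyser like MeCab (just an illustration; the exact word splits depend on which dictionary you install):

import MeCab  # requires: pip install mecab-python3 unidic-lite

# -Owakati makes MeCab output the tokens separated by spaces.
tagger = MeCab.Tagger("-Owakati")
print(tagger.parse("日本語の文章を単語に分割する"))
# e.g. 日本語 の 文章 を 単語 に 分割 する (splits vary by dictionary)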

I was going to use NLP in my master’s thesis to do sentiment analysis… but a month before the submission deadline I realised it was too much work and not even expected at a business school. So I used some very simple stats instead (and got an A anyway).

I only remember terms like tf-idf matrix, tokenization, word stem, etc. But I don’t know the practical side :sweat_smile: Nor do I actually understand the theory…


I did a lot of this kind of thing when I was working on bunpo check. I’ve bashed together a quick and dirty Python script you can use to do this; you can download it from my Google Drive here

How to use it
First, you will need to install a dictionary and a parser in order to run the script. I’m gonna assume you’re using a Mac, but if you’re using Windows then the instructions will be slightly different.

  1. Open the Terminal and run the following commands:
python3 -m pip install unidic-lite
python3 -m pip install mecab-python3
  2. Download the Python script linked above and move the script and your subtitle file into the same folder, e.g. on your Desktop.
  3. Let’s say you put them in a folder called “frequency” on your Desktop. You will then need to run the following in the Terminal; this command makes “frequency” the current folder:
cd ~/Desktop/frequency
  4. Run the script! (Obviously, swap yoursubtitlefilename for the actual name of your file.)
python3 freq.py yoursubtitlefilename.srt
  5. Check output.txt for all the words/particles and their frequencies, sorted by frequency.

Good luck! Lemme know if you run into any issues :stuck_out_tongue:
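
In case the download link ever dies, here’s a rough sketch of what a script like this could look like (my reconstruction, not the actual freq.py; the .srt cleanup in particular is approximate):

import re
import sys
from collections import Counter

import MeCab  # requires: pip install mecab-python3 unidic-lite

# Read the subtitle file given on the command line.
with open(sys.argv[1], encoding="utf-8") as f:
    text = f.read()

# Drop .srt cue numbers and timestamp lines, keeping only dialogue.
text = re.sub(r"^\d+\s*$", "", text, flags=re.MULTILINE)
text = re.sub(r"^\d\d:\d\d:\d\d[,.]\d+ --> .*$", "", text, flags=re.MULTILINE)

# Tokenize each dialogue line; -Owakati outputs space-separated words.
tagger = MeCab.Tagger("-Owakati")
words = []
for line in text.splitlines():
    if line.strip():
        words.extend(tagger.parse(line).split())

# Write every word and its count to output.txt, most frequent first.
with open("output.txt", "w", encoding="utf-8") as out:
    for word, count in Counter(words).most_common():
        out.write(f"{word}\t{count}\n")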


Wow. I posted this just before work yesterday, and I’m surprised by the speed and variety of the responses. I did try a couple of the suggestions, but it was just too troublesome.

I have just decided to go over the grammar points I want to practice scene by scene, and look over the vocabulary/kanji as and when I encounter it.

Thanks for your suggestions!
