How can I find out the most common words in a subtitle text file?

Hey guys!

I’m struggling to find any information on this online.

Basically, I have exported the Japanese subtitles from a Japanese film as a .srt file which I’m viewing through Microsoft Notepad.

I’m hoping to use this file to review some frequently used vocabulary/kanji in the film.

Is there a tool that can automatically tell me what the most common kanji are in a text file? Or will I have to go through the script manually?

Thanks in advance, and apologies if this has already been asked in past posts.

I don’t know of any tool off the top of my head, but if you download Notepad++ instead, I’m sure there’s a plugin that does something like that.

The most common words in a single piece of work aren’t going to be very useful. They’ll mostly end up being particles or extremely basic words.

That’s why most frequency lists are based on a corpus of work.

That being said, you can try this tool:


If there are spaces between the words (I doubt it but you never know…), you can use this online one:

edit: never mind, I just saw alo’s reply, and he links to a much better tool :slight_smile:


Maybe you could add extra spaces between everything with a macro? Then it would definitely work.

Finding the most frequent kanji in a text file is easy. It’d be just a dozen lines of Python (see the sketch below).

The difficult part is word frequency, because you’d need a way to split a sentence into words correctly. That’s actually pretty difficult, and there’s a whole field of natural language processing (NLP) that works on it. I don’t know enough about it to suggest a library or anything.
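
For the kanji-counting part, something like this works (a rough sketch: it assumes the file is UTF-8 and treats anything in the main CJK ideograph range as a kanji; “subs.srt” is just a placeholder filename):

from collections import Counter

# Read the subtitle file; "subs.srt" is a placeholder name.
with open("subs.srt", encoding="utf-8") as f:
    text = f.read()

# Keep only characters in the CJK Unified Ideographs block (common kanji).
kanji = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]

# Print the 50 most common kanji with their counts.
for ch, count in Counter(kanji).most_common(50):
    print(ch, count)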


Well, I wouldn’t say “better” lol

The last update was 6 years ago, so hopefully it still works.

Aye. You’d need to do a lexical analysis to tokenize the actual words and then a parser to figure out the rest. Or something like that.

I don’t really do NLP but I can probably hold a dinner party conversation about it. :wink:
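
To make the tokenization step above concrete, here’s roughly what it looks like with a morphological analyser like MeCab (just an illustration; the exact word splits depend on which dictionary you install):

import MeCab  # requires: pip install mecab-python3 unidic-lite

# -Owakati makes MeCab output the tokens separated by spaces.
tagger = MeCab.Tagger("-Owakati")
print(tagger.parse("日本語の文章を単語に分割する"))
# e.g. 日本語 の 文章 を 単語 に 分割 する (splits vary by dictionary)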

I was going to use NLP in my master’s thesis to do sentiment analysis… but a month before the submission deadline I realised it was too much work and not even expected at a business school. So I used some very simple stats instead (and got an A anyway).

I only remember terms like tf-idf matrix, tokenization, word stem, etc. But I don’t know the practical side :sweat_smile: Nor do I actually understand the theory…


I did a lot of this kind of thing when I was working on bunpo check. I’ve bashed together a quick and dirty Python script you can use to do this; you can download it from my Google Drive here

How to use it
First, you will need to install a dictionary and a parser in order to run the script. I’m gonna assume you’re using a Mac, but if you’re using Windows then the instructions will be slightly different.

  1. Open the Terminal and run the following commands:
python3 -m pip install unidic-lite
python3 -m pip install mecab-python3
  2. Download the Python script linked above and move the script and your subtitle file into the same folder, e.g. on your Desktop.
  3. Let’s say you put them in a folder called “frequency” on your Desktop. You will then need to run the following in the Terminal; this command makes “frequency” the current folder:
cd ~/Desktop/frequency
  4. Run the script! (Obviously, swap yoursubtitlefilename for the actual name of your file.)
python3 freq.py yoursubtitlefilename.srt
  5. Check output.txt for all the words/particles and their frequencies, sorted by frequency.

Good luck! Lemme know if you run into any issues :stuck_out_tongue:
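
In case the download link ever dies, here’s a rough sketch of what a script like this could look like (my reconstruction, not the actual freq.py; the .srt cleanup in particular is approximate):

import re
import sys
from collections import Counter

import MeCab  # requires: pip install mecab-python3 unidic-lite

# Read the subtitle file given on the command line.
with open(sys.argv[1], encoding="utf-8") as f:
    text = f.read()

# Drop .srt cue numbers and timestamp lines, keeping only dialogue.
text = re.sub(r"^\d+\s*$", "", text, flags=re.MULTILINE)
text = re.sub(r"^\d\d:\d\d:\d\d[,.]\d+ --> .*$", "", text, flags=re.MULTILINE)

# Tokenize each dialogue line; -Owakati outputs space-separated words.
tagger = MeCab.Tagger("-Owakati")
words = []
for line in text.splitlines():
    if line.strip():
        words.extend(tagger.parse(line).split())

# Write every word and its count to output.txt, most frequent first.
with open("output.txt", "w", encoding="utf-8") as out:
    for word, count in Counter(words).most_common():
        out.write(f"{word}\t{count}\n")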


Wow. I posted this just before work yesterday, and I’m surprised by the speed and variety of the responses. I did try a couple of the suggestions, but it was just too troublesome.

I have just decided to go over the grammar points I want to practice scene by scene, and look over the vocabulary/kanji as and when I encounter it.

Thanks for your suggestions!
