Looking for a dataset of Japanese words by frequency

Hi,

I’m writing a program for myself that would take the most frequent words (either 100 or 1000, I haven’t decided yet), randomly sample a few each day, and test me on them.

I haven’t found a good dataset for this. For now I’ve made do with scraping a Wiktionary article on the most frequent words, but it’s not ideal (it works, but it doesn’t show the frequencies or the source, so I can’t really trust it; it’s also a bit of a pain because getting any extra info on a word requires opening another page).

Does anyone have anything I could use?

Alternatively, I’ll finish scraping from Wiktionary and then I can share my own dataset here.

Thanks

It may depend on what your most important reading sources will be (for example, newspapers, fiction, or Wikipedia), but if it’s going to be based on fiction or books to some degree, there is a Novel 5K list on Reddit that may be of help, since it has frequency distributions and top 6 patterns of all kinds.

The kanji frequency shown on wkstats.com and Jisho (both also free) is based on Japanese newspapers (or Joyo/JLPT, if that’s any help), but it doesn’t cover vocabulary by itself. You could still work backwards from that and the vocab top 6 from the Novel 5K if you want to focus on newspapers, but it won’t be a “most frequent 100/1000” list (it would be a decent approximation, though). However, at that point you may be better served by finding a standard 1K list (an Anki deck, for example), since it would be at least pretty similar.
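If you do go the standard-list route, the daily quiz part is tiny. Here’s a minimal sketch using only the standard library; the CSV contents, file layout, and column names are made up for illustration (a real Anki export or downloaded list would differ):

```python
import csv
import io
import random

# Hypothetical excerpt of a frequency list export (word,frequency).
# In practice you'd read this from a real file on disk instead.
csv_data = """word,frequency
の,1000000
に,800000
は,790000
を,750000
た,700000
が,690000
"""

with io.StringIO(csv_data) as f:
    rows = list(csv.DictReader(f))

# Keep the top N words (the rows are assumed to be pre-sorted by frequency).
top_n = [row["word"] for row in rows[:5]]

# Random daily sample to quiz yourself on.
sample = random.sample(top_n, 3)
print(sample)
```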

Hope this helps, or at least generates some food for thought!

These are the various sources used by the Wanikani Open Framework’s JLPT, Joyo, and Frequency filters. I’m not sure whether this data is up to date.

JLPT Source: JLPT Kanji
Joyo Source: List of jōyō kanji - Wikipedia
Total frequency Source: Kanji ordered by frequency of use - Kanjicards.org
NHK frequency Source: kanji data for NHK News Web Easy · GitHub
Other frequency sources: topokanji/data/kanji-frequency at master · scriptin/topokanji · GitHub
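The files in the scriptin repos are plain JSON, so turning one into relative frequencies is only a few lines. I haven’t verified the exact schema, so the snippet below parses a hypothetical inline excerpt shaped as [character, count] pairs (check the actual files before relying on this):

```python
import json

# Hypothetical excerpt shaped as [character, count] pairs, with an
# aggregate "all" row; the real files may differ.
raw = '[["all", 5000], ["日", 1200], ["一", 1100], ["大", 900]]'

pairs = json.loads(raw)

# Drop any aggregate row and build a {kanji: count} mapping.
counts = {char: n for char, n in pairs if char != "all"}

# Relative frequency of each kanji within this excerpt.
total = sum(counts.values())
freqs = {char: n / total for char, n in counts.items()}
print(freqs["日"])
```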

Regarding vocabulary, I feel that it depends on the tokenizer, as well as the search method.

Recently, I tried searching PyPI for the Tanaka Corpus (which I had misremembered as the Innocent Corpus), and I found toiro, which includes downloaders for other resources.

Actually, it doesn’t even cover Tanaka at all :rofl:

from toiro import datadownloader, tokenizers

# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
# => ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews', 'chABSA_dataset']

available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers.keys())
# => dict_keys(['nagisa', 'janome', 'mecab-python3', 'sudachipy', 'spacy', 'ginza', 'kytea', 'jumanpp', 'sentencepiece', 'fugashi-ipadic', 'tinysegmenter', 'fugashi-unidic'])

As for the Innocent Corpus, apparently the Yomichan / AnkiConnect creator has cached it.

Furthermore, ja.wikipedia.org can be downloaded as a dump, although I don’t know exactly how you should parse it.
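One common approach to the dumps is to stream the export XML with xml.etree.ElementTree.iterparse and pull out each page’s text. A rough sketch against a tiny inline stand-in (real dumps are huge, bz2-compressed, and carry a versioned xmlns on the root, so treat this as the shape of the idea, not a finished parser):

```python
import io
import xml.etree.ElementTree as ET

# Tiny inline stand-in for a MediaWiki export dump.
sample_xml = """<mediawiki>
  <page>
    <title>猫</title>
    <revision><text>猫（ねこ）は…</text></revision>
  </page>
  <page>
    <title>犬</title>
    <revision><text>犬（いぬ）は…</text></revision>
  </page>
</mediawiki>"""

def iter_pages(fileobj):
    # Stream pages one at a time so a full dump never sits in memory.
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        # Strip a namespace prefix like "{...}page" if one is present.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "page":
            title = elem.findtext(".//title")  # with a real dump you'd
            text = elem.findtext(".//text")    # have to handle the xmlns
            yield title, text
            elem.clear()  # free the subtree we just consumed

pages = list(iter_pages(io.StringIO(sample_xml)))
print([title for title, _ in pages])
```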

I have also always known another way: Python’s wordfreq[cjk]. However, even with that, I don’t feel it’s that accurate…

It’s been a while, but IIRC I ended up just scraping Wiktionary.

I’ll post it here in case it’s helpful for anyone, but it’s very rough; I didn’t care much since it was just for me.

import pandas as pd
import requests
import bs4


def get_soup():
    # TODO:
    #   - Cache instead of downloading every time
    #   - Alternatively use https://learnjapanesedaily.com/most-common-japanese-words.html
    #     maybe better since I should test myself on active (EN -> JP) not just passive knowledge

    req = requests.get("https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/Japanese")
    assert req.status_code == requests.codes.ok
    soup = bs4.BeautifulSoup(req.text, features="lxml")
    word_soup = soup.select(".Jpan a")

    hrefs = []
    words = []
    for row in word_soup:
        hrefs.append(row.get("href"))
        words.append(row.get("title"))

    word_df = pd.DataFrame(
        {
            "words": words,
            "hrefs": hrefs
        }
    )

    return word_df


df = get_soup()

# Get the 100 and 1000 most common words (iloc's end index is exclusive)
top_100 = df.iloc[0:100]
top_1000 = df.iloc[0:1000]

# Random sample of 5 from 100 most common
sample = top_100.sample(5)

print(list(sample['words']))

and then I’d rerun the last two lines to get a new set each time… I didn’t finish the whole thing; it’s just the basic functionality.
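One small tweak on the rerun-the-last-two-lines workflow: seeding the RNG with the date makes the day’s set stable (rerunning won’t change it) while still rotating every day. A sketch, with a stand-in list since the scraped words aren’t reproduced here:

```python
import datetime
import random

# Stand-in for the scraped top-100 word list.
top_100_words = [f"word{i}" for i in range(100)]

def daily_sample(words, k=5, day=None):
    # Seed with the ISO date so every run on the same day returns
    # the same k words, and the next day rotates to a new set.
    day = day or datetime.date.today().isoformat()
    return random.Random(day).sample(words, k)

print(daily_sample(top_100_words, day="2024-01-01"))
```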

(also as you can probably guess I don’t really use it lol, as always I have an idea and never finish it)
