Looking for a dataset of Japanese words by frequency

About “vocabularies”, I feel that it depends on the tokenizer, as well as the search method.

Recently, I tried searching PyPI for the Tanaka Corpus (I had misremembered it as the Innocent Corpus) and found toiro, which includes a downloader for other resources.

Actually, it doesn’t even include Tanaka at all :rofl:

```python
from toiro import datadownloader, tokenizers

# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
# => ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews', 'chABSA_dataset']

available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers.keys())
# => dict_keys(['nagisa', 'janome', 'mecab-python3', 'sudachipy', 'spacy', 'ginza', 'kytea', 'jumanpp', 'sentencepiece', 'fugashi-ipadic', 'tinysegmenter', 'fugashi-unidic'])
```
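Whichever tokenizer you pick from that list, building the frequency list itself is just counting. A minimal sketch, assuming the corpus has already been tokenized into lists of tokens (the sentences below are made up for illustration, not real corpus output):

```python
# Minimal sketch: once any of the tokenizers above has split the corpus
# into tokens, frequency ranking is just a collections.Counter.
# These tokenized sentences are illustrative placeholders.
from collections import Counter

tokenized_sentences = [
    ["猫", "が", "好き", "です"],
    ["犬", "も", "好き", "です"],
]

# Flatten all sentences into one stream of tokens and count them
freq = Counter(token for sent in tokenized_sentences for token in sent)

print(freq.most_common(3))
# => [('好き', 2), ('です', 2), ('猫', 1)]
```

Of course, the ranking you get this way will differ per tokenizer, since each one segments words differently — which is exactly the dependence mentioned above.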

As for the Innocent Corpus, apparently the Yomichan / AnkiConnect creator has cached it.

Furthermore, ja.wikipedia.org dumps can be downloaded, although I don’t know exactly how you should parse them.
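One stdlib-only way to at least get the raw wikitext out of the XML dump is streaming it with `xml.etree.ElementTree.iterparse`. This is a rough sketch, not a full parser: the tiny inline `SAMPLE` string stands in for the multi-gigabyte jawiki file, and the namespace URI is an assumption that should be checked against the actual dump version:

```python
# Sketch: stream (title, wikitext) pairs out of a MediaWiki XML dump
# without loading the whole file. SAMPLE imitates the dump structure;
# the export namespace version here is an assumption.
import io
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>猫</title>
    <revision><text>猫（ねこ）は、ネコ科の動物である。</text></revision>
  </page>
</mediawiki>"""

def iter_pages(fileobj):
    """Yield (title, wikitext) pairs, clearing elements to keep memory flat."""
    for _event, elem in ET.iterparse(fileobj):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(f"{NS}revision/{NS}text") or ""
            yield title, text
            elem.clear()  # essential on a real multi-GB dump

for title, text in iter_pages(io.StringIO(SAMPLE)):
    print(title, len(text))
```

The harder part, which this skips entirely, is stripping wiki markup (templates, links, tables) from the extracted text before tokenizing it.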

I have also long known of another way: Python’s wordfreq[cjk]. However, even with this, I don’t feel it’s all that accurate…
