Looking for a dataset of Japanese words by frequency

About “vocabularies”, I feel that it depends on the tokenizer, as well as the search method.

Recently, I tried searching PyPI for the Tanaka Corpus (I had misremembered it as the Innocent Corpus) and found toiro, which includes a downloader for other resources.

Actually, it doesn’t even include Tanaka at all :rofl:

```python
from toiro import datadownloader, tokenizers

# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
# => ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews', 'chABSA_dataset']

available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers.keys())
# => dict_keys(['nagisa', 'janome', 'mecab-python3', 'sudachipy', 'spacy', 'ginza', 'kytea', 'jumanpp', 'sentencepiece', 'fugashi-ipadic', 'tinysegmenter', 'fugashi-unidic'])
```
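Whichever tokenizer you pick from that list, building the frequency list itself is just counting. A minimal sketch, assuming the corpus has already been tokenized into lists of tokens (the sentences below are made up for illustration, not real corpus output):

```python
# Minimal sketch: once any of the tokenizers above has split the corpus
# into tokens, frequency ranking is just a collections.Counter.
# These tokenized sentences are illustrative placeholders.
from collections import Counter

tokenized_sentences = [
    ["猫", "が", "好き", "です"],
    ["犬", "も", "好き", "です"],
]

# Flatten all sentences into one stream of tokens and count them
freq = Counter(token for sent in tokenized_sentences for token in sent)

print(freq.most_common(3))
# => [('好き', 2), ('です', 2), ('猫', 1)]
```

Of course, the ranking you get this way will differ per tokenizer, since each one segments words differently — which is exactly the dependence mentioned above.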

As for the Innocent Corpus, apparently the Yomichan / AnkiConnect creator has cached it.

Furthermore, ja.wikipedia.org dumps can be downloaded, although I don’t know exactly how you should parse them.
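One stdlib-only way to at least get the raw wikitext out of the XML dump is streaming it with `xml.etree.ElementTree.iterparse`. This is a rough sketch, not a full parser: the tiny inline `SAMPLE` string stands in for the multi-gigabyte jawiki file, and the namespace URI is an assumption that should be checked against the actual dump version:

```python
# Sketch: stream (title, wikitext) pairs out of a MediaWiki XML dump
# without loading the whole file. SAMPLE imitates the dump structure;
# the export namespace version here is an assumption.
import io
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>猫</title>
    <revision><text>猫（ねこ）は、ネコ科の動物である。</text></revision>
  </page>
</mediawiki>"""

def iter_pages(fileobj):
    """Yield (title, wikitext) pairs, clearing elements to keep memory flat."""
    for _event, elem in ET.iterparse(fileobj):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(f"{NS}revision/{NS}text") or ""
            yield title, text
            elem.clear()  # essential on a real multi-GB dump

for title, text in iter_pages(io.StringIO(SAMPLE)):
    print(title, len(text))
```

The harder part, which this skips entirely, is stripping wiki markup (templates, links, tables) from the extracted text before tokenizing it.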

I have also long known of another way: Python’s wordfreq[cjk]. However, even with this, I don’t feel it’s all that accurate…
