I’m writing a program for myself that would take the most frequent words (either the top 100 or 1000, I haven’t decided yet), randomly sample a few each day, and test me on them
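Roughly what I have in mind is something like this (just a sketch; the CSV name and columns are made up since I don’t have the dataset yet, and seeding by the date means rerunning it on the same day gives the same words):

import datetime
import pandas as pd

words = pd.read_csv("frequency_list.csv")    # assumed columns: word, rank
top = words.sort_values("rank").head(100)    # or 1000

# Seed by today's date so the daily set is stable across reruns
seed = int(datetime.date.today().strftime("%Y%m%d"))
todays_words = top.sample(5, random_state=seed)
print(list(todays_words["word"]))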
I haven’t found a good dataset for this. For now I made do with scraping a Wiktionary article on the most frequent words, but it’s not ideal (it works, but it doesn’t show the frequencies or the source, so I can’t really trust it, plus it’s a bit of a PITA since getting any extra info on a word means opening another page)
Does anyone have anything I could use?
Alternatively, I’ll finish scraping from Wiki and then I can share my own dataset here
It may depend on what your main reading source or sources will be (newspapers, fiction, Wikipedia, etc.), but if it is going to be based on fiction or books to some degree, there is a Novel 5K list on Reddit that may be of help, since it has frequency distributions and top-6 patterns of all kinds.
The kanji frequency shown on wkstats.com and Jisho (both also free) is based on Japanese newspapers (or Joyo/JLPT, if that is of any help), but it doesn’t help with vocabulary by itself. You could still work backwards from that and the vocab top 6 from the Novel 5K if you want to focus on newspapers, but it won’t be a “most frequent 100/1000” list (it would be a decent approximation, though). However, at that point you may be better served by finding a standard 1K list - an Anki deck, for example - since it would end up pretty similar anyway.
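To make the “work backwards” part concrete, here is a rough sketch of what I mean (the file names and column names are placeholders for whatever exports you end up with): take the newspaper kanji frequencies, keep only the top few hundred kanji, and filter the Novel 5K vocab down to words written entirely with those kanji (or kana).

import pandas as pd

kanji_df = pd.read_csv("newspaper_kanji_frequency.csv")   # assumed columns: kanji, rank
vocab_df = pd.read_csv("novel_5k_vocab.csv")              # assumed columns: word, rank

top_kanji = set(kanji_df.sort_values("rank").head(500)["kanji"])

def uses_only_common_kanji(word):
    # Kana pass through; any kanji in the word must be in the top set
    return all(ch in top_kanji for ch in word if "\u4e00" <= ch <= "\u9fff")

newspaper_leaning = vocab_df[vocab_df["word"].map(uses_only_common_kanji)].head(1000)
print(newspaper_leaning.head())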
Hope this helps, or at least generates some food for thought!
About “vocabularies”, I feel that it depends on the tokenizer, as well as on the search method.
Recently, I tried searching PyPI for the Tanaka Corpus (I had misremembered it as InnocentCorpus) and found toiro, which includes a downloader for other resources.
Actually, it doesn’t even include Tanaka at all
from toiro import datadownloader, tokenizers
# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
# => ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews', 'chABSA_dataset']
available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers.keys())
# => dict_keys(['nagisa', 'janome', 'mecab-python3', 'sudachipy', 'spacy', 'ginza', 'kytea', 'jumanpp', 'sentencepiece', 'fugashi-ipadic', 'tinysegmenter', 'fugashi-unidic'])
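If the end goal is a frequency list rather than the corpora themselves, any of those tokenizers will do for a tokenize-and-count pass. A rough sketch with janome (pip install janome), using a placeholder corpus and counting surface forms rather than lemmas:

from collections import Counter
from janome.tokenizer import Tokenizer

# Placeholder sentences; in practice this would be the text of a downloaded corpus
corpus = [
    "猫が好きです。",
    "犬も猫も好きです。",
]

tokenizer = Tokenizer()
counts = Counter()
for sentence in corpus:
    for token in tokenizer.tokenize(sentence):
        pos = token.part_of_speech.split(",")[0]
        # Skip particles, auxiliaries and punctuation so they don't dominate the list
        if pos in ("助詞", "助動詞", "記号"):
            continue
        counts[token.surface] += 1

print(counts.most_common(10))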
it’s been a while but iirc I ended up just scraping Wiktionary
I’ll post it here just in case it’s helpful for anyone, but it’s very rough - I didn’t put much care into it as it was just for me
import pandas as pd
import requests
import bs4
def get_soup():
    # TODO:
    # - Cache instead of downloading every time
    # - Alternatively use https://learnjapanesedaily.com/most-common-japanese-words.html
    #   maybe better since I should test myself on active (EN -> JP) not just passive knowledge
    req = requests.get("https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/Japanese")
    assert req.status_code == requests.codes.ok
    soup = bs4.BeautifulSoup(req.text, features="lxml")
    # Every word on the list is a link inside a .Jpan element
    word_soup = soup.select(".Jpan a")
    hrefs = []
    words = []
    for row in word_soup:
        hrefs.append(row.get("href"))
        words.append(row.get("title"))
    word_df = pd.DataFrame(
        {
            "words": words,
            "hrefs": hrefs,
        }
    )
    return word_df
df = get_soup()

# Get the 100 and 1000 most common words
top_100 = df.iloc[:100]
top_1000 = df.iloc[:1000]

# Random sample of 5 from the 100 most common
sample = top_100.sample(5)
print(list(sample["words"]))
and then I would rerun the last two lines and get a new set every time… I didn’t finish the whole thing, it’s just the basic functionality
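(If I ever get back to it, the caching TODO is basically just dumping the frame to disk once, something like this - the file name here is made up:)

import pathlib

CACHE = pathlib.Path("wiktionary_frequency.csv")

if CACHE.exists():
    df = pd.read_csv(CACHE)
else:
    df = get_soup()
    df.to_csv(CACHE, index=False)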
(also as you can probably guess I don’t really use it lol, as always I have an idea and never finish it)