My grad student, Daichi Kuroiwa, and I are currently working on a test which will count how many words a person (native or non-native speaker) knows in Japanese, so we have been thinking about this issue!
The issue of what to consider a word is problematic in any language. In English, there are three different ways that researchers have counted words:
a) Types. This is the official term for considering each individual word form to be a separate word. This means that go, going, goes, went and gone would all be counted as separate words. The obvious problem with this is that many of the changes are totally predictable (for a word like like, likes, liked the changes are predictable, though of course English also has many frequent irregular verbs like go). Because these changes are largely predictable, some researchers have thought to group words as lemmas.
b) Lemmas: These are groups of words which have the same form except for very predictable morphological* endings. In this case, the lemma go includes go, going, goes, went and gone. The lemma for a noun would include all of the regular noun morphology in English: cat, cats, cat’s. For an adjective, it would include all of the regular adjective morphology: stinky, stinkier, stinkiest. In other words, a lemma includes the headword and its various inflected forms.
*These endings are called inflectional morphology, and in English there are only 8 categories:
- (V); third person singular (goes, hunts)
- (V); past tense (went, hunted)
- (V); present progressive (going, hunting)
- (V); past participle (gone, hunted)
- (N); plural (cats, children)
- (N); possessive (cat’s, children’s)
- (Adj); comparative form (stinkier)
- (Adj); superlative form (stinkiest)
However, you might think, well, that’s not enough. For example, all native speakers would know that the morpheme* {-ly} marks a word as an adverb. And you might think that everyone knows that {-tion} just means that the word is a noun derived from a verb (hesitate → hesitation), and that {-ment} does the same thing (govern → government). Researchers who think this have created the concept of a word family.
*A morpheme is the smallest meaningful part that a word can be divided into.
c) Word family: groups of words which include the base word, its inflected forms, and closely related words with common derivational affixes (like –ment, -ly, un-, -ness and –tion). The problem here is that often even native speakers do not know what the related words mean! For example, I was reading something about CPR and one category was called “Agonal breathing”. Agonal? At first I thought I didn’t know what that word meant. I thought a little longer and realized it must be an adjectival form of agony. But I had never heard it, and even though it uses a very productive adjectival suffix {-al} (e.g., seasonal), I had a hard time recognizing it. Non-native speakers, even more than native speakers, have been shown not to understand the related words even if they know the base word. Researchers have admitted that the problem with the concept of word families is deciding which words should be considered as having ‘closely related’ forms. Bauer and Nation (1993, International Journal of Lexicography) set up a scale of morphological forms, moving from very elementary forms to forms that are less likely to be known.
So of course we get wildly different estimates of how many words a person knows depending on which of these definitions we use. Nation (2013: Learning Vocabulary in Another Language) says that the native speaker of English knows just under 20,000 word families on average. Van Heuven, Mandera, Keuleers & Brysbaert (2014, Quarterly Journal of Experimental Psychology) estimate that by age 20, the average person has heard somewhere between 81,000 and 292,000 word tokens (depending on how much you read!). They tested about 220,000 people on the words they knew and found that the average 20-year-old knows 42,000 lemmas and 11,100 word families (and a 60-year-old has kept learning! They know 48,200 lemmas and 13,400 word families on average).
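To make the difference between the three definitions concrete, here is a toy sketch in Python that counts the same short list of words as types, lemmas and word families. The two lookup tables (LEMMA_OF and FAMILY_OF) and the function name count_words are hand-made for this illustration; they are my own assumptions, not the output of a real lemmatizer and not Bauer and Nation's actual levels.

```python
# Toy illustration only: both tables are hand-made assumptions for this example,
# not data from a real lemmatizer or from any published word-family list.

# Maps each word form (type) to its lemma, i.e. inflection is collapsed.
LEMMA_OF = {
    "go": "go", "goes": "go", "going": "go", "went": "go", "gone": "go",
    "govern": "govern", "governs": "govern", "governed": "govern",
    "government": "government", "governments": "government",
    "hesitate": "hesitate", "hesitated": "hesitate",
    "hesitation": "hesitation",
}

# Maps each lemma to the base word of its word family,
# i.e. common derivational affixes such as -ment and -tion are collapsed too.
FAMILY_OF = {
    "go": "go",
    "govern": "govern", "government": "govern",
    "hesitate": "hesitate", "hesitation": "hesitate",
}


def count_words(tokens):
    """Count one token list under each definition of 'word'."""
    types = {t.lower() for t in tokens}               # distinct word forms
    lemmas = {LEMMA_OF.get(t, t) for t in types}      # unknown forms count as their own lemma
    families = {FAMILY_OF.get(l, l) for l in lemmas}  # unknown lemmas count as their own family
    return {
        "tokens": len(tokens),
        "types": len(types),
        "lemmas": len(lemmas),
        "families": len(families),
    }


sample = ("went gone goes go governs government governments "
          "governed hesitated hesitation went").split()
counts = count_words(sample)
print(counts)
```

With this sample, the same 11 running words come out as 10 types, 5 lemmas and only 3 word families, which is the same shrinking pattern seen in the published estimates.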
So how do we count words in Japanese? In the largest corpora of written and spoken Japanese available to date (the Balanced Corpus of Contemporary Written Japanese (BCCWJ) and the Corpus of Spontaneous Japanese (CSJ), developed by the National Institute for Japanese Language and Linguistics (NINJAL) and available online), the word family approach seems to be used, with the stipulation that only the most frequent and regular derivational affixes are considered part of the word family. For example, 重い (おもい; heavy) and 重さ (おもさ; heaviness) are counted as one word, while 重み (おもみ; importance) and 重り (おもり; a weight, like a piece of metal) are considered different words, even though they share the same core and have similar concepts. My graduate student, Daichi Kuroiwa, considers that Japanese word families in these corpora include morphology up through what Bauer and Nation (cited above) count as Level 3, which is “The most frequent and regular derivational affixes”.
Of course, the issue of multi-word units that function as one word (as mentioned by some other posters) is also a question and many researchers think these should be included too, but it is hard to get computers to do this automatically, so such multi-word units have often not been included in word counts.