How to count words in Japanese?

TL;DR - how do you count words in Japanese? I’m not asking if there’s a specific counter for words, but what counts as a word. Is 行きます a word? Or is 行き one and ます another? If you write “私は学生です”, is that four words, or three, or…?

Context: I was talking with a friend earlier - we both work as professional translators, translating English to Spanish. In that line of work, you charge a certain amount of money per word, and half the time the client doesn’t give you the number of words in his document or sends the file in a format that doesn’t allow for counting software. So you end up either counting loads of words or giving the client a rough estimation and hoping you’re not selling yourself short.

So not that I’ll be translating Japanese professionally any time soon, but it made me wonder how you would even do that in Japanese. Do translators from Japanese charge by page or by line or by character instead of by word?

I tried looking around on Google but didn’t find much (yet). I did find this link, explaining how words can be separated, but it’s not quite what I was thinking of.

9 Likes

It seems that they indeed charge per character, at least this website cites the price of 3500 JPY per 400 characters English to Japanese. Or you can also calculate based on number of words in source language, in which case it’s 27 JPY per word English to Japanese.

https://www.ccfj.com/translation/translation_eg.html
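For what it's worth, here is a rough Python sketch of how the two pricing schemes quoted above could be computed. The rates are the ones cited in this post; everything else is my own assumption: that billing is pro rata rather than rounded up to whole 400-character blocks, that whitespace is not billed while punctuation is, and that English words are simply split on whitespace. The function names are made up for illustration.

```python
# Rates quoted in the post above (English to Japanese).
RATE_PER_BLOCK_JPY = 3500      # price per block of Japanese characters
BLOCK_SIZE_CHARS = 400         # characters per block
RATE_PER_SOURCE_WORD_JPY = 27  # price per English source word


def quote_by_characters(japanese_text: str) -> float:
    """Price a finished Japanese translation by character count.

    Assumption: whitespace is not billed, punctuation is, and the
    price is pro rata instead of rounded up to full blocks.
    """
    chars = sum(1 for ch in japanese_text if not ch.isspace())
    return chars * RATE_PER_BLOCK_JPY / BLOCK_SIZE_CHARS


def quote_by_source_words(english_text: str) -> int:
    """Price by English source words (assumption: split on whitespace)."""
    return len(english_text.split()) * RATE_PER_SOURCE_WORD_JPY


if __name__ == "__main__":
    print(quote_by_characters("私は学生です"))
    print(quote_by_source_words("I am a student"))
```

A real agency may well round up or set a minimum fee, so treat this only as a way to compare the two schemes.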

7 Likes

Thanks! I’ve never ever been paid based on a finished translation, but it might be the best mechanic for a language like Japanese.

1 Like

The question about how much it costs to have something translated is a little less interesting than the question in general, so I’ll answer that instead (heh).

I was perusing a Japanese language textbook for native elementary school students recently, and came upon this example sentence that was divided into its constituent 単語.

花が咲いた

  1) 花 2) が 3) 咲い 4) た

So that’s probably a bit different from how English speakers would think of it. It shows that while we translate 単語 as “word” the differences inherent to the languages could throw people off.

10 Likes

Thanks! That’s very much what I was interested in, too… and what you’re saying seems to line up with what I read in the link in my post.

It is interesting. Instinctively, I would’ve thought that “花が咲いた” has three words; since I’d count the conjugated verb as one word, or maybe even two words and a particle (が). I never would’ve thought to separate 咲い and た!

3 Likes

Yeah, I’m not sure if it makes sense to say that 花が咲いた has 4 “words” if we use the English meaning, but it has 4 単語 apparently.
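To make the difference concrete, here is a tiny Python sketch that counts the same sentence three ways: by character, by the textbook's 単語, and by something closer to the English-speaker intuition where the conjugated verb stays in one piece. The segmentation is typed in by hand from the textbook example, and `merge_auxiliaries` is a made-up helper for illustration, not any kind of standard tokenizer.

```python
sentence = "花が咲いた"

# Segmentation copied by hand from the textbook example.
tango = ["花", "が", "咲い", "た"]


def merge_auxiliaries(units, auxiliaries=("た",)):
    """Glue listed auxiliaries back onto the unit before them.

    Hypothetical helper: approximates the 'conjugated verb is one word'
    intuition. The auxiliary list is an assumption, not a full grammar.
    """
    merged = []
    for unit in units:
        if unit in auxiliaries and merged:
            merged[-1] += unit
        else:
            merged.append(unit)
    return merged


if __name__ == "__main__":
    print("characters:", len(sentence))
    print("単語:", len(tango), tango)
    english_style = merge_auxiliaries(tango)
    print("English-style words:", len(english_style), english_style)
```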

1 Like

I guess if we think of… linguistic items (??) instead of words it makes slightly more sense? That way you have something like

  1. 花 (noun)
  2. が (particle)
  3. 咲い (verb root?)
  4. た (verb declension?)

I’ve actually seen the verb endings called “auxiliary verbs” in their own right. Like “the auxiliary verb た”.

And a brief check of a Japanese dictionary does seem to support that.

https://www.weblio.jp/content/た

Scroll down to the one that has a bunch of definitions and that’s the た in question. It’s labeled as 助動 (auxiliary verb).

3 Likes

Thanks! That’s really interesting, and so different from the way western language grammar (in my experience) approaches verbs.

2 Likes

This is one fine topic. Leebosan, you are the man.

2 Likes

I would have never thought that. That’s a very interesting way to think about it for sure… I’ll have to look into it more, if only for curiosity’s sake.

1 Like

iKnow.jp includes spaces between “words” for their kana versions of sentences, and some of the breaks are in interesting places. The two interesting ones I’ve seen are the auxiliary verb いる after the て form, which kind of makes sense, and the explanatory ん. But they don’t actually put a space before a た like your native textbook explains, so they may just be doing whatever they want and not following some official standard.

For example:
彼はシートベルトをしていたので助かったんだ。

Broken into:

彼
は
シートベルト
を
して
いた
ので
助かった
ん
だ
I think that does roughly conform to an English speaker’s sense of “what is a word”. Though yeah, I think most people would consider んだ a contraction of のだ and treat it as one word, but it’s understandable to separate it too.

1 Like

I haven’t used iknow.jp in a while, but I can’t remember that. Is it only in their kana sentences that they use the separation?

That’s correct.

1 Like

My grad student, Daichi Kuroiwa, and I are currently working on a project for a test which will count how many words a person knows in Japanese (both natives and non-natives) so we have been thinking about this issue!

The issue of what to consider a word is problematic in any language. In English, there are three different ways that researchers have counted words:

a) Types. This is the official term for considering each individual word form to be a separate word. This means that go, going, goes, went and gone would all be counted as separate words. The obvious problem with this is that the changes are largely predictable (for a word like like, likes, liked the changes are entirely predictable, though of course English also has many frequent irregular verbs like go). Because these changes are predictable, some researchers have thought to group words as lemmas.

b) Lemmas: These are groups of words which have the same form except for very predictable morphological* endings. In this case, the lemma go includes go, going, goes, went and gone. The lemma for a noun would include all of the regular noun morphology in English: cat, cats, cat’s. For an adjective, it would include all of the regular adjective morphology: stinky, stinkier, stinkiest. In other words, a lemma includes the headword and its various inflected forms.

*These endings are called inflectional morphology, and in English there are only 8 categories:

  1. (V) third person singular (goes, hunts)
  2. (V) past tense (went, hunted)
  3. (V) present progressive (going, hunting)
  4. (V) past participle (gone, hunted)
  5. (N) plural (cats, children)
  6. (N) possessive (cat’s, children’s)
  7. (Adj) comparative form (stinkier)
  8. (Adj) superlative form (stinkiest)

However, you might think, well, that’s not enough. For example, all native speakers would know that the morpheme* {-ly} marks a word as an adverb. And you might think that everyone knows that {-tion} just means that the word is a noun, and is derived from a verb (hesitate → hesitation), and {-ment} does the same thing (govern → government). Researchers who think this have created the concept of a word family.

*A morpheme is the smallest part that a word can be divided into

c) Word family: groups of words which include the base word, its inflected forms, and closely related words with common derivational affixes (like –ment, -ly, un-, -ness and –tion). The problem here is that oftentimes even native speakers do not know what the related words mean! For example, I was reading something about CPR and one category was called “Agonal breathing”. Agonal? I didn’t know what that word meant, I thought at first. I thought a little longer and realized it must be an adjectival form from agony. But I had never heard it, and even though it used a very productive adjectival form {-al} (e.g., seasonal) I had a hard time recognizing it. Even more than native speakers, then, non-native speakers have been shown to not understand the related words even if they know the base word. Researchers have admitted that the problem with the concept of word families is deciding what words should be considered as having ‘closely related’ forms. Bauer and Nation (1993, International Journal of Lexicography) set up a scale of morphological forms, moving from very elementary forms to forms that are less likely to be known.
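As a rough illustration of how much the three definitions above change the count, here is a small Python sketch run over a toy word list. The two lookup tables are hand-written for this example only (real research would use a proper lemmatizer and an affix scale such as Bauer and Nation's), and the function names are made up.

```python
# Toy token list; "went" appears twice on purpose.
tokens = [
    "go", "goes", "went", "going", "cat", "cats",
    "govern", "government", "hesitate", "hesitation", "went",
]

# Hand-written for this example: inflected form -> headword (lemma).
LEMMA_OF = {"goes": "go", "went": "go", "going": "go", "gone": "go", "cats": "cat"}

# Hand-written for this example: derived lemma -> base of its word family.
FAMILY_OF = {"government": "govern", "hesitation": "hesitate"}


def lemma(word: str) -> str:
    return LEMMA_OF.get(word, word)


def family(word: str) -> str:
    base = lemma(word)
    return FAMILY_OF.get(base, base)


def count_types(words) -> int:
    """Every distinct word form counts separately."""
    return len(set(words))


def count_lemmas(words) -> int:
    """Inflected forms are folded into their headword."""
    return len({lemma(w) for w in words})


def count_families(words) -> int:
    """Derived forms are folded into the base word as well."""
    return len({family(w) for w in words})


if __name__ == "__main__":
    print("tokens:  ", len(tokens))
    print("types:   ", count_types(tokens))
    print("lemmas:  ", count_lemmas(tokens))
    print("families:", count_families(tokens))
```

With this list the same 11 tokens come out as 10 types, 6 lemmas, or 4 word families, which is the whole problem in miniature.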

So of course we get wildly different estimates of how many words a person knows depending on which of these definitions we use. Nation (2013: Learning Vocabulary in Another Language) says that the native speaker of English knows just under 20,000 word families on average. Van Heuven, Mandera, Keuleers & Brysbaert (2014, Quarterly Journal of Experimental Psychology) estimate that by age 20, the average person has heard somewhere between 81,000 and 292,000 word tokens (depending on how much you read!). They tested about 220,000 people on the words they knew and found that the average 20-year-old knows 42,000 lemmas and 11,100 word families (meanwhile, a 60-year-old has kept learning! They know 48,200 lemmas and 13,400 word families on average).

So how do we count words in Japanese? In the largest corpora of written and spoken Japanese available to date (the Balanced Corpus of Contemporary Written Japanese (BCCWJ) and the Corpus of Spontaneous Japanese (CSJ), developed by the National Institute for Japanese Language and Linguistics (NINJAL) and available online), the word family approach seems to be used, with the stipulation that only the most frequent and regular derivational affixes are considered part of the word family. For example, 重い (おもい; heavy) and 重さ (おもさ; heaviness) are counted as one word, while 重み (おもみ; importance) and 重り (おもり; weight, like a piece of metal) are considered different words, even though they share the same core and have similar concepts. My graduate student, Daichi Kuroiwa, considers that Japanese word families in this corpus include morphology up through what Bauer and Nation (cited above) count as Level 3, which is “The most frequent and regular derivational affixes”.

Of course, the issue of multi-word units that function as one word (as mentioned by some other posters) is also a question and many researchers think these should be included too, but it is hard to get computers to do this automatically, so such multi-word units have often not been included in word counts.

10 Likes

By the way, I came to this forum hoping to find out exactly how many vocabulary words are learned in Wanikani. I’m on Level 20 and have a little over 3000 words in my dashboard. I wondered if the number of vocab per level got less as I got higher, because I seemed to remember something about 6000 words (and I have seen 6300 quoted as the official number) total, but I glanced through the higher levels and they seem to all have around 100 words each. So I’m not sure the total number is going to be 6000. It seems like it should be around 9000. Can anyone comment on this?

Looking at wkstats.com, it seems to be 6300 words, indeed.

The total number of items (including radicals and kanji) is close to 9000, if that’s what you meant.

2 Likes

Ah, I guess so, since what’s in my dashboard includes the kanji and the radicals.

I actually want to know: what counter should be used for words?