[Userscript] Voice Input using Web Speech API

This is a userscript that lets you do reviews and lessons hands-free using dictation. It relies on the Web Speech API, which is only available in some browsers. I’ve only tested it in Google Chrome; it may not work elsewhere.

I’ve been using this for a few months now, but it’s not perfect and likely has some bugs. If you run into issues, please let me know here. I hope someone finds it useful.

Features

  • dictate in English for meanings and in Japanese for readings (automatically detects language)
  • works in reviews and lesson quizzes
  • support for user synonyms
  • built-in (optional) lightning mode
  • commands for marking wrong and going to the next flashcard

How to install

How to use

Once it’s installed and enabled, when you start a review or a lesson quiz, your browser should ask for permission to use your microphone. If you allow this, you can then dictate in English or Japanese as appropriate for the flashcard. Dictate exactly what you would type in.
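
For the curious, here is roughly what the setup looks like under the hood. This is a minimal sketch of Chrome’s Web Speech API, not the script’s actual source:

```javascript
// Minimal sketch of the recognition setup in Chrome (not the actual source).
// The microphone permission prompt appears on the first start().
const recognition = new webkitSpeechRecognition();
recognition.lang = 'ja-JP';        // 'en-US' when the card asks for a meaning
recognition.continuous = true;     // keep listening across utterances
recognition.interimResults = true; // stream partial transcripts for live feedback

recognition.onresult = (event) => {
  const result = event.results[event.results.length - 1];
  console.log(result[0].transcript, result.isFinal);
};

recognition.start();
```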

If you have not used speech recognition before, please be patient. It’s not perfect technology, and there is a learning curve. For kanji flashcards especially, you may need to repeat yourself.

Commands

There are two commands that make it possible to complete a whole review session hands-free. Simply say one of these words to trigger the behavior (a sketch of how the matching might work follows the list).

  • Mark a card incorrect: wrong, incorrect, mistake, 不正解, ふせいかい, 間違い, まちがい, だめ
  • Advance to the next card: next, つぎ, 次, ねくすと
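
Matching a command is essentially a lookup against these word lists. A minimal sketch (the script’s real lists may differ slightly):

```javascript
// Sketch of command matching; illustrative, not the script's exact code.
const WRONG = ['wrong', 'incorrect', 'mistake', '不正解', 'ふせいかい', '間違い', 'まちがい', 'だめ'];
const NEXT  = ['next', 'つぎ', '次', 'ねくすと'];

function checkCommand(transcript) {
  const heard = transcript.trim().toLowerCase();
  if (WRONG.includes(heard)) return 'mark-wrong';
  if (NEXT.includes(heard))  return 'advance';
  return null; // not a command: treat it as an answer attempt
}
```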

Troubleshooting

There is a live transcript that should give you visual feedback. By default, this is black text on a gold background (colors are customizable) that appears at the top of the screen.
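
For those curious how the overlay works, here is a rough sketch; the real script reads position and colors from its settings:

```javascript
// Rough sketch of the live transcript banner (styling is illustrative).
const banner = document.createElement('div');
banner.style.cssText =
  'position:fixed; top:0; left:0; right:0; z-index:9999;' +
  'padding:4px 8px; text-align:center; color:black; background:gold;';
document.body.appendChild(banner);

function showTranscript(text) {
  banner.textContent = text;
}
```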

If you suspect that it’s not working, this is the first thing to check. Try speaking a longer phrase, and confirm the script is actually able to hear you and the speech recognition is working. If not, make sure your microphone works in other software, and try this Web Speech API demo.

If you can see the live transcript but it’s not matching the flashcard, it’s possibly a bug, or one of the many situations where it’s difficult to match speech to the expected answer. Here are some known examples (a sketch of the kind of normalization involved follows the list):

  • Punctuation. A meaning like 屁理屈 “Far-Fetched Argument” should work, but there may be other words or phrases where punctuation causes issues
  • Kanji readings. These are often only fragments of real Japanese words, so the speech recognition is not always great at identifying them, especially if they are short or contain long vowels (e.g. しょう vs しょ)
  • Words or phrases that just happen to be absent from the dictionaries I’m using, or homonyms for random loan words, proper nouns, etc. I have been collecting a small set of homonyms (e.g. speech recognition hears “EC2” instead of 遺失・いしつ), and the script accounts for those I have found, but please let me know of others you find
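
To give an idea of what matching involves, here is a rough sketch of the normalization and homonym handling. The function names and homonym list are illustrative, not the script’s actual data:

```javascript
// Illustrative sketch of answer matching, not the script's actual code.
function normalize(text) {
  return text.trim().toLowerCase()
    .replace(/[.,!?'"-]/g, ' ') // punctuation becomes spaces
    .replace(/\s+/g, ' ')
    .trim();
}

// Recognition sometimes mishears 遺失・いしつ as "EC2".
const HOMONYMS = { 'ec2': '遺失' };

function matches(transcript, acceptedAnswers) {
  let heard = normalize(transcript);
  heard = HOMONYMS[heard] ?? heard;
  return acceptedAnswers.some((answer) => normalize(answer) === heard);
}
```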

Customizing

If you have WKOF installed, you can customize this script via the gear icon (a sketch of the settings dialog follows the list). The following features are customizable:

  • Live transcript on or off
  • Live transcript position
  • Live transcript text color and background color
  • Lightning mode (auto advance on correct answer)
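
The dialog follows the usual WKOF Settings module pattern, roughly like this; the option names here are hypothetical, not the script’s actual setting keys:

```javascript
// Sketch of a WKOF settings dialog; option names are hypothetical.
wkof.include('Menu,Settings');
wkof.ready('Menu,Settings').then(() => {
  new wkof.Settings({
    script_id: 'voice_input',
    title: 'Voice Input',
    content: {
      show_transcript: { type: 'checkbox', label: 'Live transcript', default: true },
      text_color:      { type: 'color',    label: 'Transcript text color', default: '#000000' },
      bg_color:        { type: 'color',    label: 'Transcript background', default: '#ffd700' },
      lightning:       { type: 'checkbox', label: 'Lightning mode', default: false },
    },
  }).open();
});
```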

Because of the way the Web Speech API streams results to the script, and the way the script switches language modes automatically when a flashcard changes, you may find the built-in lightning mode more reliable than lightning mode from another script like Double Check. If you run into any bugs related to this, please let me know, but also try disabling other lightning modes and turning on the built-in one.
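
The language switch is the tricky part: Chrome only reads the `lang` property when a session starts, so the session has to be restarted when a new card appears. Roughly (a sketch, not the actual source):

```javascript
// Sketch of the language switch between cards.
function switchLanguage(recognition, lang) {
  recognition.onend = () => {
    recognition.onend = null;
    recognition.lang = lang; // 'ja-JP' for readings, 'en-US' for meanings
    recognition.start();
  };
  recognition.abort(); // drop in-flight results; fires the 'end' event
}
```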

8 Likes

I have no idea how the speech recognition works exactly, but it seems the problem is that it can’t recognize kanji readings because it is trying to fit them to known words. You also mention words not present in the dictionaries. Could you not combine these two and add a dictionary with the kanji readings taught by WaniKani? Such a thing could probably be built using the WaniKani API.

1 Like

Wow! I haven’t tried this yet but well done for creating something so ambitious!

1 Like

Yes, it is trying to fit known words. My impression is that general-purpose speech recognition models (Web Speech in Chrome uses Google’s infrastructure, I think) are trained on natural speech, so they are better at whole phrases and sentences than at individual words, let alone fragments like kanji readings.

Using the WK API to identify possible gaps in the dictionaries should make it work a bit better (especially for levels that I have not reached yet myself and have not tested). However, there are limits even with the entire database of WK subjects. Here’s an example of a problem I ran into recently that is not solvable with the WK API: the item 却って[かえって] does not appear in the dictionary I’m using (JMdict). I can add it along with its reading, but that alone won’t help, because the speech recognition always returns 「帰って」. I can add that to the dictionary too, but I wouldn’t have discovered it without running into it in my own reviews. I may try to use the WK audio files to somehow identify problematic items ahead of time if I can figure that out, or use a morphological analysis library to cover some cases more generally.
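
For anyone who wants to experiment, collecting every kanji and vocab reading WK teaches is straightforward with API v2. A sketch, with error handling omitted and YOUR_TOKEN as a placeholder:

```javascript
// Sketch: build a characters -> readings map from the WaniKani API v2.
async function fetchReadings() {
  const readings = new Map();
  let url = 'https://api.wanikani.com/v2/subjects?types=kanji,vocabulary';
  while (url) {
    const res = await fetch(url, {
      headers: { Authorization: 'Bearer YOUR_TOKEN' },
    });
    const page = await res.json();
    for (const subject of page.data) {
      readings.set(
        subject.data.characters,
        (subject.data.readings ?? []).map((r) => r.reading)
      );
    }
    url = page.pages.next_url; // null on the last page
  }
  return readings;
}
```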

Thanks, but the hard work is done on the machine learning side of speech recognition. :sweat_smile: The Web Speech API itself is pretty simple to use, although the edge cases can be annoying, as described above.

Hopefully these edge cases do not give the wrong impression, or put anyone off trying. I would say things work well 90-95% of the time, and at worst I only have to type around 10% of review readings.

3 Likes

Can you not get it to return all output in hiragana? Or is it stuck on using kanji?

Sometimes it will return hiragana, sometimes katakana, and sometimes kanji. The Web Speech API doesn’t give you any control over this, as far as I can tell.
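
The katakana case at least is mechanical to handle, since the hiragana and katakana blocks sit at a constant offset in Unicode; kanji output still needs a dictionary lookup. A sketch:

```javascript
// Sketch: fold katakana into hiragana before comparing.
// The two Unicode blocks are offset by 0x60.
function toHiragana(text) {
  return text.replace(/[\u30A1-\u30F6]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - 0x60)
  );
}

toHiragana('ネクスト'); // => 'ねくすと'
```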

I have been trying it out today and have some remarks:

There is no distinction between no sound and no valid input, which made it a bit difficult to debug whether my mic was working or whether I just cannot pronounce anything.

If you give a wrong or incorrectly recognized input, the accuracy and processing speed of any subsequent inputs for the same question go down significantly. I suspect it is still trying to combine things it heard earlier with the new input, instead of starting from scratch.

I don’t know what your problems were with Double Check, but for me the Double Check lightning mode works fine.

For me it was able to answer about 4 in 5 questions without requiring manual input.
The English recognition works better than the Japanese one, but even that gets stuck sometimes, for instance when I want to say “mail” and it keeps insisting I am saying “male”. Not sure if that is even solvable, though.

1 Like

Thanks for your feedback, I appreciate it.

Do you mean that when you were speaking, you only saw a live transcript for correct answers? It should always show a live transcript if it hears new speech, regardless of whether it’s the correct answer.

Yes, if you say one thing and it’s not the correct answer, you may need to pause briefly so that the next thing you say is isolated and evaluated on its own. On the other hand, if you said the correct thing but the recognition misinterpreted it, and you repeat the same utterance and get no feedback, it probably just heard the same thing again. There may be some improvements I can make to at least give more feedback in this case.

Double Check’s lightning mode also mostly worked for me, but it would occasionally trip up the automatic language mode change, or cause my script to submit again on the next card. Glad it is working for you.

I could incorporate a list of English homophones, which would solve this. In the meantime, if you get annoyed by mail/male, you can always add it to your user synonyms in WaniKani itself, and the voice script should pick that up.

Thanks again for the feedback

1 Like

Sometimes it would do this, but other times nothing shows up, even when I pronounce some entirely different word just to see if that does anything. There is no good way to tell whether my mic is not picking up the sound correctly, the script is still calculating something and lagging behind, or the sound it picked up cannot be fitted onto a known word. Some sort of indicator showing the internal state of the script would be nice, something like a small icon/dot that changes when the script is picking up sound vs. when it is waiting in standby.

Especially after trying the same word a couple times in a row, which is what you do when it is not picked up the first time, it seems to get stuck a lot. Even waiting a couple seconds and then pronouncing something easy like はい or ひと does not produce results then. But maybe I am just a bit too impatient.

Getting voice recognition to go from good to great is pretty difficult I think, but the script is already useful to practice pronunciation.

This is a good idea, and I think I can incorporate something to provide more visual feedback in these cases.
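
One likely approach is to hook the recognition lifecycle events, which Chrome does expose. A sketch (the styling and states are illustrative):

```javascript
// Sketch of a status dot driven by recognition lifecycle events.
function attachIndicator(recognition, dot) {
  recognition.onaudiostart  = () => { dot.style.background = 'gray';  }; // mic open, standby
  recognition.onspeechstart = () => { dot.style.background = 'green'; }; // hearing speech
  recognition.onspeechend   = () => { dot.style.background = 'gray';  }; // processing/standby
  recognition.onerror       = () => { dot.style.background = 'red';   }; // something failed
}
```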

Yes, as I said in the intro, it’s not perfect technology and is much fuzzier than how we usually interact with software, plus I have no control over the models themselves and how they are tuned. But I’m glad to hear it’s useful.

1 Like

So I tried this last night, and it doesn’t work on Brave: the mic won’t stay on. I know Brave isn’t supported, but I thought you might like to know. From googling, it seems to be a problem with the browser itself. It’s a pity, as I was excited to give it a go! It’s also surprising, because Brave is based on Chromium. Oh well!

1 Like

When it works this is amazing, but it stops every 3rd or 4th card. I’d love to see this script improved in the future. Thanks!

I have also noticed that every now and then, the transcript correctly shows what I said, but the question does not advance. Typing in the exact same thing does work.

I have been working on some changes to make it easier to understand the state of the voice recognition, and to train yourself on its limitations. Sorry this has taken longer than I had hoped; working out the best way to present the information, given the limitations of the API, has not been obvious.

I have a new version available which includes new options for the live transcript:

  • configure it to display multiple transcripts
  • configure how long the transcripts remain on the screen

Additionally, I have changed what text is presented as a transcript: if the speech recognition engine hears any speech, you should always see some feedback. Sometimes, that will be nothing more than a microphone emoji.

This means that the recognition heard speech but returned no data: my code gets an empty string “” from the engine, and there is of course not much you can do with that. In one case, I had just uttered 「ちょう」 and nothing was recognized.
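
In code terms, the fallback is roughly this (a sketch; showTranscript stands in for whatever renders the overlay):

```javascript
// Sketch: show something even when the engine returns an empty transcript.
recognition.onresult = (event) => {
  const result = event.results[event.results.length - 1];
  const text = result[0].transcript.trim();
  showTranscript(text.length > 0 ? text : '🎤'); // heard speech, got no text
};
```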

In my experience, this occurs often with certain utterances that need to be repeated, like「ちょう」but also「じゅう」and「しょう」, basically many kanji readings that do not have more context for recognition to work.

Note that if you say something and see no feedback at all, that means the engine did not hear you at all. If you have this problem a lot, it may be worth trying an alternate microphone. I have done a lot of testing with a built-in laptop mic and that works; however, even the mic on cheap earbuds works better, and my Blue Yeti works much better than that. Not always an option, but worth considering.

Hi, thank you for the feedback. Does it start working again on the next card after you type an answer in manually (without a browser refresh)? If you can try the new version and describe in more detail exactly what you see with the transcript, that will help.

There may have been a bug related to this that might be fixed now. Let me know if it happens again with the new version, and if it does, save the example if you can.

1 Like

Hey, I haven’t upgraded to the new version yet, but to answer your questions the best I can for now…

As of tonight it did work as normal on the next cards without a browser refresh.

So far I’ve noticed it almost always fails on “# day” answers, like “6th day” for 6日 or “5th day” for 5日, and so on. I can see that the voice dialogue captures it, but there is no input into WaniKani. I should mention that I have lightning mode turned on as well.

1 Like

For instance this question:

This is the exact answer. I had to repeat it multiple times in order to get the screenshot, and none of them counted as correct for some reason.

1 Like

The exact same review came up again today, and this time it just worked. So whatever the problem is, it does not seem to be tied to the specific question.

Updated to v0.2, and now the input is extremely fast and snappy, which is nice, but it’s too fast to see the input. I even tried adding seconds in the settings, but it’s not doing anything.

Also, the input when speaking Japanese is not as good as it was in 0.1.
When I say something, it rarely works for Japanese kanji vocab or meanings. English words are working fine. It could be my pronunciation, of course, but 0.1 was much better at capturing what I was trying to say.

Screenshot of the non-input issue, which still remains for numbers.

Using Chrome Version 116.0.5845.111

Do you mean the live transcript disappears too fast?

Nothing has changed about the recognition itself; it’s still just the Google speech recognition in the background. I did change the way I process results from Google, but the raw results from recognition are always shown in the transcript now. If the transcript does not even show what you uttered, you may need to try another mic or experiment with your audio setup (sitting closer or further away makes a difference in my experience with a built-in laptop mic).

Does it work better for Japanese readings of vocab vs. kanji in your experience? That would match my experience. The recognition struggles to hear individual kanji readings, as I described before; I believe this is a fundamental limitation of the engine. I plan to experiment with some optional features to help with this (for instance, when presented with a kanji card, speaking a whole word that uses the expected reading).

This is a bug and something I need to fix in the code. Many cards with numbers won’t work; it’s an oversight on my part.

Thanks for the additional details. I have seen an issue in some reviews today where the very last card of a review session gets stuck, but I can’t reliably reproduce it. Were you having issues with the last card, by any chance?