[Userscript] Voice Input using Web Speech API

This is a userscript that lets you do reviews and lessons hands-free using dictation. It relies on the Web Speech API, which is only available in some browsers. I’ve only tested it in Google Chrome; it may not work elsewhere.

I’ve been using this for a few months now, but it’s not perfect and likely has some bugs. If you run into issues, please let me know here. I hope someone finds it useful.

Features

  • dictate in English for meanings and in Japanese for readings (automatically detects language)
  • works in reviews and lesson quizzes
  • support for user synonyms
  • built-in (optional) lightning mode
  • commands for marking wrong and going to the next flashcard

How to install

How to use

Once it’s installed and enabled, when you start a review or a lesson quiz, your browser should ask for permission to use your microphone. If you allow this, you can then dictate in English or Japanese as appropriate for the flashcard. Dictate exactly what you would type in.
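
For the curious, here is roughly what the setup looks like under the hood. This is a minimal sketch of Chrome’s Web Speech API, not the script’s actual source:

```javascript
// Minimal sketch of the recognition setup in Chrome (not the actual source).
// The microphone permission prompt appears on the first start().
const recognition = new webkitSpeechRecognition();
recognition.lang = 'ja-JP';        // 'en-US' when the card asks for a meaning
recognition.continuous = true;     // keep listening across utterances
recognition.interimResults = true; // stream partial transcripts for live feedback

recognition.onresult = (event) => {
  const result = event.results[event.results.length - 1];
  console.log(result[0].transcript, result.isFinal);
};

recognition.start();
```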

If you have not used speech recognition before, please be patient. It’s not perfect technology, and there is a learning curve. For kanji flashcards especially, you may need to repeat yourself.

Commands

There are two commands that make it possible to complete a whole review session hands-free. Simply say one of these words to trigger the behavior (a sketch of how the matching might work follows the list).

  • Mark a card incorrect: wrong, incorrect, mistake, 不正解, ふせいかい, 間違い, まちがい, だめ
  • Advance to the next card: next, つぎ, 次, ねくすと
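
Matching a command is essentially a lookup against these word lists. A minimal sketch (the script’s real lists may differ slightly):

```javascript
// Sketch of command matching; illustrative, not the script's exact code.
const WRONG = ['wrong', 'incorrect', 'mistake', '不正解', 'ふせいかい', '間違い', 'まちがい', 'だめ'];
const NEXT  = ['next', 'つぎ', '次', 'ねくすと'];

function checkCommand(transcript) {
  const heard = transcript.trim().toLowerCase();
  if (WRONG.includes(heard)) return 'mark-wrong';
  if (NEXT.includes(heard))  return 'advance';
  return null; // not a command: treat it as an answer attempt
}
```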

Troubleshooting

There is a live transcript that should give you visual feedback. By default, this is black text on a gold background (colors are customizable) that appears at the top of the screen.
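
For those curious how the overlay works, here is a rough sketch; the real script reads position and colors from its settings:

```javascript
// Rough sketch of the live transcript banner (styling is illustrative).
const banner = document.createElement('div');
banner.style.cssText =
  'position:fixed; top:0; left:0; right:0; z-index:9999;' +
  'padding:4px 8px; text-align:center; color:black; background:gold;';
document.body.appendChild(banner);

function showTranscript(text) {
  banner.textContent = text;
}
```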

If you suspect that it’s not working, this is the first thing to check. Try speaking a longer phrase, and confirm the script is actually able to hear you and the speech recognition is working. If not, make sure your microphone works in other software, and try this Web Speech API demo.

If you can see the live transcript but it’s not matching the flashcard, it’s possibly a bug, or one of the many situations where it’s difficult to match speech to the expected answer. Here are some known examples (a sketch of the kind of normalization involved follows the list):

  • Punctuation. A meaning like 屁理屈 “Far-Fetched Argument” should work, but there may be other words or phrases where punctuation causes issues
  • Kanji readings. These are often only fragments of real Japanese words, so the speech recognition is not always great at identifying them, especially if they are short or contain long vowels (e.g. しょう vs しょ)
  • Words or phrases that just happen to be absent from the dictionaries I’m using, or homonyms for random loan words, proper nouns, etc. I have been collecting a small set of homonyms (e.g. speech recognition hears “EC2” instead of 遺失・いしつ), and the script accounts for those I have found, but please let me know of others you find
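
To give an idea of what matching involves, here is a rough sketch of the normalization and homonym handling. The function names and homonym list are illustrative, not the script’s actual data:

```javascript
// Illustrative sketch of answer matching, not the script's actual code.
function normalize(text) {
  return text.trim().toLowerCase()
    .replace(/[.,!?'"-]/g, ' ') // punctuation becomes spaces
    .replace(/\s+/g, ' ')
    .trim();
}

// Recognition sometimes mishears 遺失・いしつ as "EC2".
const HOMONYMS = { 'ec2': '遺失' };

function matches(transcript, acceptedAnswers) {
  let heard = normalize(transcript);
  heard = HOMONYMS[heard] ?? heard;
  return acceptedAnswers.some((answer) => normalize(answer) === heard);
}
```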

Customizing

If you have WKOF installed, you can customize this script via the gear icon (a sketch of the settings dialog follows the list). The following features are customizable:

  • Live transcript on or off
  • Live transcript position
  • Live transcript text color and background color
  • Lightning mode (auto advance on correct answer)
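
The dialog follows the usual WKOF Settings module pattern, roughly like this; the option names here are hypothetical, not the script’s actual setting keys:

```javascript
// Sketch of a WKOF settings dialog; option names are hypothetical.
wkof.include('Menu,Settings');
wkof.ready('Menu,Settings').then(() => {
  new wkof.Settings({
    script_id: 'voice_input',
    title: 'Voice Input',
    content: {
      show_transcript: { type: 'checkbox', label: 'Live transcript', default: true },
      text_color:      { type: 'color',    label: 'Transcript text color', default: '#000000' },
      bg_color:        { type: 'color',    label: 'Transcript background', default: '#ffd700' },
      lightning:       { type: 'checkbox', label: 'Lightning mode', default: false },
    },
  }).open();
});
```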

Because of the way the Web Speech API streams results to the script, and the way the script switches language modes automatically when a flashcard changes, you may find the built-in lightning mode more reliable than lightning mode from another script like Double Check. If you run into any bugs related to this, please let me know, but also try disabling other lightning modes and turning on the built-in one.
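
The language switch is the tricky part: Chrome only reads the `lang` property when a session starts, so the session has to be restarted when a new card appears. Roughly (a sketch, not the actual source):

```javascript
// Sketch of the language switch between cards.
function switchLanguage(recognition, lang) {
  recognition.onend = () => {
    recognition.onend = null;
    recognition.lang = lang; // 'ja-JP' for readings, 'en-US' for meanings
    recognition.start();
  };
  recognition.abort(); // drop in-flight results; fires the 'end' event
}
```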

8 Likes

I have no idea how the speech recognition works exactly, but it seems the problem is that it can’t recognize kanji readings because it is trying to fit them to known words. You also mention words not present in the dictionaries. Could you not combine these two and add a dictionary with the kanji readings taught by WaniKani? Such a thing could probably be built using the WaniKani API.

1 Like

Wow! I haven’t tried this yet but well done for creating something so ambitious!

1 Like

Yes, it is trying to fit known words. My impression is that general-purpose speech recognition models (Web Speech in Chrome uses Google’s infrastructure, I think) are trained on natural speech, so they are better at whole phrases and sentences than at individual words, let alone fragments like kanji readings.

Using the WK API to identify possible gaps in the dictionaries should make it work a bit better (especially for levels that I have not reached yet myself and have not tested). However, there are limits even with the entire database of WK subjects. Here’s an example of a problem I ran into recently that is not solvable with the WK API: the item 却って[かえって] does not appear in the dictionary I’m using (JMdict). I can add it along with its reading, but that alone won’t help, because the speech recognition always returns 「帰って」. I can add that to the dictionary too, but I wouldn’t have discovered it without running into it in my own reviews. I may try to use the WK audio files to somehow identify problematic items ahead of time if I can figure that out, or use a morphological analysis library to cover some cases more generally.
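
For anyone who wants to experiment, collecting every kanji and vocab reading WK teaches is straightforward with API v2. A sketch, with error handling omitted and YOUR_TOKEN as a placeholder:

```javascript
// Sketch: build a characters -> readings map from the WaniKani API v2.
async function fetchReadings() {
  const readings = new Map();
  let url = 'https://api.wanikani.com/v2/subjects?types=kanji,vocabulary';
  while (url) {
    const res = await fetch(url, {
      headers: { Authorization: 'Bearer YOUR_TOKEN' },
    });
    const page = await res.json();
    for (const subject of page.data) {
      readings.set(
        subject.data.characters,
        (subject.data.readings ?? []).map((r) => r.reading)
      );
    }
    url = page.pages.next_url; // null on the last page
  }
  return readings;
}
```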

Thanks, but the hard work is done on the machine learning side of speech recognition. :sweat_smile: The Web Speech API itself is pretty simple to use, although the edge cases can be annoying, as described above.

Hopefully these edge cases do not give the wrong impression, or put anyone off trying. I would say things work well 90-95% of the time, and at worst I only have to type around 10% of review readings.

3 Likes

Can you not get it to return all output in hiragana? Or is it stuck on using kanji?

Sometimes it will return hiragana, sometimes katakana, and sometimes kanji. The Web Speech API doesn’t give you any control over this, as far as I can tell.
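
The katakana case at least is mechanical to handle, since the hiragana and katakana blocks sit at a constant offset in Unicode; kanji output still needs a dictionary lookup. A sketch:

```javascript
// Sketch: fold katakana into hiragana before comparing.
// The two Unicode blocks are offset by 0x60.
function toHiragana(text) {
  return text.replace(/[\u30A1-\u30F6]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - 0x60)
  );
}

toHiragana('ネクスト'); // => 'ねくすと'
```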

I have been trying it out today and have some remarks:

There is no distinction between no sound and no valid input, which made it a bit difficult to debug whether my mic was working or whether I just cannot pronounce anything.

If you give a wrong or incorrectly recognized input, the accuracy and processing speed of any subsequent inputs for the same question go down significantly. I suspect it is still trying to combine things it heard earlier with the new input, instead of starting from scratch.

I don’t know what your problems were with Double Check, but for me the Double Check lightning mode works fine.

For me it was able to answer about 4 in 5 questions without requiring manual input.
The English recognition works better than the Japanese one, but even that gets stuck sometimes, for instance when I want to say “mail” and it keeps insisting I am saying “male”. Not sure if that is even solvable, though.

1 Like

Thanks for your feedback, I appreciate it.

Do you mean that when you were speaking, you only saw a live transcript for correct answers? It should always show a live transcript if it hears new speech, regardless of whether it’s the correct answer.

Yes, if you say one thing and it’s not the correct answer, you may need to pause briefly so that the next thing you say is isolated and evaluated on its own. On the other hand, if you said the correct thing but the recognition misinterpreted it, and you repeat the same utterance and get no feedback, it probably just heard the same thing again. There may be some improvements I can make to at least give more feedback in this case.

Double Check’s lightning mode also mostly worked for me, but it would occasionally trip up the automatic language mode change, or cause my script to submit again on the next card. Glad it is working for you.

I could incorporate a list of English homophones, which would solve this. In the meantime, if you get annoyed by mail/male, you can always add it to your user synonyms in WaniKani itself, and the voice script should pick that up.

Thanks again for the feedback

1 Like

Sometimes it would do this, but other times nothing shows up, even when I pronounce some entirely different word just to see if that does anything. There is no good way to tell whether my mic is not picking up the sound correctly, the script is still calculating something and lagging behind, or the sound it picked up cannot be fitted onto a known word. Some sort of indicator showing the internal state of the script would be nice, something like a small icon/dot that changes when the script is picking up sound vs. when it is waiting in standby.

Especially after trying the same word a couple times in a row, which is what you do when it is not picked up the first time, it seems to get stuck a lot. Even waiting a couple seconds and then pronouncing something easy like はい or ひと does not produce results then. But maybe I am just a bit too impatient.

Getting voice recognition to go from good to great is pretty difficult I think, but the script is already useful to practice pronunciation.

This is a good idea, and I think I can incorporate something to provide more visual feedback in these cases.
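
One likely approach is to hook the recognition lifecycle events, which Chrome does expose. A sketch (the styling and states are illustrative):

```javascript
// Sketch of a status dot driven by recognition lifecycle events.
function attachIndicator(recognition, dot) {
  recognition.onaudiostart  = () => { dot.style.background = 'gray';  }; // mic open, standby
  recognition.onspeechstart = () => { dot.style.background = 'green'; }; // hearing speech
  recognition.onspeechend   = () => { dot.style.background = 'gray';  }; // processing/standby
  recognition.onerror       = () => { dot.style.background = 'red';   }; // something failed
}
```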

Yes, as I said in the intro, it’s not perfect technology and is much fuzzier than how we usually interact with software, plus I have no control over the models themselves and how they are tuned. But I’m glad to hear it’s useful.

1 Like

So I tried this last night, and it doesn’t work on Brave: the mic won’t stay on. I know Brave isn’t supported, but I thought you might like to know. From googling, it seems to be a problem with the browser itself. It’s a pity, as I was excited to give it a go! It’s also surprising, because Brave is based on Chromium. Oh well!

1 Like

When it works this is amazing, but it stops every 3rd or 4th card. I’d love to see this script improved in the future. Thanks!

I have also noticed that every now and then, the transcript correctly shows what I said, but the question does not advance. Typing in the exact same thing does work.

I have been working on some changes to make it easier to understand the state of the voice recognition, and to train yourself on its limitations. Sorry this has taken longer than I had hoped; working out the best way to present the information, given the limitations of the API, has not been obvious.

I have a new version available which includes new options for the live transcript:

  • configure it to display multiple transcripts
  • configure how long the transcripts remain on the screen

Additionally, I have changed what text is presented as a transcript: if the speech recognition engine hears any speech, you should always see some feedback. Sometimes, that will be nothing more than a microphone emoji.

This means that the recognition heard speech but returned no data: my code gets an empty string “” from the engine, and there is of course not much you can do with that. In one case, I had just uttered 「ちょう」 and nothing was recognized.
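
In code terms, the fallback is roughly this (a sketch; showTranscript stands in for whatever renders the overlay):

```javascript
// Sketch: show something even when the engine returns an empty transcript.
recognition.onresult = (event) => {
  const result = event.results[event.results.length - 1];
  const text = result[0].transcript.trim();
  showTranscript(text.length > 0 ? text : '🎤'); // heard speech, got no text
};
```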

In my experience, this occurs often with certain utterances that need to be repeated, like「ちょう」but also「じゅう」and「しょう」, basically many kanji readings that do not have more context for recognition to work.

Note that if you say something and see no feedback at all, that means the engine did not hear you at all. If you have this problem a lot, it may be worth trying an alternate microphone. I have done a lot of testing with a built-in laptop mic and that works; however, even the mic on cheap earbuds works better, and my Blue Yeti works much better than that. Not always an option, but worth considering.

Hi, thank you for the feedback. Does it start working again on the next card after you type an answer in manually (without a browser refresh)? If you can try the new version and describe in more detail exactly what you see with the transcript, that will help.

There may have been a bug related to this that might be fixed now. Let me know if it happens again with the new version, and if it does, save the example if you can.

1 Like

Hey, I haven’t upgraded to the new version yet, but to answer your questions the best I can for now…

As of tonight it did work as normal on the next cards without a browser refresh.

So far I’ve noticed it almost always fails on “# day” answers, like “6th day” for 6日 or “5th day” for 5日, and so on. I can see that the voice dialogue captures it, but there is no input into WaniKani. I should mention that I have lightning mode turned on as well.

1 Like

For instance this question:

This is the exact answer. I had to repeat it multiple times in order to get the screenshot, and none of them counted as correct for some reason.

1 Like

The exact same review came up again today, and this time it just worked. So whatever the problem is, it does not seem to be tied to the specific question.

Updated to v0.2, and now the input is extremely fast and snappy, which is nice, but it’s too fast to see the input. I even tried adding seconds in the settings, but it’s not doing anything.

Also, the input when speaking Japanese is not as good as it was in 0.1.
When I say something, it rarely works for Japanese kanji vocab or meanings. English words are working fine. It could be my pronunciation, of course, but 0.1 was much better at capturing what I was trying to say.

Screenshot of the non-input issue, which still remains for numbers.

Using Chrome Version 116.0.5845.111

Do you mean the live transcript disappears too fast?

Nothing has changed about the recognition itself; it’s still just the Google speech recognition in the background. I did change the way I process results from Google, but the raw results from recognition are always shown in the transcript now. If the transcript does not even show what you uttered, you may need to try another mic or experiment with your audio setup (sitting closer or further away makes a difference in my experience with a built-in laptop mic).

Does it work better for Japanese readings of vocab vs. kanji in your experience? That would match my experience. The recognition struggles to hear individual kanji readings, as I described before; I believe this is a fundamental limitation of the engine. I plan to experiment with some optional features to help with this (for instance, when presented with a kanji card, speaking a whole word that uses the expected reading).

This is a bug and something I need to fix in the code. Many cards with numbers won’t work; it’s an oversight on my part.

Thanks for the additional details. I have seen an issue in some reviews today where the very last card of a review session gets stuck, but I can’t reliably reproduce it. Were you having issues with the last card, by any chance?