Siri TTS vs Amazon Polly TTS

Hi everybody,
I’m new here. I started my Japanese language journey several weeks ago, and I’m trying to figure out the best TTS engine to voice WaniKani sample sentences, as I find those quite helpful for seeing kanji in context.

Prior to macOS Ventura, Apple’s own TTS was subpar: very flat in pitch, robotic-sounding, and prone to reading mistakes. In Ventura, however, the built-in Japanese Siri voice has improved significantly; it is now actually very “melodic” and human-like. Yet it still makes the same reading mistakes!

Another alternative I found is Amazon Polly, which is a bit too technical to set up but free to use until a certain (quite generous) quota is reached. These days it sounds a bit worse than Siri, but it does not make any errors in readings. Polly’s pitch accent is a little bit more expressive than Siri’s, but it’s no longer a deal breaker.

Today I spent a couple hours and mashed up a simple Chrome extension, just to compare both engines, and at this point I wanted to clarify: are there any other solutions to this problem? I could not believe that by default sentences are not voiced in WaniKani, and there are very few discussions about TTS on this forum, maybe I am missing something?

Here’s an example of both engines: you can clearly hear Siri make a mistake on the very first mora, and unfortunately it happens quite often. In nearly every batch of sentences she’ll find a way to screw up, which is extremely upsetting considering how much better it sounds than Polly… And it’s been like this for years, so I don’t expect Apple to address it any time soon.

If there’s any interest in such a Polly extension for Chrome, I can make it user-friendly, publish it, and open-source it so everyone can use it. But first I wanted to ask: what do you guys use to voice the sentences?



I do find it quite strange that Siri defaults 五月 to the name (or old word) さつき instead of the (obviously very common) word ごがつ.


The Advanced Context Sentence userscript offers TTS for the context sentences, but it only uses voices provided by the web browser through the Web Speech API, or alternatively the Google Translate TTS. I personally really like “Microsoft Nanami”, which is available in Microsoft Edge.

The Google Translate TTS seems to have the same problem with 五月.


I find the anime context sentences userscript to be far more helpful than WK’s context sentences, and it comes with native audio:)
Just a tip in case this seems interesting to you


Have you tried Azure’s Neural TTS?

Presumably, Azure lets you specify furigana through SSML, but furigana alone is not enough; pitch has to be specified too.

It’s better to get native audio sentences, imo. (See, really really sleepy person above.) Relatively recently, I have found this Anki deck, but I am not sure how native it is.

The Azure TTS demo is broken, so I cannot test it w/o signing up and building a small app myself, but I am curious to see whether it makes the same mistake as Siri and Google in the sentence:

Thanks to AI models, most long-standing TTS problems have been solved, and nowadays developers can focus on more advanced language features, like perfecting pitch, which was not a priority 3-4 years ago from what I could gather. I chose Polly because it specifically focuses on pitch, and you can clearly hear it in the generated sentences. In addition, if Polly gets something wrong, it can be fixed with tags that specify word boundaries and pitch contours:
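As a rough illustration of the tagging idea, here is a minimal Python sketch that pins a word to a given reading before sending the sentence to Polly. It assumes Polly's SSML `phoneme` tag with `type="ruby"` for Japanese readings and the `boto3` client's `synthesize_speech` call. The helper names (`ruby`, `build_ssml`, `synthesize`), the sample sentence, and the choice of the `Mizuki` voice are all made up for the example and are not taken from the extension. Pitch contour tags are not covered here.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def ruby(word: str, reading: str) -> str:
    """Wrap a word in a phoneme tag that forces the given kana reading."""
    return f'<phoneme type="ruby" ph={quoteattr(reading)}>{escape(word)}</phoneme>'


def build_ssml(sentence: str, readings: dict[str, str]) -> str:
    """Return SSML for the sentence, with each listed word pinned to its reading.

    `readings` maps a surface form (e.g. "五月") to its kana reading.
    Keys are assumed to be non-empty strings.
    """
    if not readings:
        return f"<speak>{escape(sentence)}</speak>"
    # Longest words first, so a longer entry wins over a shorter one it contains.
    words = sorted(readings, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(w) for w in words) + ")")
    # With a capturing group, re.split puts the matched words at odd indices.
    parts = pattern.split(sentence)
    body = "".join(
        ruby(part, readings[part]) if i % 2 == 1 else escape(part)
        for i, part in enumerate(parts)
    )
    return f"<speak>{body}</speak>"


def synthesize(ssml: str, voice_id: str = "Mizuki") -> bytes:
    """Send SSML to Polly and return MP3 bytes (needs AWS credentials)."""
    import boto3  # third-party; imported lazily so the helpers above work without it

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=ssml, TextType="ssml", OutputFormat="mp3", VoiceId=voice_id
    )
    return response["AudioStream"].read()


if __name__ == "__main__":
    # Hypothetical sentence, used only to show the generated markup.
    print(build_ssml("五月に行きます。", {"五月": "ごがつ"}))
```

Only `synthesize` talks to AWS; `build_ssml` is plain string handling, so the corrections could be kept as a small per-sentence dictionary and reviewed by a proficient learner before any audio is generated.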

To manually pass all WK sentences thru Polly and add necessary tags would be a very involved process obviously and should be done by a proficient learner or a native speaker, but considering how basic most WK sentences are, Polly already does an amazing job w/o any extra labour.


Hmmm… works fine for me. Just click the link, scroll down, select a Japanese speaker, paste the text, and hit Play.

Nanami is one of the Azure Neural TTS voices. I’m not sure if the Edge version is the same Nanami, or possibly a slimmed down version. I haven’t looked into it.


Actually, I am not sure about this one. Vocabulary- and grammar-wise, probably not so much. But 交ぜ書き can be difficult to parse, for TTS too, because it differs from the usual texts written for adult native readers.

Though, of course they can be manually fixed, just like word boundaries.