Author Topic: Casual Musings on the Evolutions of TTS Technology  (Read 2591 times)

Mike308

  • Newbie
  • *
  • Posts: 48
Casual Musings on the Evolutions of TTS Technology
« on: April 01, 2018, 12:25:41 PM »
So I just read the latest blurb on the ongoing evolution of AI-driven cloud TTS.
https://www.theverge.com/2018/3/27/17167200/google-ai-speech-tts-cloud-deepmind-wavenet

I realize that a cloud solution is NOT the same as the speech engine internal to the Windows OS. What I'm pondering for the moment is whether, given the fast processing times noted in the article, the time lag to process speech online would be acceptable to an end user for all but the shortest response loops. To clarify:

If I give a combat command like "fire missile," I want an instantaneous reply like "Missiles away!" Those replies are typically very short and fixed in content, so users wanting lifelike voices have voicepack options that sound beautiful.

If, however, I want the weather forecast for a distant planet, an update on the economy, or just about anything like a Tony Stark 'chat with Jarvis,' then I hit the fork we all know: I cannot have canned (recorded) responses for everything, and in-engine TTS is just a bit stilted, even with things like the SSML editing I've been doing.
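For anyone curious what I mean by SSML editing, here's a rough, purely illustrative sketch in Python: the prosody, break, and emphasis tags are standard SSML, but the speak_ssml() helper is just a stand-in for whatever engine or API actually accepts the markup.

    # Illustrative SSML: slow the rate slightly, add a pause, emphasize a word.
    forecast_ssml = """
    <speak>
      <prosody rate="95%">
        The forecast for the destination is
        <break time="300ms"/>
        <emphasis level="moderate">unusually calm</emphasis> today.
      </prosody>
    </speak>
    """

    def speak_ssml(ssml: str) -> None:
        # Stand-in: hand the markup to an SSML-capable TTS engine here.
        print(ssml)

    speak_ssml(forecast_ssml)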

Would I notice a one- or two-second processing delay in a response if I posed a non-combat question? Probably not.

Admittedly, this is approaching the "would we want to" side of the question, not the "can we do it" side. The latter is way above my technology pay grade. But if there were a compelling use case in support of such a capability, would it be within the realm of the possible?

(Note: there may be things like licensing or 'per-word costs' to use an online engine that have not been considered in the above.)

I'd love to know what the Rocket Scientists of Voice Attack think about the direction TTS is evolving, and whether the in-engine TTS will ultimately languish in favor of commercial focus on the cloud.

Rhaedas

  • Jr. Member
  • **
  • Posts: 72
Re: Casual Musings on the Evolutions of TTS Technology
« Reply #1 on: April 01, 2018, 04:39:41 PM »
The simple way to mask the time it takes to access the cloud, process, and return the results is to cover most of that time with a confirmation dialogue while it's happening: the stock "Yes, sir, I'll look that up now," or, for a JARVIS effect, some sarcastic backtalk while it works. Perhaps even parse whether there are multiple parts to the request that can be divided up, and start playing the first sections while the rest is still being processed. And lastly, a method of recovering gracefully, TTS-wise, from an unusually long wait time or failure.
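A rough sketch of that masking pattern in Python (play_canned(), synthesize_in_cloud(), and play_audio() are all stand-ins for whatever actually plays clips and calls the cloud engine; the sleep just simulates network latency):

    import concurrent.futures
    import time

    def play_canned(phrase: str) -> None:
        # Stand-in: play a locally stored confirmation clip immediately.
        print(f"[local voice] {phrase}")

    def synthesize_in_cloud(text: str) -> bytes:
        # Stand-in: send the text to the cloud TTS engine and return audio.
        time.sleep(1.5)  # simulated round-trip time
        return b"pretend-audio-bytes"

    def play_audio(audio: bytes) -> None:
        # Stand-in: play the synthesized audio.
        print(f"[cloud voice] ({len(audio)} bytes)")

    def answer(question: str, timeout_s: float = 5.0) -> None:
        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
            # Start the cloud request first, then talk over the wait.
            future = pool.submit(synthesize_in_cloud, question)
            play_canned("Yes, sir, I'll look that up now.")
            try:
                play_audio(future.result(timeout=timeout_s))
            except concurrent.futures.TimeoutError:
                # Recover gracefully from an unusually long wait.
                play_canned("That's taking longer than expected, sir.")
            except Exception:
                # Recover gracefully from a failed request.
                play_canned("I'm afraid I couldn't retrieve that.")

    answer("What's the weather forecast for the next system?")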

But part of your question, and I think a bigger challenge, is more about speech recognition and parsing what's being asked for. Just look at many posts on here; often it's about how to get better recognition of the words. If the engine doesn't understand what you want, the end problem of TTS is a bit moot.

One interesting TTS solution for the common phrases, which someone here or on one of the Elite boards did, was to store results from the cloud engine they were querying: the first time a new phrase was needed it was sent to the cloud, but from then on, if it was found in the database, the stored response was played. This was done to audibly play NPC text responses. Combined with the other techniques, you could hide what was instant and what wasn't, and after a while not need the cloud so much.
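A bare-bones version of that caching idea in Python (cloud_synthesize() and play_audio() are stand-ins for the real engine call and playback):

    import hashlib
    from pathlib import Path

    CACHE_DIR = Path("tts_cache")
    CACHE_DIR.mkdir(exist_ok=True)

    def cloud_synthesize(text: str) -> bytes:
        # Stand-in: query the cloud TTS engine and return the audio it produces.
        return b"pretend-audio-bytes"

    def play_audio(audio: bytes) -> None:
        # Stand-in: play the audio through whatever output is in use.
        pass

    def speak(text: str) -> None:
        # Key the cache on a hash of the exact phrase.
        key = hashlib.sha1(text.encode("utf-8")).hexdigest()
        cached = CACHE_DIR / f"{key}.audio"
        if cached.exists():
            # Phrase already seen: play the stored response, no cloud round trip.
            audio = cached.read_bytes()
        else:
            # New phrase: send it to the cloud once, then store it for next time.
            audio = cloud_synthesize(text)
            cached.write_bytes(audio)
        play_audio(audio)

    speak("Frame shift drive charging.")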