Author Topic: VoiceAttack in the age of AI  (Read 2998 times)

Mike308

  • Newbie
  • *
  • Posts: 48
VoiceAttack in the age of AI
« on: May 01, 2023, 03:49:19 PM »
Hi Gary,
I have had tremendous success developing my own AI text-to-speech voice through an online resource. It takes 10-15 seconds to respond, but I am guessing that the lag time will get shorter, and several AI tools can now be downloaded and run locally, which would likely pick up the pace quite a bit.

Which brings me to my question: have you looked into how VoiceAttack might serve as a structured context engine for function-driven voice-to-voice communication with the computer?  i.e., I ask a question, VoiceAttack determines whether it can be handled with a canned response or some action, or whether it needs to be fed into some GPT to be answered in what sounds like a live voice, with no need for SSML tweaking.  Just curious, but it feels like just about all of the necessary Legos are on the table, needing only to be assembled.

Best always,
Mike

RanmaKei

  • Newbie
  • *
  • Posts: 1
Re: VoiceAttack in the age of AI
« Reply #1 on: November 08, 2023, 10:04:15 PM »
Absolutely agree with you on this. AI is being integrated into everything, and you can now run your own language models locally using Llama 2. If you leverage Llama 2, you can integrate a language model into VoiceAttack that feels natural and will pick up all the different phrasings you can think of for voice commands.

SemlerPDX

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 264
  • Upstanding Lunatic
    • My AVCS Homepage
Re: VoiceAttack in the age of AI
« Reply #2 on: November 09, 2023, 11:56:43 AM »
Most users do not have a system capable of such AI processing in an acceptable timeframe.  The long return time you see from online AI text-to-speech generation comes from the limited resources the service assigns to each task, limits the service imposes so it can serve the many people using the tool at once.

Local AI processing will remain out of reach for that reason for some time, and presently, accessing AI tools through APIs gets us responses in mere seconds, versus what could be much longer with local AI processing, depending on the local hardware.

That being said, such local hardware would need to be extremely high-end for a local implementation of Llama 2:

Quote
A high-end consumer GPU, such as the NVIDIA RTX 3090 or 4090, has 24 GB of VRAM. If we quantize Llama 2 70B to 4-bit precision, we still need 35 GB of memory (70 billion * 0.5 bytes). The model could fit into 2 consumer GPUs.

With GPTQ quantization, we can further reduce the precision to 3-bit without losing much in the performance of the model. A 3-bit parameter weighs 0.375 bytes in memory. Llama 2 70B quantized to 3-bit would still weigh 26.25 GB. It doesn’t fit into one consumer GPU.
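The arithmetic in that quote is easy to verify for yourself. A quick sketch (the only inputs are parameter count and bits per parameter; "GB" here means 10^9 bytes, as in the quote):

```python
def quantized_size_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold just the weights, in gigabytes."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9

# Llama 2 70B at 4-bit precision: 70e9 params * 0.5 bytes = 35 GB
print(quantized_size_gb(70e9, 4))   # 35.0
# At 3-bit precision: 70e9 params * 0.375 bytes = 26.25 GB
print(quantized_size_gb(70e9, 3))   # 26.25
```

Note this counts weights only; the KV cache and activations push the real requirement higher still, which only strengthens the point about consumer hardware.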

For now, the best way to leverage AI systems in VoiceAttack is through API access to existing online tools, such as OpenAI and others.  Just recently, OpenAI released a new Text-to-Speech tool in open beta, offering several extremely human-like voices at a fairly affordable price for the quality on offer.
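For anyone curious what that API access looks like under the hood, here is a rough sketch of one TTS call. A real plugin would do this in C#, but the request shape is the same; this assumes the standard REST endpoint with bearer-token auth, and the voice name, helper names, and output path are illustrative:

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/audio/speech"  # OpenAI TTS endpoint

def build_tts_request(text: str, api_key: str,
                      voice: str = "alloy") -> urllib.request.Request:
    """Assemble the HTTP request for one TTS call (no network traffic yet)."""
    body = json.dumps({"model": "tts-1", "voice": voice, "input": text}).encode()
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

def speak(text: str, api_key: str, out_path: str = "reply.mp3") -> str:
    """Send the request and write the returned MP3 bytes to disk."""
    with urllib.request.urlopen(build_tts_request(text, api_key)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
    return out_path
```

The audio comes back as raw bytes, so the caller just saves them and hands the file to whatever playback the profile uses.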

In addition to this, the Microsoft Azure Cognitive Voice services can also be accessed through VoiceAttack using this plugin by jamescl604.


Unfortunately, program development is not as simple as cobbling together a bunch of parts like assembling a Lego creation, and presently local AI processing is beyond the reach of all but the highest-end PCs, so implementing these new technologies into VoiceAttack as a native function is not as simple as 'insert tab A into slot A'.

For function-driven voice communications with the computer, where input is first evaluated either to perform an action or provide a pre-programmed response, or else to be fed to ChatGPT to formulate a response, I have already developed and released the OpenAI API Plugin for VoiceAttack.  It provides access to nearly all OpenAI functions, including Whisper speech-to-text, ChatGPT, DALL-E image generation/editing, and Completions, as well as Fine-Tuning and Embeddings (which can be used for extremely powerful customizations and database interactions).
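That determination step (canned response or action vs. hand-off to the model) boils down to a simple dispatch check. A hypothetical sketch, with a made-up command table and a stand-in for the real API call:

```python
# Hypothetical dispatcher: known phrases get canned replies,
# everything else is forwarded to a chat model for a free-form answer.
CANNED = {
    "lower the landing gear": "Gear down.",
    "what is my heading": "Heading is two-seven-zero.",
}

def route(utterance: str, ask_gpt) -> str:
    """Return a canned reply if one matches, otherwise defer to the model."""
    key = utterance.strip().lower()
    if key in CANNED:
        return CANNED[key]
    return ask_gpt(utterance)

# Example with a stand-in for the real API call:
print(route("Lower the landing gear", lambda q: "(model answer)"))  # Gear down.
print(route("Tell me a joke", lambda q: "(model answer)"))          # (model answer)
```

In practice the first branch is VoiceAttack's existing command matching doing what it already does well, and only unmatched speech pays the latency cost of an API round trip.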

The new Assistants API (also in beta) by OpenAI is a very promising feature, and it will take a lot of the workload off developers such as myself who are using current OpenAI tools to approximate such features already.  It can be customized to execute pre-defined functions such as API calls (i.e. OpenWeather) based on user input while it provides customized responses to those inputs, as well as handle other specific tasks such as file handling.  It's so new that I'm still wrapping my head around all it can offer and how it can be used.  Once it is out of beta, I will include access to this API through my OpenAI Plugin for VoiceAttack, as well as any new models not presently supported by my plugin.
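For anyone curious how those pre-defined functions are wired up: tools are declared as JSON schemas that the model can choose to call, while our own code still performs the actual lookup. A minimal, hypothetical declaration for the OpenWeather example (the function name and parameters here are made up for illustration):

```python
# Hypothetical tool declaration in the JSON-schema shape used by
# OpenAI function calling; the model decides when to invoke it and
# returns the arguments, and our code performs the real API call.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Look up current conditions via OpenWeather.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "units": {"type": "string", "enum": ["metric", "imperial"]},
            },
            "required": ["city"],
        },
    },
}
```

When the model responds with a call to `get_current_weather` and its arguments, the plugin runs the real request and feeds the result back so the spoken answer can include live data.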



Right now, you can check out that awesome plugin by jamescl604 to get very realistic text-to-speech in VoiceAttack today, or you could develop a plugin which takes advantage of the Llama 2 API (if it has one) or the latest text-to-speech beta API from OpenAI.  Definitely check out my OpenAI Plugin for VoiceAttack, and my AVCS CHAT profile, which allows for verbal conversations with ChatGPT where responses are spoken back to users with the local system's text-to-speech voice, any code blocks are opened in a notepad for manual review, etc.
« Last Edit: November 09, 2023, 02:41:48 PM by SemlerPDX »

Gary

  • Administrator
  • Hero Member
  • *****
  • Posts: 2800
Re: VoiceAttack in the age of AI
« Reply #3 on: November 14, 2023, 10:02:20 AM »
Just a heads up and to let folks know that this thread is not being ignored - I am in collaborative talks with other parties, exploring the future of AI within VoiceAttack.  I'm not at the point just yet where any of this can be discussed, as nothing is firm.  However, the prospect of AI both within the software itself and what can be done in conjunction with VoiceAttack is very promising.  Stay tuned!