Author Topic: Prevent recognition until "Recognition Global Hotkey" is released  (Read 4832 times)

Serge

  • Newbie
  • *
  • Posts: 12
Hello all,

In the application's general settings, I define a key that must be pressed to speak (Recognition Global Hotkey) with the option "VoiceAttack listens while keys are down".


The problem is that this key only opens or closes the microphone. As a result, if I pause in my dictation, even while holding down the key, the system tries to recognize what I'm saying.

What I'd like to do is prevent the system from trying to recognize what I'm saying until I release the key. This way, I can take my time speaking as long as I hold the button.

How can I do this?

Thanks.

Pfeil

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4792
  • RTFM
Re: Prevent recognition until "Recognition Global Hotkey" is released
« Reply #1 on: March 21, 2024, 01:55:08 AM »
The Microsoft speech recognition system will attempt recognition when it no longer detects audio input; that is not VoiceAttack-specific.


Are you attempting to speak (very) long commands, or is this in a dictation context (I.E. freeform text entry)?

Serge

  • Newbie
  • *
  • Posts: 12
Re: Prevent recognition until "Recognition Global Hotkey" is released
« Reply #2 on: March 21, 2024, 02:11:26 AM »
What I want to do is:

1) Press a key to tell VoiceAttack that I'm starting to speak.

2) As long as I hold down this key, speak, possibly pausing, thinking about what I'm saying, etc. While I'm doing this, I don't want recognition to start, because I haven't finished what I have to say.

3) When I release the key, send what I've said to speech recognition.

Maybe I should use "dictation mode" to do that? I don't understand exactly how it works or whether it could be the solution.

Pfeil

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4792
  • RTFM
Re: Prevent recognition until "Recognition Global Hotkey" is released
« Reply #3 on: March 21, 2024, 02:16:18 AM »
If this is to input arbitrary text, rather than commands, then yes, the dictation mode is what I would suggest. It uses a buffer, to which whatever is recognized while the dictation mode is active will be appended.
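
Roughly, a pair of commands using that buffer could look like this (a sketch only; the action names are paraphrased, so check the command editor and documentation for the exact actions and the {DICTATION} token):

Command "start taking notes"
    Start Dictation Mode

Command "stop taking notes"
    Stop Dictation Mode
    Quick Input : '{DICTATION}'    (types out whatever accumulated in the buffer)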


To be clear: the Microsoft speech recognition system determines for itself where input stops.
VoiceAttack's listening state determines whether what is recognized is acted upon or not; it makes no difference to the speech recognition system.

Serge

  • Newbie
  • *
  • Posts: 12
Re: Prevent recognition until "Recognition Global Hotkey" is released
« Reply #4 on: March 21, 2024, 03:46:18 AM »
So I understand that Windows speech recognition comes before VoiceAttack, but that VoiceAttack can accumulate what speech recognition gives it in a buffer before using it to evaluate a command.

How can I ensure that the same key is used to:
1) On press: open the microphone and switch to dictation mode for as long as the key is held down.
2) On release: close the microphone, end dictation mode, and send the buffer for processing.

Won't there be a key conflict between what I declare as the "Recognition Global Hotkey" and the same hotkey used in a command to start and stop dictation?


Pfeil

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4792
  • RTFM
Re: Prevent recognition until "Recognition Global Hotkey" is released
« Reply #5 on: March 21, 2024, 04:16:07 AM »
I'll ask again: are you attempting to input arbitrary text, E.G. to have it typed out into a document, I.E. not as a command, or are you attempting to speak a predefined command phrase in order to trigger a specific command?


The dictation mode is intended for the former, not the latter. While technically it's possible to have arbitrary text checked for a match to a command, it would need to be verbatim (excluding capitalization).
Dictation does not use predefined command phrases, and as such the chances of input exactly matching a predefined phrase consistently are not very high (unless you manually predefine many possible variations, which doesn't seem feasible for multiple commands).


If it is a command you're attempting to speak, do you have an example of it?

Serge

  • Newbie
  • *
  • Posts: 12
Re: Prevent recognition until "Recognition Global Hotkey" is released
« Reply #6 on: March 21, 2024, 04:55:32 AM »
First of all, thank you for taking the time to reply, I really appreciate it.

Here's the context:

I've written a plugin which, when VoiceAttack hasn't recognized what I'm saying, or has recognized it but with too low a confidence level, takes over the generated audio file and works with it. (I use the "Profile Exec" profile options to call commands that hand control over to my plugin.)

My plugin then works from the audio file (via vaProxy.Utility.CapturedAudio) rather than from what Windows speech recognition transcribed, because Windows speech recognition is crap.

If I say a short sentence, or a long one without interruption, everything works fine: VA doesn't recognize what I'm saying, and my plugin takes care of it. That's exactly what I want.

But if I speak a longer sentence and pause out of hesitation or to think, VA tries to process the first part of the sentence, doesn't recognize it, and passes it to my plugin before I've finished my sentence. As a result, my plugin doesn't have the user's whole sentence to work with...

So I'd like a way for VA not to try to find a matching command until the user has finished their sentence.
And when the sentence is actually finished, to have my plugin interpret the complete audio from vaProxy.Utility.CapturedAudio.
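
For context, the hand-off on my side looks roughly like this (a simplified sketch, not my real code; the helper name is made up, and the exact signature of CapturedAudio should be checked against the plugin documentation):

public static class RecognitionFallbackPlugin
{
    // Simplified sketch of the hand-off only; a real VoiceAttack plugin also
    // implements VA_Id, VA_DisplayName, VA_Init1, VA_Exit1, etc.
    public static void VA_Invoke1(dynamic vaProxy)
    {
        // Path to the audio VoiceAttack captured for the last utterance
        // (referenced as vaProxy.Utility.CapturedAudio above; verify the exact
        // member signature against the plugin documentation).
        string audioFile = vaProxy.Utility.CapturedAudio;

        if (string.IsNullOrEmpty(audioFile))
            return;

        // Hypothetical helper: send the .wav to an external speech-to-text
        // service and act on the transcription.
        string transcript = TranscribeExternally(audioFile);
        vaProxy.WriteToLog("Fallback transcription: " + transcript, "blue");
    }

    private static string TranscribeExternally(string wavPath)
    {
        // Placeholder for the actual speech-to-text call.
        return string.Empty;
    }
}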

So, in fact, dictation mode would need to:
1) Accumulate the audio, not only the transcribed string, and set vaProxy.Utility.CapturedAudio with it when finished.
2) Prevent VA from using this buffer to look for a matching command.

Does that give you a better picture of my problem?

Thanks for everything.

Pfeil

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4792
  • RTFM
Re: Prevent recognition until "Recognition Global Hotkey" is released
« Reply #7 on: March 21, 2024, 05:07:29 AM »
SemlerPDX created a plugin that uses the same principle. If I recall correctly, theirs combines multiple audio files for the final input.

It's open-source on GitHub, so you could have a look at it (taking note of the license as detailed in LICENSE.txt).

Serge

  • Newbie
  • *
  • Posts: 12
Re: Prevent recognition until "Recognition Global Hotkey" is released
« Reply #8 on: March 21, 2024, 05:34:16 AM »
Yes, that's what I thought I would do.

But since VoiceAttack is the "provider of the audio file", it would have made sense for it to handle this task itself. It could even add an "automatic dictation mode" option for when the user sets up a global key to listen while the key is pressed, because when we switch to push-to-talk mode, we may actually want VA to wait for the release before doing anything.

Maybe in a future VoiceAttack version?

SemlerPDX

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 292
  • Upstanding Lunatic
    • My AVCS Homepage
Re: Prevent recognition until "Recognition Global Hotkey" is released
« Reply #9 on: March 21, 2024, 01:19:27 PM »

Quote from: Serge on March 21, 2024, 04:55:32 AM
So, in fact, dictation mode would need to:
1) Accumulate the audio, not only the transcribed string, and set vaProxy.Utility.CapturedAudio with it when finished.
2) Prevent VA from using this buffer to look for a matching command.

Does that give you a better picture of my problem?

Thanks for everything.

1. You can combine audio files to account for the separate audio files generated by Dictation mode, so that your final evaluation processes just one audio file.  This would handle pauses in your speech, resulting in an audio file of all speech during the recording period, but it requires coding experience (ofc) - as stated, there are examples in my open source plugin (and a rough sketch follows the link below)... however, if you copy my code in large part for anything but personal use, I would appreciate it if you'd respect the license and include attribution and a copy of the same license (GPLv3).
https://github.com/SemlerPDX/OpenAI-VoiceAttack-Plugin/blob/master/OpenAI_VoiceAttack_Plugin/service/Dictation.cs
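
The gist of the concatenation step, stripped down (this is not the exact code from my plugin; it assumes the NAudio package and that all the .wav segments share the same format):

using System.Collections.Generic;
using NAudio.Wave;

public static class WavCombiner
{
    // Concatenates several .wav segments (e.g. one per dictation pause) into a single file.
    // Assumes every segment was recorded with the same sample rate, channel count, and bit depth.
    public static void Concatenate(IEnumerable<string> inputPaths, string outputPath)
    {
        WaveFileWriter writer = null;
        byte[] buffer = new byte[4096];

        try
        {
            foreach (string path in inputPaths)
            {
                using (var reader = new WaveFileReader(path))
                {
                    // Create the output writer from the first segment's format.
                    if (writer == null)
                        writer = new WaveFileWriter(outputPath, reader.WaveFormat);

                    // Append this segment's raw audio data to the output file.
                    int read;
                    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
                        writer.Write(buffer, 0, read);
                }
            }
        }
        finally
        {
            writer?.Dispose();
        }
    }
}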

2. Dictation transcription by WSR and VoiceAttack is hit-or-miss, a 'best guess' -- I'm trying to say it's terrible without being rude.  This is why I use Whisper for my audio file transcriptions.  Comparing raw dictation from VoiceAttack to a list of existing voice commands in order to execute them directly is not going to work out well for this reason - if VoiceAttack already didn't recognize a phrase due to low confidence or other reasons, a direct transcription of that same phrase merely has a chance of matching up with an existing command.  In that case, I'd "K.I.S.S." and just lower my confidence level a bit, and/or add more dynamic command phrase options - I use 50 for my default confidence level, and I have fully trained my Windows Speech Profile three times as recommended, with no issues.
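
As a rough illustration of the Whisper side (a bare HttpClient sketch against the OpenAI transcription endpoint, not code from my plugin; error handling and response parsing omitted):

using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public static class WhisperTranscriber
{
    // Sends a .wav file to OpenAI's transcription endpoint and returns the raw JSON response,
    // which contains the transcribed text.
    public static async Task<string> TranscribeAsync(string wavPath, string apiKey)
    {
        using (var client = new HttpClient())
        using (var form = new MultipartFormDataContent())
        using (var fileStream = File.OpenRead(wavPath))
        {
            client.DefaultRequestHeaders.Add("Authorization", "Bearer " + apiKey);

            // Multipart form fields expected by the endpoint: the audio file and the model name.
            form.Add(new StreamContent(fileStream), "file", Path.GetFileName(wavPath));
            form.Add(new StringContent("whisper-1"), "model");

            var response = await client.PostAsync("https://api.openai.com/v1/audio/transcriptions", form);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}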


A final note:  I described the concept of an advanced system which could (with high accuracy) make use of my OpenAI plugin and something called "Embeddings" to more loosely match what was said to existing commands.  By generating embedding vectors for each existing command phrase in the profile ahead of time, one could conceivably generate a new embedding vector for any new speech phrase captured (or transcribed from some audio file), compare it with those already on file, and, if a match is found, execute that command.

Embedding vectors are numeric representations of the context and meaning of text, and so are not strictly bound to the specific syntax of a phrase - if you have a command, "Turn on the lights", and compare its embedding with that of a sentence such as "Turn the lights on", the calculated cosine similarity would be extremely high (which could be set as a trigger to then execute the 'most similar' command).
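
The matching step itself is just vector math; as a sketch (assuming you already have the two embedding vectors, e.g. returned by the embeddings endpoint):

using System;

public static class EmbeddingMatch
{
    // Cosine similarity between two embedding vectors of equal length:
    // values near 1.0 mean very similar meaning, values near 0 mean unrelated text.
    public static double CosineSimilarity(double[] a, double[] b)
    {
        double dot = 0.0, magA = 0.0, magB = 0.0;

        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            magA += a[i] * a[i];
            magB += b[i] * b[i];
        }

        return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
    }
}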

This is very advanced, but the tools exist through my OpenAI Plugin, and I describe this concept in detail here:
https://forum.voiceattack.com/smf/index.php?topic=4519.msg21067#msg21067

...I also describe the concepts of using only the Whisper API through my OpenAI Plugin for utility and dictation purposes on the GitHub discussion board for my plugin here:
https://github.com/SemlerPDX/OpenAI-VoiceAttack-Plugin/discussions/6

Serge

  • Newbie
  • *
  • Posts: 12
Re: Prevent recognition until "Recognition Global Hotkey" is released
« Reply #10 on: March 23, 2024, 07:34:57 AM »
Hi,

Thank you for your answer.

Just to reassure you, I have been a developer for many years and I don't see any difficulty in combining .wav audio files. So, even though it is probably well done, I have no need to look at your code. Besides, when I have a coding problem, I turn to ChatGPT4 and usually find the solution in a few seconds...

A small note on this: if two people ask ChatGPT for an example of code to combine .wav files, both will get a very similar result that ultimately belongs to neither of them. And this applies to all subjects. So the notion of "code ownership" is very relative today. As everyone says, AI is going to change a lot of things...

My point was that, since the "push to talk" feature is offered in VA, it is easy to imagine it being paired with an option that suspends recognition processing until the key is released. It seems so logical to me that at first I thought that was the case, and it was with much surprise that I realized it was just an ON/OFF for the microphone (especially since the option is called "Recognition Global Hotkey," which is misleading).

Regarding Windows speech recognition, as I mentioned earlier, it is indeed terrible. It remains useful for VA to find a match with commands from a profile (and even then...), but beyond that... So I do speech-to-text with OpenAI from the captured audio and do not use the pre-transcribed text.

As for matching a command entered in a "When I say..." field of a VA profile against a database of "float vectors," I find it complicated while still being limiting. Good luck anyway ;D

Thank you anyway for your advice.

SemlerPDX

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 292
  • Upstanding Lunatic
    • My AVCS Homepage
Re: Prevent recognition until "Recognition Global Hotkey" is released
« Reply #11 on: March 23, 2024, 12:43:41 PM »
Quote from: Serge on March 23, 2024, 07:34:57 AM
Just to reassure you, I have been a developer for many years and I don't see any difficulty in combining .wav audio files. So, even though it is probably well done, I have no need to look at your code. Besides, when I have a coding problem, I turn to ChatGPT4 and usually find the solution in a few seconds...

A small note on this: if two people ask ChatGPT for an example of code to combine .wav files, both will get a very similar result that ultimately belongs to neither of them. And this applies to all subjects. So the notion of "code ownership" is very relative today. As everyone says, AI is going to change a lot of things...

Glad to hear you are already proficient and comfortable coding your own things. I did not want to assume, and I provide examples to many people regardless, in the event they may be helpful.

Regarding your note about code ownership, I feel I should remind you that while a code snippet example provided by a ChatGPT bot does not necessarily belong to the ChatGPT bot which suggested it, or to the user who prompted the ChatGPT bot to provide it, once that snippet is integrated into the user's codebase, ownership is established.  Its integration into the user's codebase transforms it into a new creation, subject to the user's ownership and licensing terms.  This doesn't mean the snippet itself is now owned and can never be copied or used without violating the license; the license applies to its specific implementation within their application.

This is why I stated that if in large part my code was used for anything but personal use, I'd appreciate respecting my GPLv3 license on my codebase and to include attribution and a copy of the license in accordance with the terms of that license.

This does not apply to personal use, or to minor functions like a line or two of code so generic they could belong to any codebase (or be suggested by something like ChatGPT in a code example).  The keywords are "in large part" and "personal use".

Quote from: Serge on March 23, 2024, 07:34:57 AM
Regarding Windows speech recognition, as I mentioned earlier, it is indeed terrible. It remains useful for VA to find a match with commands from a profile (and even then...), but beyond that... So I do speech-to-text with OpenAI from the captured audio and do not use the pre-transcribed text.

As for matching a command entered in a "When I say..." field of a VA profile against a database of "float vectors," I find it complicated while still being limiting. Good luck anyway ;D

Thank you anyway for your advice.

Excellent to hear you're also finding Whisper speech-to-text as valuable as I do for better dictation!  It's definitely a game changer, and it will quite likely be able to run locally within 5 years or so - on a well-performing PC, this could save even more valuable milliseconds, as the true power of voice command interaction relies on the lowest possible latency between what we say and actions taking place.

I totally get that the Embeddings concept is complicated, and it is.  That said, imho it is one of the most powerful tools we can access through the OpenAI API.  I may not have explained the concept well, as again it is a very complicated system - however, it is not as limiting as it may seem, and that's the entire point.  For matching raw speech audio to voice commands (which, to be fair, should have executed naturally anyway), it would be more effective than any other system, given the conceptual matching that embedding vectors allow.

What I mean is this: if you issued a command that was not recognized, either because you mumbled one of the keywords or because the recognition engine fouled up and misrecognized a keyword, an embeddings vector database would not be so limited.  It could easily match what you thought you had said (but was misrecognized), because the Whisper API has a better chance of converting the audio from your last command attempt into the proper command phrase that matches 100% the command you wanted to issue - and if not, finding the most similar command in the existing embeddings vector database would be quite easy and accurate, given that these matches do not need to be word-for-word but are based on the meaning and context of the text provided for comparison.

If you don't understand how powerful this is, or why it is not limiting albeit complicated, the failing is on me for not explaining it well.  This is a proven concept; I have in fact moved on from proof-of-concept tests to the design and development phase of my next application, which will offer this functionality (both standalone and for VoiceAttack integration through a plugin interface).  In the immortal words of Todd Howard, "It just works".  ;D