Just to reassure you, I have been a developer for many years and I don't see any difficulty in combining .wav audio files. So, even though it is probably well done, I have no need to look at your code. Besides, when I have a coding problem, I turn to ChatGPT-4 and usually find the solution in a few seconds...
A small note on this: if two people ask ChatGPT for an example of code to combine .wav files, both will get a very similar result that ultimately belongs to neither of them. And this applies to all subjects. So the notion of "code ownership" is very relative today. As everyone says, AI is going to change a lot of things...
Glad to hear you are already proficient and comfortable coding your own things. I did not want to assume, and I provide examples to many people regardless, in the event they may be helpful.
Regarding your note about code ownership, I feel I should remind you that while a code snippet example provided by a ChatGPT bot does not necessarily belong to the ChatGPT bot which suggested it, or to the user who prompted it, once that snippet is integrated into the user's codebase, ownership can attach. Its integration into the user's codebase transforms it into part of a new creation, subject to the user's ownership and licensing terms. In practice, this doesn't mean the snippet itself is now owned and can no longer be copied or used without violating the license; the license applies to its specific implementation within that user's application.
This is why I stated that if, in large part, my code was used for anything but personal use, I'd appreciate you respecting the GPLv3 license on my codebase and including attribution and a copy of the license in accordance with its terms. This does not apply to personal use, or to minor functions like a line or two of code so generic it could belong to any codebase (or be suggested by something like ChatGPT in a code example). The keywords are "in large part" and "personal use".
Regarding Windows speech recognition, as I mentioned earlier, it is indeed terrible. It remains useful for VA to find a match with commands from a profile (and even then...), but beyond that... So I do speech-to-text with OpenAI from the captured audio, and I do not use pre-transcribed text.
As for retrieving a command entered in a "When I say..." field of a VA profile using a database of "float vectors," I find it complicated while still being limiting. Good luck anyway.
Thank you anyway for your advice.
Excellent to hear you're also finding Whisper speech-to-text as valuable as I do for better dictation! It's definitely a game changer, and it will quite likely be able to run locally within 5 years or so - on a well-performing PC, that could save even more valuable milliseconds, as the true power of voice command interaction relies on the lowest possible latency between what we say and actions taking place.
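Just to illustrate for anyone reading along, here's a minimal sketch of the kind of transcription call I'm describing, using the OpenAI Python SDK. The file name is a placeholder for whatever your recorder writes out, not a description of anyone's actual setup:

```python
# Minimal sketch: send a captured audio clip to the Whisper API for transcription.
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY in the environment.
# "capture.wav" is a placeholder file name, not a real path from my application.
from openai import OpenAI

client = OpenAI()

with open("capture.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)  # e.g. "lower the landing gear"
```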
I totally get that the Embeddings concept is complicated, and it is. That said, imho it is one of the most powerful tools we can access through the OpenAI API. I may not have explained the concept well, as again it is a very complicated system - however, it is not as limiting as it might seem, and that's the entire point. For the task of matching raw speech audio to voice commands (which, to be fair, should have executed naturally anyway), it would be more beneficial and more performant than any other system, given the conceptual matching that Embeddings vectors allow.
What I mean by this is: if you issued a command that was not recognized, because you either mumbled one of the keywords or the recognition engine fouled up and misrecognized a keyword, an Embeddings vectors database would not be so limited, and could easily match up with what you thought you had said. The Whisper API has a better chance of converting the audio from your last command attempt into the proper command phrase, which would match 100% to the command you had wanted to issue - and if it doesn't, finding the most similar command in the existing Embeddings vectors database is quite easy and accurate, because these matches do not need to be word-for-word; they are based on the meaning and context of the text being compared.
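To make that concrete, here is a rough sketch of the whole loop under some stated assumptions: the command phrases, the embedding model name, and the similarity threshold below are all made-up examples for illustration, not my actual implementation.

```python
# Rough sketch: embed a profile's command phrases once, then match whatever
# Whisper heard against them by meaning (cosine similarity), not exact wording.
# Assumes the openai Python package (v1+) and numpy; all phrases and values
# below are illustrative placeholders.
from openai import OpenAI
import numpy as np

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # example model choice

# 1) Build the "Embeddings vectors database" up front from the profile commands.
command_phrases = [
    "lower the landing gear",
    "raise the landing gear",
    "request docking permission",
]
response = client.embeddings.create(model=EMBED_MODEL, input=command_phrases)
command_vectors = np.array([item.embedding for item in response.data])


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# 2) At runtime, embed whatever Whisper transcribed (even if loosely phrased)
#    and find the most similar known command.
heard = "put the landing gear down"  # a mumbled / loosely phrased attempt
heard_vec = np.array(
    client.embeddings.create(model=EMBED_MODEL, input=heard).data[0].embedding
)

scores = [cosine_similarity(heard_vec, vec) for vec in command_vectors]
best = int(np.argmax(scores))

# 3) Only fire the command when the match is reasonably confident.
if scores[best] > 0.5:  # threshold is a guess; you'd tune it on real data
    print(f"Matched: {command_phrases[best]} (score {scores[best]:.2f})")
else:
    print("No confident match; fall back to asking the user to repeat.")
```

The point being that "put the landing gear down" never has to match "lower the landing gear" word-for-word; the vectors carry the meaning.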
If you don't understand how powerful this is, or why it is complicated but not limiting, the failing is on me for not explaining it well. This is a proven concept; in fact, I have moved on from proof-of-concept tests to the design and development phase of my next application, which will offer this functionality (both standalone and for VoiceAttack integration through a plugin interface). In the immortal words of Todd Howard, "It just works".