Author Topic: Why does VA perform worse using mixing software? - cantabile/voicemeeter (Read 2173 times)

zL0ki · « **on:** February 11, 2021, 10:01:03 AM »

Hello.

As the title. Why does VA perform worse when using additional software like voicemeeter and cantabile?

Brief history - I've owned VA for many years but could never get it to recognise with any consistency, so I never used it for any length of time. This was always after much testing, reading advice etc. I'm pretty tenacious so would stick with it but eventually drop it. I'd been through boom mics and two Bluetooth mics with no improvement.

My latest attempt is with an XLR mic through a Focusrite USB audio adapter. The mic is the best I've had so far. With Voicemeeter and Cantabile I can get rid of all background noise, keyboard presses etc. Recording through Audacity gives a crystal clear rendition. It was the first time I'd heard my voice and not said 'eww do I really sound like that?!'

However, VA performs worse when the audio is piped through them, and I have read in posts that it doesn't play nicely with other sound/mixer software. Does anyone know why this is? After all, isn't the software just acting as a conduit? If the recordings are crystal clear, why does that not carry over when going through VA/speech recognition?

As an aside:
Using the microphone natively, so to speak, works with VA 60-70% of the time. I get similar results with both speech engines; however, the older engine comes with dictation support, which is handy. I've got it working better than in all my previous attempts over the years, so I'm encouraged. One annoyance is that it picks up keyboard noises and interprets them as commands (I have a mechanical keyboard), even with a minimum confidence of 80. Adjusting gain levels, tweaking the min unrecognised level etc. present their own problems. You end up trading success in one area for failure in another.

Interestingly or not, I did notice using Cantabile that my voice doesn't cover a broad range when I speak, compared to people I watched in video tutorials. Their voices would appear across much broader frequency ranges. I was wondering if it's harder for speech recognition software to pick up pronunciation when less range is used?

It's been fun getting into the programming side. That combined with using a remapper for a controller has whiled away many a free down time.

Stay safe

Gary · « **Reply #1 on:** February 11, 2021, 10:39:24 AM »

In simple terms, Windows speech recognition listens on a particular audio channel. If this channel is not clear, the speech engine is not going to be working optimally (if at all). Any additional software that interferes with the signal in that channel is going to have some level of effect. If the speech recognition is fine without extra software added, and not fine when extra software is added, the obvious point of failure is the extra software. There's not much that VA or the Windows speech engine (or any audio software for that matter) is going to be able to do when the audio signal is not up to par due to external sources. It's akin to asking why your amp works right if your guitar is plugged into it directly, but suddenly doesn't sound right if your guitar signal passes through pedals.

In addition, it's recommended that you use a high-quality, USB headset with noise cancellation when using speech recognition on Windows. Anything other than that, you're at the mercy of your hardware - desktop/boom mics pick up too much (which is exactly what they are designed to do), bluetooth headsets have audio quality issues in certain circumstances (there are other threads here on this) and analog headsets require *constant* settings fiddling to get it right. It's not a bad thing to use other types of microphones, you just need to know that there are limitations to those devices when it comes to speech recognition on a PC.

Hope that helps!

zL0ki · « **Reply #2 on:** February 11, 2021, 11:10:20 AM »

Thanks for response.

Trying to get my head around what you said. So the idea that the additional software makes the voice crystal clear is misleading in this setup, because Windows speech recognition is 'hard coded' to listen to a specific channel. The additional software just exacerbates the issues by muddying the water even more, so to speak.

Just to note the mic is clear without mixers. It just picks up the ambient noise as you mentioned. If anyone was to listen to it no one would have an issue understanding me. It's clear but with ambient background noise.

On the type of USB headset: is it better with one that has the mic extension over the mouth, or one with the mic built into the headset?

edit - reason I haven't gone for a headset type is because I have wanted to keep using my lovely sounding analogue headphones

thanks.

Pfeil · « **Reply #3 on:** February 11, 2021, 03:05:01 PM »

Anything you place in the audio path to the speech recognition engine can affect recognition.

By definition, filters remove information, so while it sounds clear to your ears, the speech recognition engine is not a human listener, and may work better when certain information is still present.

Not all headsets, USB or otherwise, are equally suited to speech recognition. Price is not necessarily an indication of suitability either.

The closer a microphone is placed to what you are recording, the less sensitive it needs to be, which should lower the amount of environmental noise it picks up, in theory (this also depends on the type of microphone, and what noise cancelling, if any, is applied in hardware).

Mechanical keyboard noise can be quite loud, and possibly overlaps with the frequency range of human speech, so filtering it may cut out useful information for the speech recognition engine as well. There is only so much you can do about that without affecting either the signal integrity of the audio, or the typing quality of your keyboard.
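To illustrate the point about filtering, here is a minimal Python sketch. It is not taken from any of the software mentioned in this thread; the signals, the period values and the function names are made up for illustration. A crude moving-average filter removes everything at a given frequency, regardless of whether that energy came from a key click or from a voice:

```python
import math

def moving_average(samples, window):
    """Crude low-pass filter: each output is the mean of the last `window` inputs."""
    out = []
    for i in range(window - 1, len(samples)):
        out.append(sum(samples[i - window + 1 : i + 1]) / window)
    return out

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

n = 256
# Hypothetical slow component (period 64 samples), largely kept by the filter.
low = [math.sin(2 * math.pi * i / 64) for i in range(n)]
# Hypothetical fast component (period 8 samples), standing in for a band
# that key clicks and parts of speech might share.
high = [0.5 * math.sin(2 * math.pi * i / 8) for i in range(n)]

filtered_low = moving_average(low, 8)
filtered_high = moving_average(high, 8)

print(f"slow component: {rms(low):.3f} before, {rms(filtered_low):.3f} after")
print(f"fast component: {rms(high):.3f} before, {rms(filtered_high):.3f} after")
```

The fast component is averaged away almost entirely while the slow one passes nearly unchanged. The filter has no way of knowing which source the fast component came from, which is why removing keyboard noise this way can also remove parts of the voice.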

An add-on boom-style microphone may work, e.g. an Antlion Modmic (this is not an endorsement, and I do not have personal experience with the product), if you're looking to keep an existing set of headphones (do note that headphones, especially but not only open-backed ones, can project sound into the environment, where it can be picked up by a microphone in relatively close proximity).

Lastly, training your speech recognition profile is very important when using the Microsoft Speech Recognition engine (rather than Speech Platform 11); your speech recognition profile is intended to be trained with a specific microphone, and at least three training sessions should be completed.
If you switch to a different microphone (or even a different physical environment, which may have new acoustic characteristics), you should create and train a new speech recognition profile. Instructions for doing so can be derived from this topic.

zL0ki · « **Reply #4 on:** February 12, 2021, 02:56:55 AM »

That does make sense.

I've been working with the training. This should help over time.

Outside of training, I noted that the dictionary section is global rather than profile specific. Unless I missed a step somewhere. Swapping profiles kept the same info.

I thought this could have potential conflicts if you used it overmuch, especially the excludes. Unless it stores the info under each profile but displays a global dictionary?

Pfeil · « **Reply #5 on:** February 12, 2021, 01:52:09 PM »

The dictionary is part of the Microsoft speech recognition system, so it applies to anything that uses the speech recognition engine, including VoiceAttack and the Windows Speech Recognition application.

zL0ki · « **Reply #6 on:** February 13, 2021, 03:10:03 AM »

I understand that. However, I believe it is a poor design choice to make the dictionary a global entity rather than profile specific, or at least not to offer the option to specify. The whole Windows speech recognition system and software is archaic.

For instance you might want to liberally omit offending words during a certain profile specific scenario. However, this may be very unfavourable during a more common setup. You would have to manually remove those words. Unless I have totally misunderstood what the dictionary function is for.

For completeness, for anyone who chances on this thread:

I discovered that VA is perfectly fine with Voicemeeter. The offender in my case was using VST plugins such as a noise gate. There is a noticeable delay with VA registering the mic as off and then back on when speech was used.
I failed to notice this during my trials as there was no clipped speech in my recordings. However, it didn't play nice with VA.
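As a rough sketch of how a gate can alter a signal in a way that is easy to miss by ear, assume a very simple gate that only opens after the level has stayed above a threshold for a few samples. The `noise_gate` function, the threshold and the sample values below are hypothetical, and this is not how any particular VST plugin is implemented:

```python
def noise_gate(samples, threshold, attack_samples):
    """Mute the signal until its level has stayed at or above `threshold`
    for more than `attack_samples` consecutive samples."""
    out = []
    above = 0  # how many consecutive samples have been over the threshold
    for s in samples:
        if abs(s) >= threshold:
            above += 1
        else:
            above = 0
        out.append(s if above > attack_samples else 0.0)
    return out

# Made-up signal: silence, a soft word onset, then the louder body of the word.
silence = [0.0] * 5
onset = [0.05, 0.08, 0.12]
body = [0.5, 0.6, 0.55, 0.5, 0.6]
signal = silence + onset + body

gated = noise_gate(signal, threshold=0.2, attack_samples=2)

# Count speech samples that were present in the input but muted by the gate.
lost = sum(1 for a, b in zip(signal, gated) if a != 0.0 and b == 0.0)
print(f"{lost} of {len(onset) + len(body)} speech samples were muted by the gate")
```

With these made-up numbers the gate mutes the soft start of the word and the first samples of its louder body. Whether this is what happened in the setup described here is not established; it is only one plausible way a gate could sound fine in a recording and still give a recogniser less to work with.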

I also looked more closely at the issues I was having with the failed words. They all shared a similar pronunciation. I swapped them out for words with pronunciations that were highly successful. It seems obvious on reflection, but it's new to have to analyse one's own pronunciation, especially as we now use the ubiquitous cloud-based assistants without as much scrutiny.

I also applied a pop filter to my XLR microphone. This made a difference. I'm now achieving a 90-100% success rate.

At some point I will replace the loud keyboard, but it's less of an issue with being able to bump up the confidence levels even higher.

One other observation. I had an issue with the word 'west' being recognised. I changed it to 'westerly'. Used it once then just used the word 'west' and it would still pick it up every time. I guess it forced the system to use the new sounds rather than what was stored in its memory for west? Or was it a one off?

The advice and feedback was much appreciated.

Pfeil · « **Reply #7 on:** February 13, 2021, 03:01:30 PM »

Quote from: zL0ki on February 13, 2021, 03:10:03 AM

I had an issue with the word 'west' being recognised. I changed it to 'westerly'. Used it once then just used the word 'west' and it would still pick it up every time. I guess it forced the system to use the new sounds rather than what was stored in its memory for west? Or was it a one off?

I suppose it could be possible. Does removing "westerly" from the dictionary have the inverse effect?

Unfortunately, Microsoft provides little information on the inner workings of the speech recognition engine, and with the release of Windows 8 a lot of the documentation that was available was removed.

zL0ki · « **Reply #8 on:** February 15, 2021, 03:58:11 AM »

I think it was just a case of reading more into it than it was. The fact it wasn't working at all before until I did the change was probably misleading. Some other factor could be involved.

I deleted all manually edited words for west, westerly. Success was still 100% with changing the word to either 'west' or 'westerly' in VA. It was still giving success if you just uttered 'west' but were using the word 'westerly'. It picked up enough of the word to make a good assumption.

If it does it again for another case I'll be more inclined to believe something was in it.
