Author Topic: Speaking numbers  (Read 15156 times)

Pfeil

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4786
  • RTFM
Speaking numbers
« on: May 21, 2021, 01:34:53 PM »
This post will attempt to provide the means to create commands that recognize ranges of spoken numbers.


More general information on the mentioned features is available in the official documentation. If you are not familiar with a given feature, read the documentation before proceeding.

Press F1 while VoiceAttack has focus to open VoiceAttackHelp.pdf in your default PDF viewer, use the "VoiceAttackHelp.pdf" link in the start menu (for the website version), or navigate to VoiceAttack's installation directory and open the file directly.

Most PDF viewers have a search function (try pressing Ctrl-F on your keyboard).


Using Dynamic Command Sections

Dynamic command sections allow you to specify multiple spoken options for part of a command phrase, without having to repeat the entire phrase, as would be necessary when only using semicolons to specify multiple phrases for a single command.

E.G. rather than "I want to ride my bicycle;I want to ride my bike", you can instead use "I want to ride my [bicycle;bike]"


Important to note is that spaces are inserted automatically before and after dynamic command sections, provided they are not at the beginning or end of a phrase.
In addition, consecutive spaces in a command phrase are automatically reduced to a single space.

E.G. "        word       [    section             1][section 2       ]     " would be equivalent to "word section 1 section 2"

As a consequence, this means that dynamic sections generally cannot start or end in the middle of a word.
E.G. "I want to ride my bi[cycle;ke]" would result in the phrases "I want to ride my bi cycle" and "I want to ride my bi ke", which can affect the way the speech recognition engine recognizes them.
This may not be as important for numbers, as mentioned later on, but still worth keeping in mind.
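
To make the spacing rules above more concrete, here is a rough Python model of how a phrase with dynamic sections could be expanded. This is only a sketch based on the behavior described in this post, not VoiceAttack's actual implementation; the function name "expand_phrase" is made up for illustration, and numeric ranges are not handled here.

```python
import itertools
import re


def expand_phrase(phrase: str) -> list[str]:
    """Expand [a;b] sections into every phrase variation (simplified model)."""
    # Split into static text and bracketed sections, keeping the brackets.
    parts = re.split(r"(\[[^\]]*\])", phrase)
    choices = []
    for part in parts:
        if part.startswith("[") and part.endswith("]"):
            # Dynamic section: one option per semicolon-separated entry.
            choices.append(part[1:-1].split(";"))
        else:
            # Static text: a single fixed option.
            choices.append([part])
    variations = []
    for combo in itertools.product(*choices):
        # Joining with spaces models the automatic space around sections;
        # split() followed by join() collapses repeated spaces and trims the ends.
        variations.append(" ".join(" ".join(combo).split()))
    return variations


print(expand_phrase("I want to ride my [bicycle;bike]"))
print(expand_phrase("I want to ride my bi[cycle;ke]"))
```

The second call shows why sections should not start in the middle of a word: the model produces "bi cycle" and "bi ke", just as described above.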


Dynamic command sections support numeric ranges, which allow you to specify two numbers, separated by two dots (note two, not three), and have VoiceAttack automatically generate phrases for every number between them.
E.G. "[1..10]" will generate the equivalent of "1;2;3;4;5;6;7;8;9;10", allowing you to speak any of those numbers to trigger the command, without needing to manually specify every single one.
Note that the order doesn't matter; you could enter "[10..1]" and it would have the same effect.

Numeric ranges also support an optional multiplier, placed after the second number and separated by a comma.
E.G. "[1..10,5]", would generate the equivalent of "5;10;15;20;25;30;35;40;45;50"
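
The range and multiplier rules can be sketched in Python as follows. The function name "expand_numeric_range" is hypothetical and the parsing is deliberately minimal (non-negative whole numbers only); it models the behavior described above rather than VoiceAttack's own code.

```python
import re


def expand_numeric_range(section: str) -> list[str]:
    """Model of a numeric range section such as "[1..10]" or "[1..10,5]"."""
    match = re.fullmatch(r"\[(\d+)\.\.(\d+)(?:,(\d+))?\]", section.strip())
    if match is None:
        raise ValueError(f"not a numeric range: {section!r}")
    first, second = int(match.group(1)), int(match.group(2))
    # The multiplier is optional; without it each number is used as-is.
    multiplier = int(match.group(3)) if match.group(3) else 1
    # The order of the two numbers does not matter.
    low, high = min(first, second), max(first, second)
    return [str(n * multiplier) for n in range(low, high + 1)]


print(";".join(expand_numeric_range("[1..10]")))    # 1;2;3;4;5;6;7;8;9;10
print(";".join(expand_numeric_range("[1..10,5]")))  # 5;10;15;20;25;30;35;40;45;50
```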


Dynamic command sections can be a very convenient option for speaking numbers as part of a command phrase, in a fluid and reliable manner.
However, this convenience also makes it possible to very quickly generate very large amounts of phrase variations, intentionally or not. This will increase the time a profile takes to load and, at a certain point, cause either VoiceAttack itself (in case of the 32bit version, which can normally only use around 4GB of memory; this is not a VoiceAttack-specific limitation) or the speech recognition engine (in case of the 64bit version) to run out of memory and crash.

Note that this would occur at the point the profile is loaded, or when the amount of phrase variations is calculated (E.G. when saving a command). After a profile is loaded, memory usage will decrease dramatically, and VoiceAttack generally does not use large amounts of memory during normal use.

The point at which the maximum allowable memory usage is exceeded is not feasible to quantify exactly, as it would depend on the length of your command phrases, the amount generated, the other commands in your profile, and other factors.
Though, unless your computer has less than 8GB of memory, it should not be affected by the specifications of your hardware.

All of that said, some user-created profiles contain several million phrases, so you are unlikely to run into this limit under normal circumstances, as long as you don't generate what you don't need.


What that means in practice, is that you'll want to be aware of how many phrases your commands generate, and try to only generate the phrases you actually need.

As mentioned, the amount of phrases is not the only factor that can affect how much memory is used, however it can still serve as a rough indication.
If your command will generate over 100 phrase variations, VoiceAttack will notify you of this when saving the command.
The total amount of phrase variations generated by all your commands is shown when hovering your mouse cursor over the "commands" label on the "Edit a Profile" window. If you have recently made changes to your commands, you may need to save the profile and reopen the window for those changes to be calculated (this will be noted when hovering over the "commands" label).


As an example, a fairly common use of numbers in command phrases is entering radio frequencies for flight simulation (though this example will show techniques that apply to many scenarios and target applications outside of that niche).

E.G. you want to tune a VHF radio to a standard communications frequency.

This could be done using the "When I say" input "set radio frequency [118..136].[0..999]", which generates 19000 variations

You may note that this will insert a space on both sides of the decimal point, resulting in phrases like "set radio frequency 118 . 0", however in my experience the speech recognition engine will still recognize such phrases, E.G. when speaking "set radio frequency one hundred-eighteen point zero", "set radio frequency one hundred and-eighteen point zero", "set radio frequency one-eighteen point zero", or even "set radio frequency one one eight point zero" (numbers can usually be spoken as separate digits, regardless of whether they have spaces between them in the predefined phrase)

However, while the first numeric range is correctly reduced to only generate numbers from 118 to 136 (as the normal frequency range goes from 118Mhz to 136Mhz), the values after the decimal point are not.

Given that these frequencies are normally spaced 25Khz apart, E.G. 118.000Mhz, 118.025Mhz, 118.050Mhz, etc..., there should be no need for all of the values in-between them, and the maximum value is 136.975, not 136.999.

Therefore, the "When I say" input can be changed to "set radio frequency [118..136].[0..39,25]", as 975 / 25 = 39. This only generates 760 phrase variations
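
The phrase counts above are simply the sizes of the two ranges multiplied together. A quick Python check of that arithmetic (the variable names are only for illustration):

```python
def range_size(first: int, second: int) -> int:
    """Number of values in an inclusive numeric range such as [118..136]."""
    return abs(second - first) + 1


mhz_values = range_size(118, 136)                # 19 whole-Mhz values
naive_total = mhz_values * range_size(0, 999)    # "[118..136].[0..999]"
reduced_total = mhz_values * range_size(0, 39)   # "[118..136].[0..39,25]"

# The multiplier turns 0..39 into steps of 25, ending at 975.
khz_steps = [n * 25 for n in range(0, 40)]

print(naive_total, reduced_total, khz_steps[-1])  # 19000 760 975
```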


Now we come to the practical matter of actually using these values in a command.

Depending on how you intend to use the numbers you have spoken, you may want to extract the Mhz and Khz sections separately.
The simplest way of doing so is to use the "{CMDSEGMENT:}" token.

E.G.
Code: [Select]
Write [Blue] '{CMDSEGMENT:1}Mhz {CMDSEGMENT:3}Khz' to log
would output "118Mhz 0Khz" to the log on the main window with a blue icon when speaking "set radio frequency 118.0"

If you are unfamiliar with tokens, this topic may be of use in addition to the official documentation.

Note that command segments are not the same thing as dynamic command sections, though they are certainly related.
A segment consists of a single static or dynamic section, meaning that in the example above, the first segment is the static section "set radio frequency", at index 0 (I.E. command segments are zero-indexed). The second segment is the dynamic section "118" at index 1, the third segment is the static section "." at index 2 (note that the spaces at the beginning and end of a segment are trimmed, I.E. removed), and the fourth segment is the dynamic section "0" at index 3. Index 4 would return "Not set", as there is no predefined section at that index.
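
As a rough illustration of the indexing described above, the following Python sketch splits a "When I say" phrase into segments. It is only a model of the described behavior: "split_segments" and "cmdsegment" are hypothetical helper names, the spoken values for the dynamic sections are passed in by hand, and skipping empty static parts is an assumption on my part.

```python
import re

NOT_SET = "Not set"


def split_segments(template: str, spoken_values: list[str]) -> list[str]:
    """Split a phrase template into trimmed static and dynamic segments."""
    values = iter(spoken_values)
    segments = []
    for part in re.split(r"(\[[^\]]*\])", template):
        if part.startswith("["):
            # Dynamic section: use the value that was actually spoken.
            segments.append(next(values))
        elif part.strip():
            # Static section: leading and trailing spaces are trimmed.
            segments.append(part.strip())
    return segments


def cmdsegment(segments: list[str], index: int) -> str:
    """Zero-indexed lookup that falls back to "Not set" when out of range."""
    return segments[index] if 0 <= index < len(segments) else NOT_SET


segments = split_segments("set radio frequency [118..136].[0..39,25]", ["118", "0"])
print(segments)
print(f"{cmdsegment(segments, 1)}Mhz {cmdsegment(segments, 3)}Khz")
print(cmdsegment(segments, 4))
```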


If, on the other hand, you want both numbers as part of a single value separated by a decimal point, without the spaces in-between, the "{TXTNUM:}" token can extract that for you.

To do so, you would use the "{CMD}" token to get the spoken phrase in its entirety, and then pass that to the "{TXTNUM:}" token wrapped in double quotes (as it is rendered as a literal text value; as mentioned in the linked topic, tokens are not variables).

E.G.
Code: [Select]
Write [Blue] '{TXTNUM:"{CMD}"}' to log
would output "118.0" to the log on the main window with a blue icon when speaking "set radio frequency 118.0"
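
For those who find it easier to reason about this in code, the effect is roughly that of the Python sketch below. It assumes the token keeps only digits, decimal points and minus signs and discards everything else; check the official documentation for the exact rules, as this is a simplified model and "txtnum" is just an illustrative name.

```python
def txtnum(text: str) -> str:
    """Keep only numeric characters (assumed here: digits, '.' and '-')."""
    return "".join(ch for ch in text if ch in "0123456789.-")


# The generated phrase contains spaces around the decimal point,
# but those are dropped along with the words.
print(txtnum("set radio frequency 118 . 0"))  # 118.0
```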


If you experiment with this command, you'll eventually realize that there is a problem: The phrases "set radio frequency 118.25", "set radio frequency 118.5", and "set radio frequency 118.75" are numerically indistinguishable from the phrases "set radio frequency 118.250", "set radio frequency 118.500", and "set radio frequency 118.750"

To resolve this ambiguity, the command phrase needs to be divided up into a few separate variations, E.G. When I say "set radio frequency [118..136].0;set radio frequency [118..136].0[1..3,25];set radio frequency [118..136].[4..39,25]"

The first phrase will generate "set radio frequency 118 .0" to "set radio frequency 136 .0"
The second phrase will generate "set radio frequency 118 .0 25" to "set radio frequency 136 .0 75" (starting at .0 25 each time, E.G. "119 .0 25", "120 .0 25", etc...)
The third phrase will generate "set radio frequency 118 . 100" to "set radio frequency 136 . 975" (starting at . 100 each time, E.G. "119 . 100", "120 . 100", etc...)
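
The ambiguity, and the effect of splitting the phrase, can be checked with a short Python sketch. It only looks at 118Mhz and writes the values without spaces, as a numeric extraction would see them; "numeric_collisions" is a hypothetical helper, not anything VoiceAttack provides.

```python
from decimal import Decimal


def numeric_collisions(texts):
    """Return pairs of differently written values that are numerically equal."""
    seen = {}
    clashes = []
    for text in texts:
        value = Decimal(text)
        if value in seen:
            clashes.append((seen[value], text))
        else:
            seen[value] = text
    return clashes


# Single range "[0..39,25]": the decimals are written exactly as generated.
single_range = [f"118.{n * 25}" for n in range(0, 40)]

# Split phrases: ".0", ".0[1..3,25]" and ".[4..39,25]".
split_phrases = (
    ["118.0"]
    + [f"118.0{n * 25}" for n in range(1, 4)]
    + [f"118.{n * 25}" for n in range(4, 40)]
)

print(numeric_collisions(single_range))
print(numeric_collisions(split_phrases))
```

The first list contains three clashing pairs, while the split version contains none.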

If you're using the "{TXTNUM:}" token, that'll work. If you're using the "{CMDSEGMENT:}" token, you'll need to use a slightly different set of phrases, as the indices of the different segments should match up between phrases.

E.G. in the example above, when speaking "set radio frequency 118.0", the segment at index 0 would still be "set radio frequency", and at index 1 would still be "118", however the segment at index 2 would be ".0", and at index 3 would be "Not set".

This could also be an issue with spoken phrases like "set radio frequency 118.025", as the segment at index 2 would be ".0", and the segment at index 3 would be "25". If your output is in Khz, that's fine as is, but if your target application interprets it as a value after a decimal point, you'll need to manually specify the different variations as numeric ranges ignore zeroes at the start of numbers (E.G. "025" is interpreted as "25")

E.G. "set radio frequency [118..136].[0];set radio frequency [118..136].[025;050;075];set radio frequency [118..136].[4..39,25]" would consistently output the numeric segments at the same indices.



All that said, what if you absolutely need a large numeric range, E.G. you do in fact need to be able to speak frequencies from "1.0" to "999.999"?

Given that doing that in a single command phrase would produce 999000 phrase variations (999 × 1000), you're going to need to split that up into multiple inputs and use the "Wait For Spoken Response" action, or you'll need to use wildcards.


Using the "Wait For Spoken Response" action

The "Wait For Spoken Response" action allows you to specify spoken phrases using the same syntax as the "When I say" field. When the action is executed, the command will essentially pause and wait for spoken input (either for a specified number of seconds, or indefinitely if the "Timeout" value is set to 0).

There are three main caveats to this: you need to pause momentarily before speaking a phrase intended for this action if you are already speaking (I.E. it cannot be seamlessly chained with a normal command phrase, or another "Wait For Spoken Response" action); there is a limit of 500 phrase variations (to prevent performance issues); and any output only exists as a text variable value. The latter means that tokens such as "{CMD}" and "{CMDSEGMENT:}" cannot be used for the phrases recognized using this action (those tokens only apply to the phrase recognized to execute the command).


You can use the action to split up a given phrase, E.G. by inputting "set radio frequency [118..136]" into the "When I say" field, and "point 0;.025;.050;.075;. [4..39,25]" in the "Responses" field of the "Wait For Spoken Response" action (this example is used merely for consistency, and would normally be fine as a single phrase).

This would reduce the generated phrase count from 760 to 59 (19 from the "When I say" field, and 40 from the "Wait For Spoken Response" action; note that phrases generated by "Wait For Spoken Response" actions are not shown in the total derived count of the "Edit a Profile" window)
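
The counts can be verified with a few lines of Python. The lists below just spell out what the two fields would generate according to the syntax described earlier; the 500 limit is the one mentioned above for the "Wait For Spoken Response" action.

```python
RESPONSE_LIMIT = 500  # phrase variation limit of "Wait For Spoken Response"

# "When I say": set radio frequency [118..136]
when_i_say = [f"set radio frequency {mhz}" for mhz in range(118, 137)]

# "Responses": point 0;.025;.050;.075;. [4..39,25]
responses = ["point 0", ".025", ".050", ".075"] + [f". {n * 25}" for n in range(4, 40)]

single_phrase_total = len(when_i_say) * len(responses)
split_total = len(when_i_say) + len(responses)

print(len(when_i_say), len(responses), single_phrase_total, split_total)  # 19 40 760 59
print(len(responses) <= RESPONSE_LIMIT)  # True
```

Splitting turns a multiplication into an addition, which is why the count drops so sharply.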

However, because the speech recognition engine needs to finish recognition between triggering the command, and waiting for the spoken input for the "Wait For Spoken Response" action, rather than speaking something like "set radio frequency one one eight point zero", it would need to be "set radio frequency one one eight...point zero", with a pause between "eight" and "point".

While more cumbersome than a single phrase, because all possible phrase variations are predefined (I.E. compiled into a list and passed to the speech recognition engine so it can compare its input to the items in the list) this method still allows for fairly accurate recognition.


Retrieving the spoken number can be done using a combination of the "{CMDSEGMENT:}" and "{TXTNUM:}" tokens, E.G.
Code: [Select]
Write [Blue] '{CMDSEGMENT:1}.{TXTNUM:~response}' to log
would output "118.0" to the log on the main window with a blue icon when speaking "set radio frequency 118...point 0", assuming you have the "Wait For Spoken Response" action configured to output to the text variable named "~response".

Note that in the example "Responses" input of "point 0;.025;.050;.075;. [4..39,25]", "point 0" writes out "point" rather than ".0" as the speech recognition engine more consistently recognized that on my machine (this is a caveat of the linguistic rules Microsoft built into the speech recognition engine, which would normally also cause it to recognize "point zero two five" as "0.025", but with the predefined phrases I found that to work fairly well regardless).


The 500 phrase variation limit does mean that, given the example of "1.0" to "999.999", you'd either need to split it into "set radio frequency [1..999] point" in the "When I say" field and three "Wait For Spoken Response" actions with "[0..9]" in their "Responses" field (assuming that "set radio frequency nine nine nine...nine...nine...nine" is slightly more natural than "set radio frequency nine nine nine...nine nine...nine"), or use the last option: Wildcards.


Using Wildcards

Wildcards allow you to specify a partial phrase that will be used to trigger a command, while using the speech recognition engine's dictation features to attempt to freely recognize what you are saying.

In theory, this is the most flexible method of recognizing commands, however in practice, because freely recognizing speech without any real context is a very difficult task, it may not be as reliable as desired.

It is in part because of this that the official documentation notes that this feature is "somewhat unsupported".


Do make sure that your speech recognition profile is well-trained (run through the training at least three times; instructions for starting the training can be found here), and that you are in a quiet environment with a (headset) microphone well-suited to speech recognition.

Also note that this feature requires the SAPI speech recognition engine, and is not available with Speech platform 11 (as Microsoft did not design the latter to have dictation support in general).
If you have the "Use Built-In SAPI Speech Engines" option enabled on the "System / Advanced" tab of the VoiceAttack options window, dictation should normally be available.


Wildcards (denoted by the "*" character) can be placed before or after a specified phrase, but not in the middle. Note that each semicolon-separated phrase is separate, and so each can have its own configuration of wildcards, and dynamic command sections can form part of that specified phrase, but cannot contain the "*" character.

E.G. "*and something;something [else;] and*" would allow you to speak things like "I'd like to order fries and something" or "something else and a milkshake" or "something and a refrigerator full of motor oil in a house with a blue roof on the prairie", etc...
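
Conceptually, the matching works along the lines of the Python sketch below. This is only a mental model: the real matching is done by the speech recognition engine on dictated audio rather than by comparing text, "matches_wildcard" is a made-up name, and treating a wildcard on both sides as "contains" is an assumption. Dynamic sections are assumed to have been expanded already.

```python
def matches_wildcard(pattern: str, spoken: str) -> bool:
    """Check a recognized sentence against one phrase with optional wildcards."""
    leading = pattern.startswith("*")
    trailing = pattern.endswith("*")
    # Normalize spacing in both the specified phrase and the spoken text.
    core = " ".join(pattern.strip("*").split())
    spoken = " ".join(spoken.split())
    if leading and trailing:
        return core in spoken            # assumption: wildcard on both sides
    if leading:
        return spoken.endswith(core)     # "*and something"
    if trailing:
        return spoken.startswith(core)   # "something else and*"
    return spoken == core                # no wildcard: exact phrase


print(matches_wildcard("*and something", "I'd like to order fries and something"))
print(matches_wildcard("something else and*", "something else and a milkshake"))
```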

So aside from the specified phrase that must be present in order for a command to be recognized, everything else is undefined, and sometimes unpredictable.


Going back to the earlier example of actually recognizing a number, the specified phrase could be "set radio frequency*", thus leaving anything after that up to the speech recognition engine, which is generally fairly accurate when recognizing numbers in this way.

One caveat to note is that when recognizing whole numbers (I.E. without a decimal separator), single digits (numbers from 0 to 9) are written out by the speech recognition engine. E.G. "1" is transcribed as "one".
You can use the "{TXTWORDTONUM:}" token to work around this.
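
To illustrate the kind of conversion involved, here is a minimal Python sketch that replaces written-out single digits with numerals. It is not a description of how the token itself is implemented (see the official documentation for what it actually supports); "words_to_digits" and the word list are purely illustrative.

```python
import re

# Only the single digits that the engine writes out as words.
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

_WORD_PATTERN = re.compile(r"\b(" + "|".join(DIGIT_WORDS) + r")\b", re.IGNORECASE)


def words_to_digits(text: str) -> str:
    """Replace whole words such as "one" with the matching digit."""
    return _WORD_PATTERN.sub(lambda m: DIGIT_WORDS[m.group(1).lower()], text)


print(words_to_digits("set radio frequency one"))  # set radio frequency 1
```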


Assuming numbers with a decimal separator, the action list could again look like:
Code: [Select]
Write [Blue] '{TXTNUM:"{CMD}"}' to log
Which would output "999.999" to the log on the main window with a blue icon when speaking "set radio frequency 999.999", provided that is what the speech recognition engine recognized.


Splitting the digits before and after the decimal point is more complex in this case, given that the "{CMDSEGMENT:}" token is not available for command phrases containing wildcards.

E.G.
Code: [Select]
Write [Blue] '{TXTSUBSTR:"{TXTNUM:"{CMD}"}":0:{TXTPOS:".":"{TXTNUM:"{CMD}"}":}}Mhz {TXTSUBSTR:"{TXTNUM:"{CMD}"}":{EXP:{TXTPOS:".":"{TXTNUM:"{CMD}"}":}+1}:}Khz' to log
would output "999Mhz 999Khz" to the log on the main window with a blue icon when speaking "set radio frequency 999.999", provided that is what the speech recognition engine recognized.

This combination of tokens does produce the desired output, though a feature suggestion for a simpler method to split text has been submitted.
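
If the nested tokens are hard to follow, the same logic reads like this in Python: find the position of the decimal point, take everything before it, then everything after it. The function name "split_frequency" is hypothetical, and the input is assumed to be the already extracted number text (E.G. "999.999").

```python
def split_frequency(number_text: str) -> tuple[str, str]:
    """Split "999.999" into the parts before and after the decimal point."""
    dot = number_text.find(".")  # position of the decimal point, -1 if absent
    if dot == -1:
        return number_text, ""
    # Everything before the point, then everything after it.
    return number_text[:dot], number_text[dot + 1:]


mhz, khz = split_frequency("999.999")
print(f"{mhz}Mhz {khz}Khz")  # 999Mhz 999Khz
```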

SemlerPDX

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 291
  • Upstanding Lunatic
    • My AVCS Homepage
Re: Speaking numbers
« Reply #1 on: May 21, 2021, 03:19:45 PM »
Excellent post!  Lots of great info here!

zyll

  • Newbie
  • *
  • Posts: 7
Re: Speaking numbers
« Reply #2 on: July 17, 2023, 04:15:21 PM »
hello, what can we do about VoiceAttack hearing the spoken phrase correctly, but not actually being able to match it to the expected macro?

Specifically, why would voice detection hear "UHF set 251.0" but it cannot correlate it to a VA voice command as shown here:



the voice command is structured correctly, I have other radios I tune just fine (like the VHF one above it works). For whatever reason though my UHF radio doesn't hear "UHF set 2 5 1 decimal 0", even though what it did hear is identical, and thus is not an issue with hearing, but more interpreting

SemlerPDX

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 291
  • Upstanding Lunatic
    • My AVCS Homepage
Re: Speaking numbers
« Reply #3 on: July 17, 2023, 07:01:41 PM »
I would assume that it was not recognized because the confidence was too low.  You might try lowering the required confidence either for this command or globally in VoiceAttack options under the Recognition tab.

Pfeil

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4786
  • RTFM
Re: Speaking numbers
« Reply #4 on: July 17, 2023, 10:35:12 PM »
You can enable the "Show Confidence Level" option on the "Recognition" tab of the VoiceAttack options window, to get a visual indication of the speech recognition's confidence with each relevant log entry on VoiceAttack's main window.

Neither "UHF" nor "VHF" are dictionary words, so you may want to add them to the speech recognition dictionary.
Click the wrench icon on VoiceAttack's main window, and on the "Recognition" tab of the VoiceAttack options window click the "Utilities >" button, then choose "Add/Remove Dictionary Words"

You may also notice that my examples generally don't split up whole numbers into different dynamic command sections, which is because in my experience the Microsoft speech recognition system recognizes numbers spoken as separate digits as part of a larger number, rather than separate numbers, under normal circumstances.

zyll

  • Newbie
  • *
  • Posts: 7
Re: Speaking numbers
« Reply #5 on: July 18, 2023, 01:58:15 PM »
Thanks guys for the feedback. "Show Confidence Level" is already on, this one command is just completely unrecognized. Even when I set minimum confidence level to 5% for this command, it doesn't detect it.
Interestingly, I did previously try adding UHF to the dictionary words, but it didn't help. VHF is detected fine and it's not even in my dictionary.

Gary

  • Administrator
  • Hero Member
  • *****
  • Posts: 2832
Re: Speaking numbers
« Reply #6 on: July 18, 2023, 02:13:30 PM »
Have you tried dropping the command and creating a new one with the same 'When I say' value?  Wondering if there's something broken in the command somehow.

zyll

  • Newbie
  • *
  • Posts: 7
Re: Speaking numbers
« Reply #7 on: July 20, 2023, 09:08:28 AM »
it's really odd. I think there's something wrong with UHF as a keyword, though I don't know how it's different from VHF, HF, FM1, FM2 which all work. I tried with UHF in my dictionary, and removed from my dictionary.

I dumbed it down to a simple command:


this results in:


But if I just change UHF to Mike, then it works with 96% confidence.


If I remove the decimal part from the UHF version, it also works fine:




Gary

  • Administrator
  • Hero Member
  • *****
  • Posts: 2832
Re: Speaking numbers
« Reply #8 on: July 20, 2023, 09:16:09 AM »
Can you copy and paste your 'When I say' for that command here?

zyll

  • Newbie
  • *
  • Posts: 7
Re: Speaking numbers
« Reply #9 on: July 20, 2023, 09:45:46 AM »
certainly:
UHF set [0..249] [decimal;.] [0..9]

Gary

  • Administrator
  • Hero Member
  • *****
  • Posts: 2832
Re: Speaking numbers
« Reply #10 on: July 20, 2023, 08:45:35 PM »
I'm also having real difficulty with the speech engine when saying, 'UHF' anything -  'VHF' seems to work great tho.  I'll see if there's anything I can come up with for a workaround. 

weeeeeeeird

zyll

  • Newbie
  • *
  • Posts: 7
Re: Speaking numbers
« Reply #11 on: July 21, 2023, 10:38:33 AM »
Thanks for looking into this Gary.

I agree, it's very weird. It's not like VoiceAttack is mishearing UHF, it actually hears it fine, it just acts in an unpredictable manner with the subsequent voice command after it hears UHF.

SemlerPDX

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 291
  • Upstanding Lunatic
    • My AVCS Homepage
Re: Speaking numbers
« Reply #12 on: July 21, 2023, 02:18:28 PM »
FTR, I tested this with homophones and it worked with very high confidence (though for personal preference, I did change the word "decimal" to have the option for "point", which is more natural in English)

Code: [Select]
You aych F set [0..249] [decimal;point;.] [0..9]


I use homophones extensively in my public profiles to achieve consistent recognition for a variety of users, some speaking English as a second language.  For what it's worth, I also have failures when it is "UHF", too - can't get around it, even adding to Dictionary.

zyll

  • Newbie
  • *
  • Posts: 7
Re: Speaking numbers
« Reply #13 on: July 21, 2023, 04:17:15 PM »
even with the exact same homophone, I get the same:


I almost feel like a reset of Microsoft's Speech Recognizer is in order

Pfeil

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4786
  • RTFM
Re: Speaking numbers
« Reply #14 on: July 22, 2023, 12:44:48 AM »
Did you remove "UHF" from the speech recognition dictionary? If not, that might bias it toward recognizing the dictionary term rather than the homophone.


Closest you can get to "resetting" the speech recognition system, aside from reinstalling Windows itself, is to create a new speech recognition profile (instructions for that can be found in this topic)

However, given that it's happening to multiple users, it would appear this is a quirk of said speech recognition system, not one specific to your configuration.


The Microsoft speech recognition system comes with a number of behaviors that aren't (or are no longer) documented, and can't be modified by anyone but Microsoft themselves. This seems to be one of them.
« Last Edit: July 22, 2023, 10:26:09 AM by Pfeil »

zyll

  • Newbie
  • *
  • Posts: 7
Re: Speaking numbers
« Reply #15 on: July 22, 2023, 10:25:21 AM »
such an unusual quirk. I did remove UHF from the speech recognition dictionary. I am going to chalk this up to Microsoft bizarreness. Appreciate you guys all weighing in, I had a feeling it was an odd edge-case.

UseLessUK

  • Newbie
  • *
  • Posts: 6
Re: Speaking numbers
« Reply #16 on: July 29, 2023, 02:22:47 PM »
I tried with:
Code: [Select]
U H F set [0..249] [decimal;.] [0..9]
and it worked fine.