Normally the speech recognition engine will use the format of the predefined phrase: for example, "instrument [1..10]" should output digits ("instrument 1", and so on) rather than words ("instrument one", etc.).
When using dictation (including wildcards in command names), single-digit numbers will be output as words, and numbers with more digits will be output as digits. This behavior is built into the Microsoft speech recognition engine, and it is what the "{TXTWORDTONUM:}" token is designed to work around.
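
To make the idea concrete, here is a rough Python sketch of the kind of conversion involved: replacing single-digit number words in dictated text with digits. This is only an illustration, not how the "{TXTWORDTONUM:}" token is actually implemented. The function name `words_to_digits`, the word list, and the whole-word, case-insensitive matching are all assumptions made for the example.

```python
import re

# Hypothetical word list for illustration only; which words the real
# token recognizes is not stated above.
WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# Match whole words only, so "someone" is not turned into "some1".
_NUMBER_WORD = re.compile(
    r"\b(" + "|".join(WORD_TO_DIGIT) + r")\b", re.IGNORECASE
)


def words_to_digits(text: str) -> str:
    """Replace single-digit number words in text with their digits."""
    return _NUMBER_WORD.sub(
        lambda match: WORD_TO_DIGIT[match.group(1).lower()], text
    )


if __name__ == "__main__":
    print(words_to_digits("instrument one"))
    print(words_to_digits("instrument 12"))
```

With this sketch, a dictated "instrument one" becomes "instrument 1", while text that already contains digits, such as "instrument 12", passes through unchanged.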