The Microsoft speech recognition system has a few built-in patterns like that (unfortunately, as far as I'm aware, there is very little official documentation on them).
Using that output as a command phrase would require wildcards, which are noted to be a somewhat unsupported feature.
Part of the reason is that the dictation output from the speech recognition system isn't necessarily consistent, even when it recognizes what you're saying perfectly. Dictation output is what must be relied on when not using predefined phrases, which is the case with wildcards.
E.g., in this case (assuming the US English engine), "three minutes four seconds" will be transcribed as "00:03:04", but "three minutes" will instead be transcribed as "3 minutes", whereas "three minutes zero seconds" is, rather oddly, transcribed as "3 minutes zero secs".
To parse output in the hh:mm:ss pattern specifically, you could use the {TXTSUBSTR:} token to retrieve the relevant sets of digits (the simplest way to remove the leading zeroes, should you need to, may be to just convert them to integer values).
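As a rough illustration of that logic outside of the token system, here is a minimal Python sketch. It assumes the text is always exactly in the "hh:mm:ss" form with two digits per field, and the function name parse_hhmmss is made up for this example. Each slice plays the role of one substring lookup at a fixed offset, and the integer conversion is what drops the leading zeroes.

```python
def parse_hhmmss(text):
    """Split an "hh:mm:ss" string into integer hours, minutes, seconds.

    Assumes two digits per field at fixed positions (0-1, 3-4, 6-7),
    similar to taking substrings by start offset and length.
    """
    hours = int(text[0:2])    # int("00") -> 0, so leading zeroes vanish
    minutes = int(text[3:5])
    seconds = int(text[6:8])
    return hours, minutes, seconds


print(parse_hhmmss("00:03:04"))  # (0, 3, 4)
```

This only covers the "00:03:04" style of output; the other transcriptions mentioned above ("3 minutes", "3 minutes zero secs") would need separate handling.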
With a predefined command like the one in your example, it's worth evaluating how much precision is actually required.
E.g., "Timer [0..60] minutes [0..12,5] seconds" generates 793 permutations, while "Timer [0..60] minutes [0..6,10] seconds" generates 427.
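Those counts can be checked with a quick Python sketch. It assumes that a range like [0..12,5] expands to every integer from 0 to 12 multiplied by 5 (so 0, 5, 10, ... 60), which is consistent with the numbers above. The helper name expand_range is made up for this example.

```python
def expand_range(start, end, multiplier=1):
    """Values produced by a [start..end,multiplier] style range (assumed semantics)."""
    return [n * multiplier for n in range(start, end + 1)]


minutes = expand_range(0, 60)          # 61 values: 0 through 60
seconds_by_5 = expand_range(0, 12, 5)  # 13 values: 0, 5, ... 60
seconds_by_10 = expand_range(0, 6, 10) # 7 values: 0, 10, ... 60

# Total phrases = product of the number of options in each section.
print(len(minutes) * len(seconds_by_5))   # 793
print(len(minutes) * len(seconds_by_10))  # 427
```

The total is just the product of the section sizes, so widening the step on the seconds range is the cheapest way to cut the phrase count.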