I take it you're avoiding using text-to-speech on purpose?
If you rename those files after the digits they speak ("0.mp3" through "9.mp3"), this can be done much more elegantly, as you don't have to convert the numerals the token outputs into file names:
Begin Text Compare : [{TXTSUBSTR:"{TIMEHOUR24}":0:1}] Does Not Equal '0'
Play sound, '{VA_SOUNDS}\{TXTSUBSTR:"{TIMEHOUR24}":0:1}.mp3' (and wait until it completes)
End Condition
Play sound, '{VA_SOUNDS}\{TXTSUBSTR:"{TIMEHOUR24}":1:1}.mp3' (and wait until it completes)
Play sound, '{VA_SOUNDS}\{TXTSUBSTR:"{TIMEMINUTE}":0:1}.mp3' (and wait until it completes)
Play sound, '{VA_SOUNDS}\{TXTSUBSTR:"{TIMEMINUTE}":1:1}.mp3'
E.g. if it's 5:33, it'd play "5.mp3", "3.mp3", and "3.mp3" again (the leading "0" of the hour is skipped by the condition); if it's 12:09, it'd play "1.mp3", "2.mp3", "0.mp3", and "9.mp3".
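If you want to sanity-check the substring logic outside VoiceAttack, here's a rough Python sketch of the same idea. It is not VoiceAttack code: `digit_files` is a made-up helper name, and it assumes both time tokens return two-digit, zero-padded text (e.g. "05" and "09"), which is what the command above relies on.

```python
from datetime import time


def digit_files(t: time) -> list:
    """Return the sound files the command would play for time t, in order."""
    # Assumption: mirrors {TIMEHOUR24} and {TIMEMINUTE} as zero-padded, two-character text.
    hour = f"{t.hour:02d}"
    minute = f"{t.minute:02d}"

    files = []
    # Same as the text compare: skip the first hour digit when it is '0'.
    if hour[0] != "0":
        files.append(f"{hour[0]}.mp3")
    files.append(f"{hour[1]}.mp3")
    # Both minute digits are always played, including a leading zero.
    files.append(f"{minute[0]}.mp3")
    files.append(f"{minute[1]}.mp3")
    return files


print(digit_files(time(5, 33)))
print(digit_files(time(12, 9)))
```

Note that only the first hour digit is ever skipped, so midnight would still play "0.mp3" for the hour before the two minute digits.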