The simple way to mask the time it takes to reach the cloud, process the request, and return the results is to cover most of it with a spoken confirmation while it's happening: the stock "Yes, sir, I'll look that up now", or for a JARVIS effect, some sarcastic backtalk. You could even parse the request to see whether it has multiple parts that can be split up, and start answering the first part while the rest is still being processed. Lastly, you need a way to recover gracefully, TTS-wise, from an unusually long wait or a failure.
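As a rough illustration of the filler-phrase idea, here is a minimal Python sketch using asyncio. The names `speak`, `cloud_lookup`, and `answer` are hypothetical stand-ins, not any real TTS or cloud API: `speak` just records what would be said, and `cloud_lookup` fakes the round trip with a sleep. The point is only the ordering: start the request, talk over it, then fall back to an apology if it runs past a timeout.

```python
import asyncio


async def speak(text, spoken):
    # Hypothetical stand-in for a real TTS call: records the phrase
    # and pretends that saying it takes a little time.
    spoken.append(text)
    await asyncio.sleep(0.01)


async def cloud_lookup(query, delay):
    # Hypothetical stand-in for the cloud round trip.
    await asyncio.sleep(delay)
    return f"Result for {query}"


async def answer(query, delay, timeout, spoken):
    # Start the cloud request first so it runs while the filler is spoken.
    task = asyncio.create_task(cloud_lookup(query, delay))
    await speak("Yes, sir, I'll look that up now.", spoken)
    try:
        # Wait for whatever time is left; wait_for cancels the task on timeout.
        result = await asyncio.wait_for(task, timeout)
    except asyncio.TimeoutError:
        # Graceful recovery from an unusually long wait.
        await speak("Sorry, that is taking longer than expected.", spoken)
        return None
    await speak(result, spoken)
    return result


if __name__ == "__main__":
    said = []
    asyncio.run(answer("the weather", delay=0.05, timeout=1.0, spoken=said))
    print(said)
```

In a real setup the filler phrase would be picked from a small pool so it doesn't get repetitive, and the timeout would be tuned to how long a listener will tolerate silence.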
But part of your question, and I think the bigger challenge, is speech recognition and parsing what's being asked for. Just look at the many posts on here; often they're about how to get better recognition of the words. If the engine doesn't understand what you want, the end problem of TTS is a bit moot.
One interesting TTS solution for common phrases, which someone here or on one of the Elite boards built, was to store the results from the cloud engine they were querying. The first time a new phrase was needed it was sent to the cloud, but from then on, if it was found in the database, the stored response was played. This was done to play NPC text responses audibly. Combined with the other techniques, you could hide what was instant and what wasn't, and after a while not need the cloud so much.
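I don't have that poster's code, so the following is only a guess at the shape of such a cache, sketched in Python with the standard sqlite3 module. `PhraseCache` and its `synthesize` callback are hypothetical names; `synthesize` stands for whatever cloud TTS call you actually use and is assumed to return audio bytes. Normalizing the text (trimmed, lower-cased) before hashing is my own assumption, so that trivially different spellings share one entry.

```python
import hashlib
import sqlite3


class PhraseCache:
    """Store synthesized audio so each phrase hits the cloud only once."""

    def __init__(self, synthesize, path=":memory:"):
        # synthesize: hypothetical callable taking text, returning audio bytes.
        self.synthesize = synthesize
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS phrases (key TEXT PRIMARY KEY, audio BLOB)"
        )

    def get_audio(self, text):
        # Assumption: normalize before hashing so near-identical text matches.
        key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        row = self.db.execute(
            "SELECT audio FROM phrases WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: no cloud round trip
        audio = self.synthesize(text)  # cache miss: ask the cloud once
        self.db.execute(
            "INSERT INTO phrases (key, audio) VALUES (?, ?)", (key, audio)
        )
        self.db.commit()
        return audio
```

Passing a file path instead of `":memory:"` keeps the cache between sessions, which is what lets the cloud dependency shrink over time.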