I was pondering this question earlier. What clues do modern algorithms (specifically those that convert voice to text) use to determine which homophone was said (e.g. to, too, or two)?
Do they use contextual clues? Sentence structure? Perhaps there are slight differences in the way each word is usually pronounced (for example, I usually hold the o sound longer in two than in to). A combination of the first two seems most plausible.
Do they use contextual clues?
Yes, ASR systems use cross-word context. For example, if the previous word is "going", the next word is much more likely to be "to" than "two". The decoder assigns a probability to each candidate word sequence and selects the most probable one.
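To make the "going to" example concrete, here is a minimal sketch of a bigram lookup in Python. The probability values, the FLOOR constant and the pick_homophone helper are all invented for illustration; a real recognizer estimates these numbers from a large text corpus and scores whole sentences rather than single word pairs.

```python
import math

# Toy bigram probabilities P(next | previous).
# All values are made up for illustration, not taken from any real model.
BIGRAM = {
    ("going", "to"): 0.30,
    ("going", "too"): 0.01,
    ("going", "two"): 0.001,
    ("these", "two"): 0.20,
    ("these", "to"): 0.005,
    ("these", "too"): 0.005,
}
FLOOR = 1e-6  # assumed probability for word pairs missing from the table


def pick_homophone(previous, candidates=("to", "too", "two")):
    """Return the candidate with the highest log P(candidate | previous)."""
    return max(
        candidates,
        key=lambda word: math.log(BIGRAM.get((previous, word), FLOOR)),
    )


print(pick_homophone("going"))  # prints "to"
print(pick_homophone("these"))  # prints "two"
```

The three candidates sound the same, so in this sketch only the preceding word decides between them.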
Sentence structure?
Yes, ASR systems also use more advanced language models that predict probable words from the wider context, not just from the previous word.
Perhaps there are slight differences in the way each word is usually pronounced (for example, I usually hold the o sound longer in two than in to).
That too. In fact, "too" and "to" are often pronounced quite differently: "to" is frequently reduced to a schwa.
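Pronunciation and context are typically combined into one score per candidate: the log-probability from the acoustic model plus a weighted log-probability from the language model. The sketch below shows that combination in Python. The numbers, the best_word helper and the lm_weight default are assumptions for illustration only, not values from any particular recognizer.

```python
import math


def best_word(acoustic, lm, lm_weight=1.0):
    """Pick the word maximizing log P(audio | word) + lm_weight * log P(word | context)."""
    return max(
        acoustic,
        key=lambda word: math.log(acoustic[word]) + lm_weight * math.log(lm[word]),
    )


# Hypothetical acoustic scores: a long, unreduced vowel was heard,
# so the audio alone slightly favours "too" and "two" over "to".
acoustic = {"to": 0.20, "too": 0.45, "two": 0.35}

# Hypothetical language model scores for two different preceding words.
lm_after_going = {"to": 0.30, "too": 0.01, "two": 0.001}
lm_after_me = {"to": 0.02, "too": 0.20, "two": 0.01}

print(best_word(acoustic, lm_after_going))  # prints "to"
print(best_word(acoustic, lm_after_me))     # prints "too"
```

With these made-up numbers, the context after "going" outweighs the acoustic preference for "too", while after "me" both sources agree on "too".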
If you are interested in speech recognition algorithms, it may make sense to read a book on ASR or take an online course. For details, see:
https://sourceforge.net/p/cmusphinx/discussion/speech-recognition/thread/3ea89abf/