What is the official encoding for Twitter's streaming API? My best guess is UTF-8 based on what I've seen, but I would like to avoid making assumptions.
The only place on the Twitter site where I've seen them even hint at an official encoding is this passage:
"Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation"
https://dev.twitter.com/docs/counting-characters
Does anyone have a more "official" answer? I'm writing a state-machine tokenizer for the streaming API which makes certain assumptions. The last thing I want is to encounter something like UTF-16.
Thanks! :D
One indicator is that the JSON format, which Twitter uses for virtually everything, defaults to UTF-8 (the original JSON spec, RFC 4627, also permits UTF-16 and UTF-32, but UTF-8 is the default). They should also send a Content-Type HTTP header whose charset parameter names the encoding (but I haven't confirmed this). If you're using XML instead, the XML declaration at the top of the document explicitly states the encoding, which is UTF-8.
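If you'd rather check than assume, here is a minimal Python sketch of both hints. The helper names `charset_from_content_type` and `sniff_json_encoding` are made up for illustration, and the header values are examples, not captured Twitter responses. The sniffing function follows the null-byte pattern described in RFC 4627 section 3, which assumes the first two characters of the JSON text are ASCII and that there is no byte order mark.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or `default`."""
    msg = Message()
    msg["Content-Type"] = header_value
    charset = msg.get_param("charset")
    # get_param returns None when the parameter is absent.
    return charset.lower() if isinstance(charset, str) else default


def sniff_json_encoding(first_bytes):
    """Guess the encoding of a JSON text from its first four octets.

    Uses the null-byte patterns listed in RFC 4627, section 3.
    """
    head = first_bytes[:4]
    if len(head) < 4:
        return "utf-8"  # too short to tell; fall back to the default
    nulls = tuple(b == 0 for b in head)
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    return patterns.get(nulls, "utf-8")           # xx xx xx xx
```

For the tokenizer, a check like this only needs to run once on the first chunk of the stream. If it returns anything other than "utf-8", you can bail out early instead of feeding unexpected bytes into the state machine.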
If they say they use UTF-8, that's a pretty good bet. UTF-8 is very common, and UTF-16 in the wild is pretty rare from what I've seen.
There are also some clever charset-detection libraries you could use, if you were so inclined, to prove it to yourself by testing how various characters actually come through in the stream. The best of these is the one Firefox uses to detect the encoding of webpages as they're loaded: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
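A much simpler check than full charset detection (and not what the Mozilla detector does) is to run the raw bytes through a strict incremental UTF-8 decoder and see whether it ever complains. Below is a minimal Python sketch using only the standard library; `make_utf8_checker` is a hypothetical helper name, and the chunks stand in for whatever your HTTP client hands you.

```python
import codecs


def make_utf8_checker():
    """Return a feed(chunk, final=False) function that decodes strictly.

    The incremental decoder buffers a multi-byte sequence that is split
    across chunk boundaries, and raises UnicodeDecodeError as soon as it
    sees bytes that cannot be UTF-8.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

    def feed(chunk, final=False):
        return decoder.decode(chunk, final)

    return feed


if __name__ == "__main__":
    feed = make_utf8_checker()
    data = "caf\u00e9 \u2603".encode("utf-8")
    # Split in the middle of the two-byte sequence for U+00E9.
    print(repr(feed(data[:4])))
    print(repr(feed(data[4:], final=True)))
```

Passing this check is evidence, not proof: plain ASCII text encoded as UTF-16 without a byte order mark is still a valid UTF-8 byte sequence (it just contains NUL characters), so it is worth treating unexpected NUL bytes as a red flag too.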