I use TIdHTTP to fetch web content. The response header indicates that the content is encoded as UTF-8. I want to print the content to the console as CP936 (Simplified Chinese), but the output is not readable.
Result := TEncoding.Utf8.GetString(ResponseBuffer);
I do the same thing in Python (using httplib2) without any problems.
import httplib2

def python_try():
    conn = httplib2.Http()
    response, content = conn.request(...)
    print(content.decode('utf8'))  # readable in console
UPDATE 1
I debugged the raw response and noticed that the content is gzipped.
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: text/html;charset=UTF-8
Transfer-Encoding: chunked
Content-Encoding: gzip
Vary: Accept-Encoding
Date: Mon, 24 Dec 2012 15:27:44 GMT
Connection: Keep-Alive
I tried to assign a TIdCompressorZLib instance to the TIdHTTP instance. Unfortunately, the application crashes while decompressing the gzipped content. The test address is "http://www.baidu.com" (encoding=gb2312).
UPDATE 2
I also tried to download a gzipped jQuery script file, which contains only ASCII characters. This time it works, which suggests the problem is in the Indy library. If I am not mistaken, I should close the question.
TIdHTTP handles the gzip decompression for you, if you have a TIdCompressorZLib component assigned to the TIdHTTP.Compressor property. Otherwise, you will have to decompress it manually (TIdHTTP will not send an Accept-Encoding header by default if the Compressor property is not assigned).

As for the UTF-8 encoding, TIdHTTP handles that for you as well, if you are calling the overloaded version of the TIdHTTP.Get() or TIdHTTP.Post() method that returns a String value instead of filling a TStream object. It will decode the UTF-8 to UTF-16 for you. To convert that to CP936, you can let the RTL do the conversion for you.

Do not use any encoding auto-detection; it cannot be done reliably. Simply trust the Content-Type header.
If the Content-Type header is missing or lying, then you need to detect the encoding. Although I would not use any algorithm that would misdetect UTF-8 as CP936...
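One conservative fallback for the missing-header case, sketched in Python 3 as an assumption rather than something the answers prescribe: try strict UTF-8 decoding first and assume CP936 only if that fails. Valid UTF-8 input can then never be reported as CP936, although CP936 bytes that happen to form valid UTF-8 would still be misread. The function name `guess_decode` is hypothetical.

```python
def guess_decode(raw):
    """Return (text, encoding) for a body whose charset was not declared."""
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid UTF-8 sequence.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption of this sketch: anything that is not valid UTF-8 is CP936.
        return raw.decode("cp936", errors="replace"), "cp936"
```

For example, `guess_decode("中文".encode("cp936"))` falls through to the CP936 branch because those bytes are not a valid UTF-8 sequence.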