It seems that GHC is at least inconsistent in the character encoding it decides to decode from.
Consider a file, omatase-shimashita.txt, with the following content, encoded in UTF-8: お待たせしました
readFile seems to read this in properly...
Prelude> content <- readFile "/home/chris/Desktop/omatase-shimashita.txt"
Prelude> length content
8
Prelude> putStrLn content
お待たせしました
However, if I write a simple "echo" server, it does not decode with a default of UTF-8. Consider the following code that handles an incoming client:
handleClient handle = do
    line <- hGetLine handle
    putStrLn $ "Read following line: " ++ toString line
    handleClient handle
And the relevant client code, explicitly sending UTF-8:
Data.ByteString.hPutStrLn handle $ Codec.Binary.UTF8.Generic.fromString "お待たせしました"
Is this not inconsistent behavior? Is there any method to this madness? I am planning to rewrite my application(s) to explicitly use ByteString objects and explicitly encode and decode using Codec.Binary.UTF8, but it would be good to know what's going on here anyway... :o/
UPDATE: I am running on Ubuntu Linux, version 10.10, with a locale of en_US.UTF-8...
$ cat /etc/default/locale
LANG="en_US.UTF-8"
$ echo $LANG
en_US.UTF-8
Your first example uses the standard IO library, System.IO. Operations in this library use the default system encoding (also known as localeEncoding) unless you specify otherwise. Presumably your system is set up to use UTF-8, so that is the encoding used by putStrLn, hGetContents and so on.
Your second example uses Data.ByteString. Since this library deals in sequences of bytes only, it does no encoding or decoding. So Data.ByteString.hGetLine converts the bytes in the file directly to a ByteString.
The best way to do text I/O in general is to use the text package.
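As a rough illustration of that last point, here is a minimal sketch that takes raw bytes (as Data.ByteString.hGetLine would return them) and decodes them explicitly with the text package. It assumes the bytestring and text packages are available; decodeLine is a hypothetical helper name, and the bytes are built in memory rather than read from a socket so the example stays self-contained.

```haskell
module Main where

import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.IO as TIO

-- | Hypothetical helper: decode raw bytes as UTF-8, returning an error
-- message instead of throwing when the input is not valid UTF-8.
decodeLine :: BS.ByteString -> Either String T.Text
decodeLine bytes =
  case TE.decodeUtf8' bytes of
    Left err   -> Left (show err)
    Right text -> Right text

main :: IO ()
main = do
  -- Stand-in for the bytes a client would send: the UTF-8 encoding of the text.
  let bytes = TE.encodeUtf8 (T.pack "お待たせしました")
  print (BS.length bytes)  -- number of bytes, not characters
  case decodeLine bytes of
    Left err   -> putStrLn ("Invalid UTF-8: " ++ err)
    Right text -> do
      print (T.length text)  -- number of characters after decoding
      TIO.putStrLn text      -- encoded for output using stdout's encoding
```

The byte count and the character count differ because each of these characters occupies several bytes in UTF-8, which is why treating the undecoded bytes as characters gives garbled output.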
Which version of GHC are you using? Older versions especially didn't handle Unicode I/O very well.
This section in the GHC documentation describes how to change input/output encodings:
http://haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/System-IO.html#23
Also, the documentation says this:
Maybe this has something to do with your problem? If GHC has defaulted to something other than UTF-8 somewhere, or your handle has been manually set to use a different encoding, that might explain the problem. If you're just trying to echo text at the console, then probably some kind of console code-page funniness is going on. I know I've had similar problems in the past with other languages like Python when printing Unicode in a Windows console.
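One way to check which encoding is actually in effect is to ask the handles directly. The following is a small sketch using only System.IO; describe is a hypothetical helper, and it assumes a version of base in which TextEncoding has a Show instance (newer than the 4.2 documentation linked above).

```haskell
module Main where

import System.IO

-- | Hypothetical helper: render the result of hGetEncoding.
-- A handle in binary mode has no encoding and reports Nothing.
describe :: Maybe TextEncoding -> String
describe Nothing    = "binary (no encoding)"
describe (Just enc) = show enc

main :: IO ()
main = do
  -- The encoding GHC derived from the current locale settings.
  putStrLn ("localeEncoding: " ++ show localeEncoding)
  -- The encodings currently attached to the standard handles.
  inEnc  <- hGetEncoding stdin
  outEnc <- hGetEncoding stdout
  putStrLn ("stdin:  " ++ describe inEnc)
  putStrLn ("stdout: " ++ describe outEnc)
```

The same hGetEncoding call can be made on any other Handle, such as the one passed to handleClient, to see whether it differs from the locale default.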
Try running hSetEncoding handle utf8 and see if it fixes your problem.
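To show what that call does in isolation, here is a minimal sketch that sets the encoding on file handles explicitly, so the result does not depend on the locale. The helper names writeUtf8 and readUtf8 and the scratch file name are assumptions made for the example; only System.IO from base is used.

```haskell
module Main where

import System.IO

-- | Hypothetical helper: write a String to a file as UTF-8, whatever the locale.
writeUtf8 :: FilePath -> String -> IO ()
writeUtf8 path str =
  withFile path WriteMode $ \h -> do
    hSetEncoding h utf8
    hPutStr h str

-- | Hypothetical helper: read a whole file, decoding it as UTF-8.
readUtf8 :: FilePath -> IO String
readUtf8 path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h utf8
    contents <- hGetContents h
    -- Force the lazy String before withFile closes the handle.
    length contents `seq` return contents

main :: IO ()
main = do
  let path = "omatase-demo.txt"  -- hypothetical scratch file
  writeUtf8 path "お待たせしました\n"
  text <- readUtf8 path
  print (length text)       -- counts characters, not bytes
  hSetEncoding stdout utf8  -- make the output encoding explicit as well
  putStr text
```

hSetEncoding accepts any Handle, so the same call could be applied to a handle obtained from a socket before reading lines from it with System.IO.hGetLine.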