The Haskell 2010 Language Report says:
Haskell uses the Unicode [2] character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell.
Does this mean UTF-8?
In ghc-7.0.4/compiler/parser/Lexer.x.source:
$unispace = \x05 -- Trick Alex into handling Unicode. See alexGetChar.
$whitechar = [\ \n\r\f\v $unispace]
$white_no_nl = $whitechar # \n
$tab = \t
$ascdigit = 0-9
$unidigit = \x03 -- Trick Alex into handling Unicode. See alexGetChar.
$decdigit = $ascdigit -- for now, should really be $digit (ToDo)
$digit = [$ascdigit $unidigit]
$special = [\(\)\,\;\[\]\`\{\}]
$ascsymbol = [\!\#\$\%\&\*\+\.\/\<\=\>\?\@\\\^\|\-\~]
$unisymbol = \x04 -- Trick Alex into handling Unicode. See alexGetChar.
$symbol = [$ascsymbol $unisymbol] # [$special \_\:\"\']
$unilarge = \x01 -- Trick Alex into handling Unicode. See alexGetChar.
$asclarge = [A-Z]
$large = [$asclarge $unilarge]
$unismall = \x02 -- Trick Alex into handling Unicode. See alexGetChar.
$ascsmall = [a-z]
$small = [$ascsmall $unismall \_]
$unigraphic = \x06 -- Trick Alex into handling Unicode. See alexGetChar.
$graphic = [$small $large $symbol $digit $special $unigraphic \:\"\']
...I'm not sure what to make of this. alexGetChar wasn't really helpful.
While the Haskell standard simply says Unicode, the set of possible characters (as opposed to, e.g., ASCII or Latin-1), it doesn't specify which of the several different encodings (UTF-8, UTF-16, UTF-32, byte order) to use.
Alex, the lexer generator that comes with the Haskell Platform, requires its input to be UTF-8 encoded*, which is why you see the code you mention. In practice, I think all the major implementations of Haskell require source to be UTF-8.
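Roughly, the trick mentioned in those comments works as follows: Alex's character classes can only describe byte-sized character sets, so GHC's alexGetChar hands Alex a placeholder character in the \x01..\x06 range for anything outside ASCII, chosen by the character's Unicode general category, while keeping the real character around for building the token text. A minimal sketch of that classification (my own illustration, not GHC's actual code):

import Data.Char (GeneralCategory (..), generalCategory, ord)

-- Sketch only: map every non-ASCII character to one of the pseudo-characters
-- that the Alex rules quoted above can actually match on.
classify :: Char -> Char
classify c
  | ord c <= 0x7f = c                  -- plain ASCII is handled directly
  | otherwise = case generalCategory c of
      UppercaseLetter -> '\x01'        -- $unilarge
      TitlecaseLetter -> '\x01'
      LowercaseLetter -> '\x02'        -- $unismall
      DecimalNumber   -> '\x03'        -- $unidigit
      MathSymbol      -> '\x04'        -- $unisymbol
      CurrencySymbol  -> '\x04'
      Space           -> '\x05'        -- $unispace
      _               -> '\x06'        -- $unigraphic (everything else)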
* This is actually a real problem, as GHC stores strings, and more importantly Data.Text, internally as UTF-16. It would be nice to be able to lex these directly rather than converting back and forth.

There was a proposal to standardize on UTF-8 as the standard encoding of Haskell source files, but I'm not sure whether it was accepted or not.
In practice, GHC assumes all input files are UTF-8, but it ignores malformed byte sequences in comments.
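For instance, a module like the following is accepted by GHC only if the file is saved as UTF-8 (a made-up example; the non-ASCII names are legal because 'ä' is a lowercase letter and '≡' is a Unicode math symbol, falling into the report's $small and $symbol classes):

module Main where

-- A Unicode identifier and a Unicode operator, both allowed by the report.
päivä :: String
päivä = "day"

(≡) :: Eq a => a -> a -> Bool
(≡) = (==)

main :: IO ()
main = print (päivä ≡ "day")   -- prints True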
Unicode is a character set. UTF-8, UTF-16, etc. are concrete physical encodings of Unicode codepoints. Try reading here; the difference is explained pretty well there.
The cited part of the report just states that Haskell sources use the Unicode character set. It doesn't state which encoding should be used at all. In other words, it says which characters can appear in the sources, but not how they should be written in terms of plain bytes.
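To make the distinction concrete, here is a small sketch (using the text and bytestring packages; the string itself is arbitrary) that serializes the same codepoints under two different encodings:

import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  let s = T.pack "λ → x"                  -- a fixed sequence of Unicode codepoints
  print (BS.unpack (TE.encodeUtf8 s))     -- UTF-8 bytes,    e.g. [206,187,...]
  print (BS.unpack (TE.encodeUtf16LE s))  -- UTF-16LE bytes, e.g. [187,3,...]

The characters are identical in both cases; only the byte-level representation differs.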