Newline control characters in multi-byte character

Posted 2019-02-24 08:31

Question:

I have some Perl code that translates carriage returns and line feeds to a normalized form. The input text is Japanese, so it contains multi-byte characters.

Is it still possible to do this transformation on a byte-by-byte basis (which I think it currently does), or do I have to detect the character set and enable Unicode support? In other words, do the popular encodings (Shift-JIS, EUC-JP, UTF-8, ISO-2022-JP) use bytes inside their multi-byte characters that could be mistaken for ASCII control characters?

I need only CR and LF to work.

Update: Added ISO-2022-JP. And that is the one that looks the most troublesome with its funky escape sequences ...

Answer 1:

None of the four encodings you mention (Shift-JIS, UTF-8, EUC-JP, ISO-2022-JP) uses the CR or LF byte inside Japanese characters. For UTF-8 and EUC-JP, there is no overlap whatsoever between ASCII bytes and the bytes inside Japanese characters. For Shift-JIS and ISO-2022-JP there is overlap, but not in the range where you find CR and LF.

For ISO-2022-JP,
First-byte range: 0x21 - 0x7E
Second-byte range: 0x21 - 0x7E

And the escape sequence characters to switch back and forth between various character sets are:

0x1B, 0x28, 0x24, 0x40, 0x42, and 0x4A

As you can see, none of the characters used to encode Japanese characters in ISO-2022-JP overlap with CR or LF.
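
To see the escape sequences and byte ranges in practice, here is a minimal sketch. It uses Python rather than the asker's Perl and relies on Python's built-in `iso2022_jp` codec; the helper names are made up for illustration.

```python
def iso2022jp_bytes(text):
    """Encode text with Python's built-in iso2022_jp codec."""
    return text.encode("iso2022_jp")


def contains_cr_or_lf(data):
    """True if the byte string contains 0x0D (CR) or 0x0A (LF)."""
    return 0x0D in data or 0x0A in data


encoded = iso2022jp_bytes("日本語")

# The codec switches to JIS X 0208 with ESC $ B and back to ASCII with ESC ( B.
print(encoded[:3], encoded[-3:])

# The double-byte payload between the two escape sequences stays in 0x21-0x7E.
payload = encoded[3:-3]
print(all(0x21 <= b <= 0x7E for b in payload))

# No CR or LF byte appears unless the text itself contains a line break.
print(contains_cr_or_lf(encoded))
```

When the text does contain a line break, the codec returns to ASCII before writing it, so the CR and LF bytes in the output correspond one-to-one to the line breaks in the text.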

For Shift-JIS,
First-byte range: 0x81 - 0x9F, 0xE0 - 0xEF
Second-byte range: 0x40 - 0x7E, 0x80 - 0xFC
Half-width katakana: 0xA1 - 0xDF

Again, there is no overlap with CR and LF.
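
The overlap with ASCII in Shift-JIS is easy to demonstrate. The following Python sketch (the helper name is hypothetical, and it assumes Python's built-in `shift_jis` codec) lists characters whose second byte falls in the ASCII range; those bytes start at 0x40, well above CR (0x0D) and LF (0x0A).

```python
def sjis_low_trail_bytes(text):
    """Return (character, trail byte) pairs whose second byte is below 0x80."""
    hits = []
    for ch in text:
        data = ch.encode("shift_jis")
        if len(data) == 2 and data[1] < 0x80:
            hits.append((ch, data[1]))
    return hits


# Both characters end in 0x5C, the byte ASCII uses for the backslash.
for ch, trail in sjis_low_trail_bytes("ソ表"):
    print(ch, hex(trail))
```

So a byte-wise search for a backslash can misfire on Shift-JIS text, but a byte-wise search for CR or LF cannot.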



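Answer 2:

All of those encodings represent the ASCII control characters, including CR (0x0D) and LF (0x0A), as the same single bytes that ASCII uses, and none of them reuses those byte values inside a multi-byte character. (Shift-JIS is not fully identical to ASCII in the printable range: 0x5C and 0x7E map to the yen sign and the overline in JIS X 0201, but that does not affect CR and LF.) You shouldn't have any problem.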
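
A byte-level normalization along these lines could look like the sketch below. It is written in Python rather than Perl, the function name and the choice of LF as the target are assumptions, and the check at the end simply compares the byte-level result with decoding first and normalizing afterwards.

```python
import re

# Match CRLF first so it becomes one line break rather than two.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def normalize_newlines(data, newline=b"\n"):
    """Replace CRLF, lone CR, and lone LF in raw bytes with `newline`."""
    return _LINE_BREAK.sub(lambda match: newline, data)


text = "一行目\r\n二行目\r三行目\n"
expected = "一行目\n二行目\n三行目\n"

for encoding in ("shift_jis", "euc_jp", "utf_8", "iso2022_jp"):
    raw = text.encode(encoding)
    # Normalize without ever decoding, then decode only to compare.
    result = normalize_newlines(raw).decode(encoding)
    print(encoding, result == expected)
```

The function never needs to know which of the four encodings it was given, which is the point of the answer above.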



Answer 3:

ISO-2022-JP uses escape sequences (rather than Shift-In/Shift-Out, which other ISO 2022 variants use) to assign different meanings to the 94 printable ASCII characters, leaving the control characters, including CR and LF, untouched.



Answer 4:

Here is the (normative) detail on UTF-8 encoding: «[…] the values 0x00..0x7F do not appear in any byte for the representation of any other Unicode code point […].» — from «The Unicode® Standard — Version 11.0 – Core Specification» — June 2018 — https://www.unicode.org/versions/Unicode11.0.0/UnicodeStandard-11.0.pdf
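
The quoted property can also be checked by brute force. This is only an illustrative Python sketch with a made-up function name; it encodes every code point in a range and confirms that no byte below 0x80 shows up.

```python
def utf8_non_ascii_bytes_are_high(first=0x80, last=0x10FFFF):
    """Check that code points in [first, last] encode only to bytes >= 0x80."""
    for cp in range(first, last + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        if any(b < 0x80 for b in chr(cp).encode("utf-8")):
            return False
    return True


# Restricting the run to the Basic Multilingual Plane keeps it fast;
# calling the function with its defaults covers every code point.
print(utf8_non_ascii_bytes_are_high(0x80, 0xFFFF))
```

Since CR and LF are 0x0D and 0x0A, they can only ever appear in UTF-8 as themselves.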