How can I easily create a mapping from a UTF-8 bytestream to a Unicode codepoint array? To clarify, if for example I have the byte sequence:
c3 a5 76 aa e2 82 ac
The mapping should produce two arrays of the same length; one with UTF-8 byte sequences, and the other with the corresponding Unicode codepoint. Then, the arrays could be printed side-by-side like:
UTF8         UNICODE
----------------------------------------
C3 A5        000000E5
76           00000076
AA           0000FFFD
E2 82 AC     000020AC
A solution that works with streams:
The above only returns U+FFFD for what Encode::decode('UTF-8', $bytes) considers ill-formed. In other words, it only returns U+FFFD when it encounters one of the following:
Post-decoding checks are still needed to return U+FFFD for what Encode::decode('UTF-8', $bytes) considers otherwise illegal.
Here is a way to do it (the script takes the byte sequence as the first command line argument):
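A minimal sketch of one such script, assuming the documented Encode::FB_QUIET check mode rather than the incremental API. With FB_QUIET, decode() returns the well-formed prefix and leaves the unprocessed remainder in the source scalar, so each byte it stops on can be mapped to U+FFFD. The helper name map_utf8_to_codepoints and the fallback to the sample bytes when no argument is given are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_QUIET);

# Hypothetical helper: split raw bytes into per-character UTF-8 sequences
# and the matching code points. Bytes that strict UTF-8 decoding rejects
# are mapped, one byte at a time, to U+FFFD.
sub map_utf8_to_codepoints {
    my ($bytes) = @_;    # private copy; decode() consumes it below
    my (@utf8, @codepoints);

    while (length $bytes) {
        # With FB_QUIET, decode() returns the well-formed prefix and
        # leaves the unprocessed remainder in $bytes.
        my $chars = decode('UTF-8', $bytes, FB_QUIET);
        $chars = '' unless defined $chars;

        for my $char (split //, $chars) {
            push @utf8,       encode('UTF-8', $char);
            push @codepoints, ord($char);
        }

        # Anything still left starts with a byte that could not be decoded.
        if (length $bytes) {
            push @utf8,       substr($bytes, 0, 1, '');
            push @codepoints, 0xFFFD;
        }
    }
    return (\@utf8, \@codepoints);
}

# Hex bytes from the command line, e.g. "c3 a5 76 aa e2 82 ac".
# Assumption: fall back to the sample sequence when no argument is given.
my $hex = @ARGV ? join('', @ARGV) : 'c3a576aae282ac';
$hex =~ s/\s+//g;
my ($utf8, $codepoints) = map_utf8_to_codepoints(pack('H*', $hex));

printf "%-12s %s\n", 'UTF8', 'UNICODE';
print '-' x 40, "\n";
for my $i (0 .. $#$utf8) {
    my $seq = join ' ', map { sprintf('%02X', ord($_)) } split //, $utf8->[$i];
    printf "%-12s %08X\n", $seq, $codepoints->[$i];
}
```

This sketch replaces every rejected byte individually, so a truncated multi-byte sequence yields one U+FFFD per byte; other replacement policies are possible.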
Encode has an API for incremental decoding, but it's undocumented, so your mileage may vary. It's used by subclasses of Encode::Encoding and by PerlIO::encoding. As with any undocumented API, it's subject to change at any time. There has been an effort to document the API.
Output: