RE error: illegal byte sequence on Mac OS X

2019-01-02 14:43发布

问题:

I'm trying to replace a string in a Makefile on Mac OS X for cross-compiling to iOS. The string has embedded double quotes. The command is:

sed -i "" 's|"iphoneos-cross","llvm-gcc:-O3|"iphoneos-cross","clang:-Os|g' Configure

And the error is:

sed: RE error: illegal byte sequence

I've tried escaping the double quotes, commas, dashes, and colons with no joy. For example:

sed -i "" 's|\"iphoneos-cross\"\,\"llvm-gcc\:\-O3|\"iphoneos-cross\"\,\"clang\:\-Os|g' Configure

I'm having a heck of a time debugging the issue. Does anyone know how to get sed to print the position of the illegal byte sequence? Or does anyone know what the illegal byte sequence is?

回答1:

A sample command that exhibits the symptom: sed 's/./@/' <<<$'\xfc' fails, because byte 0xfc is not a valid UTF-8 char.
Note that, by contrast, GNU sed (Linux, but also installable on macOS) simply passes the invalid byte through, without reporting an error.

Using the formerly accepted answer is an option if you don't mind losing support for your true locale (if you're on a US system and you never need to deal with foreign characters, that may be fine.)

However, the same effect can be had ad-hoc for a single command only:

LC_ALL=C sed -i "" 's|"iphoneos-cross","llvm-gcc:-O3|"iphoneos-cross","clang:-Os|g' Configure

Note: What matters is an effective LC_CTYPE setting of C, so LC_CTYPE=C sed ... would normally also work, but if LC_ALL happens to be set (to something other than C), it will override individual LC_*-category variables such as LC_CTYPE. Thus, the most robust approach is to set LC_ALL.

However, (effectively) setting LC_CTYPE to C treats strings as if each byte were its own character (no interpretation based on encoding rules is performed), with no regard for the - multibyte-on-demand - UTF-8 encoding that OS X employs by default, where foreign characters have multibyte encodings.

In a nutshell: setting LC_CTYPE to C causes the shell and utilities to only recognize basic English letters as letters (the ones in the 7-bit ASCII range), so that foreign chars. will not be treated as letters, causing, for instance, upper-/lowercase conversions to fail.

Again, this may be fine if you needn't match multibyte-encoded characters such as é, and simply want to pass such characters through.

If this is insufficient and/or you want to understand the cause of the original error (including determining what input bytes caused the problem) and perform encoding conversions on demand, read on below.


The problem is that the input file's encoding does not match the shell's.
More specifically, the input file contains characters encoded in a way that is not valid in UTF-8 (as @Klas Lindbäck stated in a comment) - that's what the sed error message is trying to say by invalid byte sequence.

Most likely, your input file uses a single-byte 8-bit encoding such as ISO-8859-1, frequently used to encode "Western European" languages.

Example:

The accented letter à has Unicode codepoint 0xE0 (224) - the same as in ISO-8859-1. However, due to the nature of UTF-8 encoding, this single codepoint is represented as 2 bytes - 0xC3 0xA0, whereas trying to pass the single byte 0xE0 is invalid under UTF-8.

Here's a demonstration of the problem using the string voilà encoded as ISO-8859-1, with the à represented as one byte (via an ANSI-C-quoted bash string ($'...') that uses \x{e0} to create the byte):

Note that the sed command is effectively a no-op that simply passes the input through, but we need it to provoke the error:

  # -> 'illegal byte sequence': byte 0xE0 is not a valid char.
sed 's/.*/&/' <<<$'voil\x{e0}'

To simply ignore the problem, the above LCTYPE=C approach can be used:

  # No error, bytes are passed through ('á' will render as '?', though).
LC_CTYPE=C sed 's/.*/&/' <<<$'voil\x{e0}'

If you want to determine which parts of the input cause the problem, try the following:

  # Convert bytes in the 8-bit range (high bit set) to hex. representation.
  # -> 'voil\x{e0}'
iconv -f ASCII --byte-subst='\x{%02x}' <<<$'voil\x{e0}'

The output will show you all bytes that have the high bit set (bytes that exceed the 7-bit ASCII range) in hexadecimal form. (Note, however, that that also includes correctly encoded UTF-8 multibyte sequences - a more sophisticated approach would be needed to specifically identify invalid-in-UTF-8 bytes.)


Performing encoding conversions on demand:

Standard utility iconv can be used to convert to (-t) and/or from (-f) encodings; iconv -l lists all supported ones.

Examples:

Convert FROM ISO-8859-1 to the encoding in effect in the shell (based on LC_CTYPE, which is UTF-8-based by default), building on the above example:

  # Converts to UTF-8; output renders correctly as 'voilà'
sed 's/.*/&/' <<<"$(iconv -f ISO-8859-1 <<<$'voil\x{e0}')"

Note that this conversion allows you to properly match foreign characters:

  # Correctly matches 'à' and replaces it with 'ü': -> 'voilü'
sed 's/à/ü/' <<<"$(iconv -f ISO-8859-1 <<<$'voil\x{e0}')"

To convert the input BACK to ISO-8859-1 after processing, simply pipe the result to another iconv command:

sed 's/à/ü/' <<<"$(iconv -f ISO-8859-1 <<<$'voil\x{e0}')" | iconv -t ISO-8859-1


回答2:

Add the following lines to your ~/.bash_profile or ~/.zshrc file(s).

export LC_CTYPE=C 
export LANG=C


回答3:

mklement0's answer is great, but I have some small tweaks.

It seems like a good idea to explicitly specify bash's encoding when using iconv. Also, we should prepend a byte-order mark (even though the unicode standard doesn't recommend it) because there can be legitimate confusions between UTF-8 and ASCII without a byte-order mark. Unfortunately, iconv doesn't prepend a byte-order mark when you explicitly specify an endianness (UTF-16BE or UTF-16LE), so we need to use UTF-16, which uses platform-specific endianness, and then use file --mime-encoding to discover the true endianness iconv used.

(I uppercase all my encodings because when you list all of iconv's supported encodings with iconv -l they are all uppercase.)

# Find out MY_FILE's encoding
# We'll convert back to this at the end
FILE_ENCODING="$( file --brief --mime-encoding MY_FILE )"
# Find out bash's encoding, with which we should encode
# MY_FILE so sed doesn't fail with 
# sed: RE error: illegal byte sequence
BASH_ENCODING="$( locale charmap | tr [:lower:] [:upper:] )"
# Convert to UTF-16 (unknown endianness) so iconv ensures
# we have a byte-order mark
iconv -f "$FILE_ENCODING" -t UTF-16 MY_FILE > MY_FILE.utf16_encoding
# Whether we're using UTF-16BE or UTF-16LE
UTF16_ENCODING="$( file --brief --mime-encoding MY_FILE.utf16_encoding )"
# Now we can use MY_FILE.bash_encoding with sed
iconv -f "$UTF16_ENCODING" -t "$BASH_ENCODING" MY_FILE.utf16_encoding > MY_FILE.bash_encoding
# sed!
sed 's/.*/&/' MY_FILE.bash_encoding > MY_FILE_SEDDED.bash_encoding
# now convert MY_FILE_SEDDED.bash_encoding back to its original encoding
iconv -f "$BASH_ENCODING" -t "$FILE_ENCODING" MY_FILE_SEDDED.bash_encoding > MY_FILE_SEDDED
# Now MY_FILE_SEDDED has been processed by sed, and is in the same encoding as MY_FILE


回答4:

My workaround had been using Perl:

find . -type f -print0 | xargs -0 perl -pi -e 's/was/now/g'


回答5:

My workaround had been using gnu sed. Worked fine for my purposes.



标签: