In IRB, I'm trying the following:
1.9.3p194 :001 > foo = "\xBF".encode("utf-8", :invalid => :replace, :undef => :replace)
=> "\xBF"
1.9.3p194 :002 > foo.match /foo/
ArgumentError: invalid byte sequence in UTF-8
from (irb):2:in `match'
Any ideas what's going wrong?
If you're only working with ASCII characters, you can use:
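A minimal sketch of one way to do this, assuming the approach meant here is to transcode from a binary source encoding so that every non-ASCII byte becomes the replacement character (the sample string is illustrative):

```ruby
raw = "Hello \xBF World!"

# Treat the bytes as binary and transcode to UTF-8. Bytes above 0x7F have
# no mapping from binary to UTF-8, so :undef => :replace swaps each one
# for U+FFFD.
cleaned = raw.encode("utf-8", "binary", :invalid => :replace, :undef => :replace)

cleaned.valid_encoding?  # => true
cleaned.match(/World/)   # matches, no ArgumentError
```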
But what happens if we use the same approach with valid UTF-8 characters that are invalid in ASCII?
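A sketch of the failure, assuming the same binary-source transcoding as above. Every byte of a multi-byte UTF-8 character is replaced separately, so the accented characters are lost along with the bad byte:

```ruby
# encoding: utf-8
# (the magic comment is needed on Ruby 1.9; UTF-8 is the default source
# encoding from Ruby 2.0 on)
text = "¡Hace \xBF mucho frío!"

mangled = text.encode("utf-8", "binary", :invalid => :replace, :undef => :replace)

# "¡" and "í" are two bytes each in UTF-8, and each byte is replaced on its
# own, so five replacement characters appear instead of one.
mangled.count("\uFFFD")  # => 5
```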
Uh oh! We want frío to remain with the accent. Here's an option that keeps the valid UTF-8 characters:
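One way to do this, sketched under the assumption that dropping the bad bytes entirely is acceptable, is to filter the string character by character:

```ruby
# encoding: utf-8
# (the magic comment is needed on Ruby 1.9; UTF-8 is the default source
# encoding from Ruby 2.0 on)
text = "¡Hace \xBF mucho frío!"

# String#chars yields each invalid byte as its own one-byte string, so
# valid_encoding? is false for exactly those pieces.
kept = text.chars.select(&:valid_encoding?).join

kept  # => "¡Hace  mucho frío!" (two spaces where the bad byte was)
```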
Also, in Ruby 2.1 there is a new method called `scrub` that solves this problem.
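A short sketch of `scrub`, assuming Ruby 2.1 or later; the sample string is illustrative:

```ruby
text = "¡Hace \xBF mucho frío!"

# With no argument, each invalid byte sequence becomes U+FFFD.
replaced = text.scrub     # => "¡Hace \uFFFD mucho frío!"

# With an argument, that string is used as the replacement instead.
removed = text.scrub("")  # => "¡Hace  mucho frío!"
```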
`"\xBF"` probably already thinks it is encoded in UTF-8, so when you call `encode`, it thinks you're trying to encode a UTF-8 string in UTF-8 and does nothing. `\xBF` isn't valid UTF-8, so this is, of course, nonsense. But there is a three-argument form of `encode` that takes the destination encoding, the source encoding, and then the options.
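A sketch of the argument order in the three-argument form. The Latin-1 input is a hypothetical example chosen to make the source encoding visible; it is not part of the question:

```ruby
s = "\xBF"
s.encoding         # Encoding::UTF-8 in a UTF-8 source file or IRB session
s.valid_encoding?  # false there: 0xBF is a continuation byte with no lead byte

# encode(dst_encoding, src_encoding, options): the second argument overrides
# whatever encoding the string is currently tagged with.
latin1_bytes = "fr\xEDo"  # hypothetical input: 0xED is "í" in ISO-8859-1
as_utf8 = latin1_bytes.encode("utf-8", "iso-8859-1")

as_utf8  # => "frío"
```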
You can force the issue by telling `encode` to ignore what the string thinks its encoding is and treat it as binary data:
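A minimal sketch, assuming `"binary"` is passed as the source encoding and the question's replace options are kept:

```ruby
s = "\xBF"

# Passing "binary" as the source encoding means the UTF-8 tag on s is
# ignored, and the byte is replaced because it has no UTF-8 mapping.
foo = s.encode("utf-8", "binary", :invalid => :replace, :undef => :replace)

foo.valid_encoding?  # => true
foo.match(/foo/)     # => nil, and no ArgumentError
```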
`s` is the `"\xBF"` that thinks it is UTF-8 from above.
You could also use `force_encoding` on `s` to force it to be binary and then use the two-argument `encode`:
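A sketch of that two-step variant. The `dup` is an addition here so the example also works where string literals are frozen:

```ruby
s = "\xBF".dup

# force_encoding only changes the encoding tag; the bytes stay the same.
s.force_encoding("binary")

# The source encoding now comes from the string itself, so the two-argument
# form is enough.
foo = s.encode("utf-8", :invalid => :replace, :undef => :replace)

foo.valid_encoding?  # => true
```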
This is fixed if you read the source text file in using an explicit encoding (code page):
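A minimal sketch of reading with an explicit external encoding. The temporary file, its contents, and the choice of ISO-8859-1 are assumptions made for illustration, since the right code page depends on where the file came from:

```ruby
require "tempfile"

# Write a small sample file containing Latin-1 bytes (0xED is "í").
sample = Tempfile.new("latin1-sample")
sample.binmode
sample.write("fr\xEDo".b)
sample.close

# "r:iso-8859-1:utf-8" means: read as ISO-8859-1 (external encoding) and
# transcode to UTF-8 (internal encoding) while reading.
text = File.open(sample.path, "r:iso-8859-1:utf-8") { |io| io.read }
sample.unlink

text.valid_encoding?  # => true
text.match(/fr/)      # matches, no ArgumentError
```

With only `"r:iso-8859-1"` the string stays tagged as ISO-8859-1 instead of being converted, which also avoids the invalid-byte error as long as the rest of the program can work with that encoding.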