How to convert character encoding with Ruby 1.9

Posted 2020-02-26 08:49

Question:

I am currently having trouble with results from the Amazon API.

The service returns a string with Unicode characters: Learn Objective\xE2\x80\x93C on the Mac (Learn Series)

With Ruby 1.9.1 the string cannot even be processed:

REXML::ParseException: #<Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)>

...

Exception parsing

Line: 1

Position: 1636

Last 80 unconsumed characters:

Learn Objective–C on the Mac (Learn Series)

Answer 1:

As the exception points out, your string is ASCII-8BIT encoded, so you need to change its encoding. There is a long story behind that, but if you just want a quick solution, call force_encoding on the string before you do any processing:

s = "Learn Objective\xE2\x80\x93C on the Mac"
# => "Learn Objective\xE2\x80\x93C on the Mac"
s.encoding
# => #<Encoding:ASCII-8BIT>
s.force_encoding 'utf-8'
# => "Learn Objective–C on the Mac"
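Note that force_encoding only relabels the bytes; it does not check or convert them. A minimal sketch of a guarded version, assuming you want to fail loudly when the bytes are not really UTF-8 (the helper name to_utf8 is made up for illustration, it is not part of Ruby or REXML):

```ruby
# Hypothetical helper: relabel bytes as UTF-8, but only accept them
# if they actually form valid UTF-8.
def to_utf8(bytes)
  # dup so the caller's string keeps its original encoding label
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "bytes are not valid UTF-8" unless text.valid_encoding?
  text
end

# Simulate what the service hands back: UTF-8 bytes labeled as binary.
raw = "Learn Objective\xE2\x80\x93C on the Mac".dup.force_encoding(Encoding::BINARY)

title = to_utf8(raw)
puts title.encoding   # the label is now UTF-8
puts title.length     # two less than raw.bytesize: the 3-byte dash is one character
```

The byte content is unchanged; only the way Ruby interprets it differs, which is why this is cheap and why it is only safe when the data really is UTF-8.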


Answer 2:

Mladen's solution works if everything that is encoded in ASCII-8BIT can actually be converted directly to UTF-8. It breaks when there are characters that are 1) invalid, or 2) undefined in UTF-8. However, this will work (in 1.9.2 and up):

new_str = s.encode('utf-8', 'binary', :invalid => :replace,
  :undef => :replace, :replace => '')

ASCII-8BIT is effectively binary. This code converts the encoding to UTF-8, while properly dealing with invalid and undefined characters. The :invalid option specifies that invalid characters be replaced. The :undef option specifies that undefined characters be replaced. And the :replace option defines what the invalid or undefined characters should be replaced with. In this case, I opted to simply remove them.
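One thing worth checking before relying on this: when the source encoding is binary, bytes above 0x7F have no mapping to UTF-8, so they count as undefined and are removed, even when they would have formed a valid UTF-8 character. A small sketch comparing the two answers on the title from the question (the variable names are only for illustration):

```ruby
raw = "Learn Objective\xE2\x80\x93C on the Mac".dup.force_encoding(Encoding::BINARY)

# Answer 2: transcode from binary, dropping anything that cannot be mapped.
stripped = raw.encode('utf-8', 'binary', :invalid => :replace,
  :undef => :replace, :replace => '')

# Answer 1: keep the bytes and only change the encoding label.
relabeled = raw.dup.force_encoding('utf-8')

puts stripped    # => Learn ObjectiveC on the Mac  (the dash is gone)
puts relabeled   # => Learn Objective–C on the Mac
```

So the encode approach is the safer choice when the input may be garbage, at the cost of losing non-ASCII characters; force_encoding keeps them but assumes the input is well-formed.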