I am currently having trouble with results from the Amazon API.
The service returns a string with Unicode characters: Learn Objective\xE2\x80\x93C on the Mac (Learn Series)
With Ruby 1.9.1, the string cannot even be processed:
REXML::ParseException: #<Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)>
...
Exception parsing
Line: 1
Position: 1636
Last 80 unconsumed characters:
Learn Objective–C on the Mac (Learn Series)
As the exception points out, your string is ASCII-8BIT encoded, so you need to change its encoding. There is a long story behind that, but if you just want a quick solution, call force_encoding on the string before you do any processing:
s = "Learn Objective\xE2\x80\x93C on the Mac"
# => "Learn Objective\xE2\x80\x93C on the Mac"
s.encoding
# => #<Encoding:ASCII-8BIT>
s.force_encoding 'utf-8'
# => "Learn Objective–C on the Mac"
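Since force_encoding only relabels the string and never touches the bytes, it can be worth checking the result before handing it to a parser. Here is a minimal sketch of that check; the variable names raw and utf8 are made up for illustration, and String#b needs Ruby 2.0 or later:

```ruby
# Bytes as they might arrive from the network, tagged as binary (ASCII-8BIT).
# String#b returns a binary-tagged copy of the literal.
raw = "Learn Objective\xE2\x80\x93C on the Mac".b

# force_encoding changes only the encoding label; the bytes stay the same.
utf8 = raw.dup.force_encoding('UTF-8')

utf8.encoding          # the new label, Encoding::UTF_8
utf8.valid_encoding?   # true only if the bytes really are well-formed UTF-8
utf8.bytes == raw.bytes # true, nothing was converted
```

If valid_encoding? comes back false, the bytes were not UTF-8 to begin with and relabelling alone will not help.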
Mladen's solution works as long as the ASCII-8BIT bytes are already valid UTF-8. It breaks when the string contains byte sequences that are 1) invalid or 2) undefined in UTF-8. However, this will work (in 1.9.2 and up):
new_str = s.encode('utf-8', 'binary', :invalid => :replace,
                   :undef => :replace, :replace => '')
ASCII-8BIT is effectively binary. This code converts the string to UTF-8 and replaces anything that cannot be converted. The :invalid option specifies that invalid byte sequences be replaced, the :undef option specifies that undefined characters be replaced, and the :replace option defines what they are replaced with. In this case, I opted to simply remove them. Be aware that when converting from binary, every byte above 0x7F counts as undefined, so non-ASCII characters such as the dash in the question's title are removed as well.
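To see what the conversion does to the title from the question, here is a small sketch. The trailing \xFF byte is added purely for illustration as a byte that can never appear in UTF-8. The second half is an alternative I am assuming may suit this case better, not part of the answer above: relabel the bytes with force_encoding and strip only the genuinely invalid ones with String#scrub (Ruby 2.1 and up).

```ruby
# The title bytes plus one stray byte (\xFF) that is never valid in UTF-8.
raw = "Learn Objective\xE2\x80\x93C on the Mac\xFF".b

# Converting from binary: every byte above 0x7F is undefined in this
# conversion, so the three bytes of the dash are dropped along with \xFF.
stripped = raw.encode('utf-8', 'binary', :invalid => :replace,
                      :undef => :replace, :replace => '')
# stripped is now "Learn ObjectiveC on the Mac"

# Alternative (Ruby 2.1+): treat the bytes as UTF-8 and remove only the
# sequences that are not well-formed. The dash survives, \xFF does not.
scrubbed = raw.dup.force_encoding('utf-8').scrub('')
# scrubbed is now "Learn Objective–C on the Mac"
```

Which one to use depends on whether you want plain ASCII out or want to keep whatever valid UTF-8 is already in there.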