I got an error JSON::GeneratorError: source sequence is illegal/malformed utf-8
when trying to convert a hash into json string. I am wondering if this has anything to do with encoding, and how can I make to_json just treat \xAE as it is?
$ irb
2.0.0-p247 :001 > require 'json'
=> true
2.0.0-p247 :002 > a = {"description"=> "iPhone\xAE"}
=> {"description"=>"iPhone\xAE"}
2.0.0-p247 :003 > a.to_json
JSON::GeneratorError: source sequence is illegal/malformed utf-8
from (irb):3:in `to_json'
from (irb):3
from /Users/cchen21/.rvm/rubies/ruby-2.0.0-p247/bin/irb:16:in `<main>'
\xAE
is not a valid character in UTF-8, you have to use \u00AE
instead:
"iPhone\u00AE"
#=> "iPhone®"
Or convert it accordingly:
"iPhone\xAE".force_encoding("ISO-8859-1").encode("UTF-8")
#=> "iPhone®"
Every string in Ruby has a underlaying encoding. Depending on your LANG
and LC_ALL
environment variables, the interactive shell might be executing and interpreting your strings in a given encoding.
$ irb
1.9.3p392 :008 > __ENCODING__
=> #<Encoding:UTF-8>
(ignore that I’m using Ruby 1.9 instead of 2.0, the ideas are still the same).
__ENCODING__
returns the current source encoding. Yours will probably also say UTF-8.
When you create literal strings and use byte escapes (the \xAE
) in your code, Ruby is trying to interpret that according to the string encoding:
1.9.3p392 :003 > a = {"description" => "iPhone\xAE"}
=> {"description"=>"iPhone\xAE"}
1.9.3p392 :004 > a["description"].encoding
=> #<Encoding:UTF-8>
So, the byte \xAE
at the end of your literal string will be tried to be treated as a UTF-8 stream byte, but it is invalid. See what happens when I try to print it:
1.9.3-p392 :001 > puts "iPhone\xAE"
iPhone�
=> nil
You either need to provide the registered mark character in a valid UTF-8 encoding (either using the real character, or providing the two UTF-8 bytes):
1.9.3-p392 :002 > a = {"description1" => "iPhone®", "description2" => "iPhone\xc2\xae"}
=> {"description1"=>"iPhone®", "description2"=>"iPhone®"}
1.9.3-p392 :005 > a.to_json
=> "{\"description1\":\"iPhone®\",\"description2\":\"iPhone®\"}"
Or, if your input is ISO-8859-1 (Latin 1) and you know it for sure, you can tell Ruby to interpret your string as another encoding:
1.9.3-p392 :006 > a = {"description1" => "iPhone\xAE".force_encoding('ISO-8859-1') }
=> {"description1"=>"iPhone\xAE"}
1.9.3-p392 :007 > a.to_json
=> "{\"description1\":\"iPhone®\"}"
Hope it helps.