Ruby to_json issue with error “illegal/malformed u

Posted 2020-02-10 04:30

Question:

I get the error JSON::GeneratorError: source sequence is illegal/malformed utf-8 when trying to convert a hash into a JSON string. Does this have anything to do with encoding, and how can I make to_json treat \xAE as is?

$ irb
2.0.0-p247 :001 > require 'json'
=> true
2.0.0-p247 :002 > a = {"description"=> "iPhone\xAE"}
=> {"description"=>"iPhone\xAE"}
2.0.0-p247 :003 > a.to_json
JSON::GeneratorError: source sequence is illegal/malformed utf-8
  from (irb):3:in `to_json'
  from (irb):3
  from /Users/cchen21/.rvm/rubies/ruby-2.0.0-p247/bin/irb:16:in `<main>'

Answer 1:

A lone \xAE byte is not a valid UTF-8 sequence. Use the Unicode escape \u00AE instead:

"iPhone\u00AE"
#=> "iPhone®"

Or, if the bytes are really ISO-8859-1, transcode the string to UTF-8:

"iPhone\xAE".force_encoding("ISO-8859-1").encode("UTF-8")
#=> "iPhone®"
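
As a minimal sketch, the conversion above can be wrapped in a helper that is applied before calling to_json. The method name to_utf8_assuming_latin1 is hypothetical, and the sketch assumes that any string which is not already valid UTF-8 holds ISO-8859-1 bytes, which may not be true for your data:

```ruby
require 'json'

# Hypothetical helper (not part of Ruby or the json library).
# Assumption: a string that is not valid UTF-8 contains ISO-8859-1 bytes.
def to_utf8_assuming_latin1(str)
  # Already valid UTF-8: nothing to do.
  return str if str.encoding == Encoding::UTF_8 && str.valid_encoding?

  # Relabel a copy of the bytes as Latin-1, then transcode them to UTF-8.
  str.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

fixed = to_utf8_assuming_latin1("iPhone\xAE")
json  = { "description" => fixed }.to_json

puts fixed
puts json
```

Strings that are already valid UTF-8 pass through unchanged, so the helper can be applied to every value without double-converting.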


Answer 2:

Every string in Ruby has an underlying encoding. Depending on your LANG and LC_ALL environment variables, the interactive shell may read and interpret your strings in a particular encoding.

$ irb
1.9.3p392 :008 > __ENCODING__
 => #<Encoding:UTF-8>

(Ignore that I'm using Ruby 1.9 instead of 2.0; the ideas are the same.)

__ENCODING__ returns the current source encoding. Yours will probably also say UTF-8.

When you create string literals with byte escapes (such as \xAE) in your code, Ruby tags the resulting string with that encoding, whether or not the bytes are valid in it:

1.9.3p392 :003 > a = {"description" => "iPhone\xAE"}
 => {"description"=>"iPhone\xAE"}
1.9.3p392 :004 > a["description"].encoding
 => #<Encoding:UTF-8>

So Ruby treats the \xAE byte at the end of your string literal as part of a UTF-8 byte stream, where it is invalid on its own. See what happens when I try to print it:

1.9.3-p392 :001 > puts "iPhone\xAE"
iPhone�
 => nil
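
A short sketch of how to check for this before calling to_json, using String#valid_encoding?. It assumes the source encoding is UTF-8 (the default since Ruby 2.0); the variable names are only for illustration:

```ruby
bad  = "iPhone\xAE"    # lone 0xAE byte: not a complete UTF-8 sequence
good = "iPhone\u00AE"  # U+00AE, stored as the two bytes 0xC2 0xAE

bad_ok  = bad.valid_encoding?        # false
good_ok = good.valid_encoding?       # true
same    = good == "iPhone\xC2\xAE"   # true: both literals hold the same bytes

puts "bad:  valid=#{bad_ok} last byte=#{bad.bytes.last(1).inspect}"
puts "good: valid=#{good_ok} last bytes=#{good.bytes.last(2).inspect}"
```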

You need to either provide the registered sign (®) as valid UTF-8 (using the real character or its two UTF-8 bytes):

1.9.3-p392 :002 > a = {"description1" => "iPhone®", "description2" => "iPhone\xc2\xae"}
 => {"description1"=>"iPhone®", "description2"=>"iPhone®"}
1.9.3-p392 :005 > a.to_json
 => "{\"description1\":\"iPhone®\",\"description2\":\"iPhone®\"}"

Or, if you know for sure that your input is ISO-8859-1 (Latin-1), you can tell Ruby to interpret the string's bytes in that encoding:

1.9.3-p392 :006 > a = {"description1" => "iPhone\xAE".force_encoding('ISO-8859-1') }
 => {"description1"=>"iPhone\xAE"}
1.9.3-p392 :007 > a.to_json
 => "{\"description1\":\"iPhone®\"}"
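
It may help to see that force_encoding only changes the label on the string, while encode rewrites the bytes. The following is a small illustration with made-up variable names, again assuming the input really is ISO-8859-1:

```ruby
raw = "iPhone\xAE".dup                     # dup so the literal itself is not mutated

# force_encoding changes the label in place; the bytes stay exactly the same.
latin1 = raw.force_encoding(Encoding::ISO_8859_1)

# encode returns a new string whose bytes are transcoded to the target encoding.
utf8 = latin1.encode(Encoding::UTF_8)

puts "latin1: #{latin1.encoding} #{latin1.bytes.last(1).inspect}"
puts "utf8:   #{utf8.encoding} #{utf8.bytes.last(2).inspect}"
```

In ISO-8859-1 the single byte 0xAE is the registered sign, so the relabelled string is valid and can be transcoded to the two-byte UTF-8 form.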

Hope it helps.