Why does Ruby's JSON parser eat my backslash?

Posted 2019-07-23 20:13

Question:

The following example in JSON format contains one backslash, and if I run JSON.load, the backslash disappears:

JSON.load('{ "88694": { "regex": ".*?\. (CVE-2015-46055)" } }')
# => {"88694"=>{ "regex"=>".*?. (CVE-2015-46055)"}}

How can I keep the backslash?

My goal is to have this structure and, whenever I need to, read the file, load the JSON into a Hash, and search for those regular expressions.

UPDATE 1

Here is an example of what I want.

irb> "stack.overflow"[/.*?\./]
=> "stack."

I can't pass the regex from JSON to my string in order to catch that ".", because the "\." disappears.

Answer 1:

str = '{ "88694": { "regex": ".*?\. (CVE-2015-46055)" } }'
  #=> "{ \"88694\": { \"regex\": \".*?\\. (CVE-2015-46055)\" } }"

str.chars
  #=> ["{", " ", "\"", "8", "8", "6", "9", "4", "\"", ":", " ", "{", " ",
  #   "\"", "r", "e", "g", "e", "x", "\"", ":", " ", "\"", ".", "*", "?",
  #   "\\", ".",
  #   ~~~   ~~                                        
  #   " ", "(",..., "}", " ", "}"]

This shows us that str does indeed contain a backslash character followed by a period. The reason is that str is enclosed in single quotes. \. would only be treated as an escaped period if str were enclosed in double quotes:

 "{ '88694': { 'regex': '.*?\. (CVE-2015-46055)' } }".chars[25,3]
   #=> ["?", ".", " "] 

The return value shown for str is its inspect form, which displays the single-quoted string as a double-quoted string:

"{ \"88694\": { \"regex\": \".*?\\. (CVE-2015-46055)\" } }"

Here \\ represents one backslash character, and it is followed by a period. In a double-quoted string a period can be escaped, but this one is not: the \\ in front of it is an escaped backslash, that is, a literal backslash character, not an escape applied to the period.
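The difference between what a string contains and how irb displays it can be checked directly. The following is a minimal sketch; SAMPLE is a trimmed-down stand-in for str, introduced here only for illustration.

```ruby
# Trimmed-down variant of str, for illustration only.
SAMPLE = '{ "regex": ".*?\. x" }'

puts SAMPLE            # the characters themselves: { "regex": ".*?\. x" }
puts SAMPLE.inspect    # the escaped display form:  "{ \"regex\": \".*?\\. x\" }"

# Count the backslash characters in each form.
puts SAMPLE.chars.count("\\")           # 1
puts SAMPLE.inspect.chars.count("\\")   # 6 (four escaped quotes plus one escaped backslash)
```

The string holds a single backslash; the extra ones appear only in the inspect form that irb prints.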

Now let's add another backslash and see what happens:

str1 = '{ "88694": { "regex": ".*?\\. (CVE-2015-46055)" } }' 
str1.chars == str.chars
  #=> true

The result is the same because single-quoted strings support the escape sequence \\ (a single backslash) and only one other, \' (a single quote).

Now let's add a third backslash:

str2 = '{ "88694": { "regex": ".*?\\\. (CVE-2015-46055)" } }'   
str2.chars
  #=> ["{", " ", "\"", "8", "8", "6", "9", "4", "\"", ":", " ", "{", " ",
  #   "\"", "r", "e", "g", "e", "x", "\"", ":", " ", "\"", ".", "*", "?",
  #   "\\", "\\", ".",
  #   ~~~~  ~~~~  ~~~                                        
  #   " ", "(",..., "}", " ", "}"]

Surprised? \\ produces one backslash character (an escaped backslash in single quotes), the following \ produces a second backslash character (a plain backslash in single quotes), and . is a period.
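The three cases can be compared side by side on a short string. This is only a sketch; the constant names and the backslash_count helper are made up for the example.

```ruby
# Single-quoted literals recognise only two escapes: \\ and \'.
# Any other backslash is kept as a literal character.
ONE_SLASH    = 'a\.b'     # backslash kept:                 a \ . b
TWO_SLASHES  = 'a\\.b'    # \\ collapses to one backslash:  a \ . b
THREE        = 'a\\\.b'   # \\ then a literal backslash:    a \ \ . b
DOUBLE_QUOTE = "a\.b"     # double quotes drop it:          a . b

# Hypothetical helper: number of backslash characters in a string.
def backslash_count(s)
  s.chars.count("\\")
end

puts backslash_count(ONE_SLASH)     # 1
puts backslash_count(TWO_SLASHES)   # 1
puts backslash_count(THREE)         # 2
puts backslash_count(DOUBLE_QUOTE)  # 0
```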

We obtain:

s = {"88694"=>{"regex"=>".*?\\. (CVE-2015-46055)"}}.to_json

JSON.parse(str)
  #=> {"88694"=>{"regex"=>".*?. (CVE-2015-46055)"}} 
JSON.parse(str1)
  #=> {"88694"=>{"regex"=>".*?. (CVE-2015-46055)"}} 
JSON.parse(str2)
  #=> {"88694"=>{"regex"=>".*?\\. (CVE-2015-46055)"}} 

str2 is what we want, as

JSON.parse(str2)["88694"]["regex"].chars[2,4]   
  #=> ["?", "\\", ".", " "] 
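To connect this back to the question's "stack.overflow" example, the parsed value can be compiled with Regexp.new. The sketch below uses the shorter pattern from the question's update; DEMO_JSON, DEMO_RE and pattern_for are names invented for this illustration.

```ruby
require 'json'

# Same quoting as str2: inside single quotes, \\\. is \\ (one backslash)
# followed by \. (kept literally), so the JSON text contains \\.
DEMO_JSON = '{ "demo": { "regex": ".*?\\\." } }'

# Hypothetical helper: parse the JSON text and compile the stored pattern.
def pattern_for(json_text, key)
  Regexp.new(JSON.parse(json_text).fetch(key).fetch("regex"))
end

DEMO_RE = pattern_for(DEMO_JSON, "demo")

puts DEMO_RE.source              # prints .*?\.
puts "stack.overflow"[DEMO_RE]   # prints stack.
```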

We could alternatively work backwards:

js = {"88694"=>{"regex"=>".*?\\. (CVE-2015-46055)"}}.to_json
  #=> "{\"88694\":{\"regex\":\".*?\\\\. (CVE-2015-46055)\"}}" 

'{"88694":{"regex":".*?\\\. (CVE-2015-46055)"}}' == js
  #=> true

This string is the same as str2 after all spaces outside of quoted substrings have been removed.
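Since the stated goal is to read the JSON from a file, it may help to see that Ruby's string-literal quoting plays no part once the text lives on disk: the file only needs to contain \\. where the regex should contain \. . The following is a sketch under that assumption; round_trip_through_file is a hypothetical helper and the temp file stands in for the real data file.

```ruby
require 'json'
require 'tempfile'

# Hypothetical helper: writes a pattern table to a temp file as JSON,
# reads the raw text back, and returns [raw_text, parsed_hash].
def round_trip_through_file(patterns)
  Tempfile.create(["patterns", ".json"]) do |f|
    f.write(patterns.to_json)   # to_json doubles the backslash for us
    f.flush
    raw = File.read(f.path)
    [raw, JSON.parse(raw)]
  end
end

raw, parsed = round_trip_through_file(
  { "88694" => { "regex" => '.*?\. (CVE-2015-46055)' } }
)

puts raw                                 # file text holds \\. (two backslashes)
puts parsed["88694"]["regex"]            # parsed value holds \. (one backslash)
```

Letting to_json (or JSON.generate) write the file avoids counting backslashes by hand.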

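JSON treats two successive backslash characters inside a string as one backslash character, because the backslash is JSON's escape character and \\ is the escape sequence for a literal backslash. A lone \. is not an escape sequence that JSON defines; as the result of JSON.parse(str) above shows, the parser here simply drops that backslash. See @Jordan's comment.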

Perhaps a reader can elaborate on what JSON is doing here.
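One way to look at JSON's handling of the backslash in isolation is to encode and decode a single backslash character. This is a small sketch; the value is wrapped in an array only so that it also works with JSON versions that reject bare scalars, and the constant names are invented for the example.

```ruby
require 'json'

ONE_BACKSLASH = "\\"                       # a single backslash character
ENCODED = JSON.generate([ONE_BACKSLASH])   # JSON text with the backslash escaped
DECODED = JSON.parse(ENCODED).first        # back to the original character

puts ENCODED          # prints ["\\"]
puts ENCODED.length   # 6 characters: [ " \ \ " ]
puts DECODED.length   # 1
```

Generating doubles the backslash and parsing halves it again, which matches the behaviour seen with str2.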