Parsing specific JSON-like data (NextSTEP PList) f

Posted 2019-08-14 04:44

Question:

I'm writing a client for a third-party API that returns data in an odd format. At first glance it looks like JSON, but it isn't, and I'm not sure how I should handle it.

It's a key-value based format (much like JSON).

  • Keys are separated by '=' from their values.
  • Keys and values are wrapped within double-quotes.
  • Dictionaries start with '{' and end with '}'.
  • Arrays start with '(' and end with ')'.
  • Lines end with ';' (except for array contents) followed by an end-of-line character (\r, I think).
  • Strings sometimes contain Unicode escapes (such as \U2623 for the biohazard sign).
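
For reference, the rules above can be read as a small token grammar. The following is a minimal sketch of that reading, not code from the API or from any gem; the name tokenize_plist and the [:punct, ...] / [:string, ...] token shapes are made up for illustration, and escapes inside strings are left undecoded.

```ruby
require 'strscan'

# Hypothetical helper: split the text into tokens following the rules listed above.
def tokenize_plist(text)
  ss = StringScanner.new(text)
  tokens = []
  until ss.eos?
    if ss.skip(/\s+/)
      next                                  # whitespace and line ends carry no meaning
    elsif (punct = ss.scan(/[{}()=;,]/))
      tokens << [:punct, punct]             # structural characters
    elsif (quoted = ss.scan(/"(?:[^"\\]|\\.)*"/m))
      tokens << [:string, quoted[1..-2]]    # raw contents, escapes left untouched
    else
      raise ArgumentError, "unexpected character at offset #{ss.pos}"
    end
  end
  tokens
end

tokenize_plist('{ "a" = ( "1", "2" ); }').each { |kind, text| puts "#{kind} #{text}" }
```

A token stream like this would be enough to tell the format apart from JSON, where '=' , ';' and parentheses are not structural characters.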

What could this format be? Should I use an existing gem to parse it, or should I build my own parser?

{ "anArray" = (
  "100",
  "200",
  "300"
  );
  "aDictionary" = {
    "aString" = "Something";
  };
}

EDIT: This format seems to be an Apple property list, but it is neither XML nor binary. That makes sense, as the API is a WebObjects web service. I will try the CFPropertyList gem to parse it; if there is a better solution, please let me know.

EDIT 2: This is a NeXTSTEP property list.

Answer 1:

Here's a robust answer using a custom StringScanner-based parser. It treats whitespace as optional, accepts a trailing comma after the last item in a list, and allows the semicolon after the last dictionary key/value pair to be omitted. The outermost item may be a dictionary, an array, or a string. Strings may contain any legal content, including parentheses, curly braces, and escaped text like \n.

Seen in action:

p parse('{ "array" = ( "1", "2", ( "3", "4" ) ); "hash"={ "key"={ "more"="oh}]yes;!"; }; }; }')
#=> {"array"=>["1", "2", ["3", "4"]], "hash"=>{"key"=>{"more"=>"oh}]yes;!"}}}

puts parse('("Escaped \"Quotes\" Allowed", "And Unicode \u2623 OK")')
#=> Escaped "Quotes" Allowed
#=> And Unicode ☣ OK

The code:

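require 'strscan'
def parse(str)
  # Declare the lambdas up front so they can refer to one another.
  ss, getstr, getary, getdct = StringScanner.new(str)
  # A value is a dictionary, an array, or a string; nil marks a closing ) or }.
  getvalue = ->{
    if    ss.scan(/\s*\{\s*/)   then getdct[]
    elsif ss.scan(/\s*\(\s*/)   then getary[]
    elsif str = getstr[]        then str
    elsif ss.scan(/\s*[)}]\s*/) then nil end
  }
  getstr = ->{
    if str=ss.scan(/\s*"(?:[^"\\]|\\u\d+|\\.)*"\s*/i)
      # Escape # before {, @ and $ so that eval cannot interpolate.
      eval str.gsub(/([^\\](?:\\\\)*)#(?=[{@$])/,'\1\#')
    end
  }
  getary = ->{
    [].tap do |a|
      while v=getvalue[]
        a << v
        ss.scan(/\s*,\s*/)
      end
    end
  }
  getdct = ->{
    {}.tap do |h|
      while key = getstr[]
        ss.scan(/\s*=\s*/)
        if value=getvalue[] then h[key]=value; ss.scan(/\s*;\s*/) end
      end
      # Consume this dictionary's closing brace so the caller can continue with
      # the following ';' or ','. Without it, keys after a nested dictionary are lost.
      ss.scan(/\s*\}\s*/)
    end
  }
  getvalue[]
end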

As an alternative to rolling your own parser from scratch in the future, you might also want to look into the Treetop Ruby library.


Edit: I've replaced the implementation of getstr above with one that should prevent running arbitrary Ruby code inside the eval. For more details, see "Eval a string without interpolation". Seen in action:

@secret = "OH NO!"
$secret = "OH NO!"
@@secret = "OH NO!"
puts parse('"\"#{:NOT&&:very}\" bad. \u262E\n#@secret \\#$secret \\\\#@@secret"')
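
If relying on eval at all is a concern, the escapes could instead be decoded with a lookup table. The following is only a sketch of that idea and not part of the answer's parser: unescape_plist_string and SIMPLE_ESCAPES are hypothetical names, and it assumes Unicode escapes are always a \u or \U followed by exactly four hex digits, which matches the \U2623 form from the question and the \u2623 form used above.

```ruby
# Hypothetical helper, not part of the parser above: decode the escapes in a
# quoted string with a lookup table instead of eval.
SIMPLE_ESCAPES = { 'n' => "\n", 'r' => "\r", 't' => "\t", '"' => '"', '\\' => '\\' }.freeze

def unescape_plist_string(quoted)
  body = quoted.strip[1..-2]              # drop the surrounding double quotes
  body.gsub(/\\(?:[uU](\h{4})|(.))/m) do
    hex, char = $1, $2
    if hex
      [hex.hex].pack('U')                 # \u2623 or \U2623 becomes that code point
    else
      SIMPLE_ESCAPES.fetch(char, char)    # unknown escapes keep the bare character
    end
  end
end

puts unescape_plist_string('"Unicode \U2623 OK, #{1+1} stays literal"')
# prints: Unicode ☣ OK, #{1+1} stays literal
```

Because nothing is evaluated, text such as #{...} or #@secret passes through as ordinary characters, so no interpolation guard is needed.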


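Answer 2:

Here's a very quick-and-dirty hack that transforms the syntax into valid Ruby and then evals it. Note that eval runs whatever the input contains, so this is dangerous with untrusted data. More importantly, this will convert all parentheses inside keys and values into square brackets. The patterns also expect a single space around '=' and after ';', as in the one-line example at the end of this answer.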

def parse(str)
  eval(
    str
      .gsub( /" = (?=[({"])/, '" => ' )      # Dictionary separators become =>
      .gsub( /(?<=[)}"]); (?=[)}"])/, ', ' ) # Dictionary semicolons become ,
      .tr( '()', '[]' )                      # ALL parens become square brackets
  )
end

p parse('{ "anArray" = ( "100", "200", "300" ); "aDictionary" = { "aString" = "Something"; }; }')
#=> {"anArray"=>["100", "200", "300"], "aDictionary"=>{"aString"=>"Something"}}
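
The parenthesis caveat is easy to reproduce. The sketch below repeats the same three transformations under a different, made-up name (parse_by_eval) so it can run on its own, and feeds it a value that contains parentheses:

```ruby
# Same transformation as the parse method in this answer, renamed
# (hypothetical name) so the snippet is self-contained.
def parse_by_eval(str)
  eval(
    str
      .gsub(/" = (?=[({"])/, '" => ')       # dictionary separators become =>
      .gsub(/(?<=[)}"]); (?=[)}"])/, ', ')  # dictionary semicolons become ,
      .tr('()', '[]')                       # ALL parens become square brackets
  )
end

result = parse_by_eval('{ "note" = "call (me)"; }')
puts result["note"]
# prints: call [me]
```

The tr call cannot tell structural parentheses from ones inside a string, which is why the value comes back altered.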