I have this string:
%{Children^10 Health "sanitation management"^5}
And I want to tokenize it into an array of hashes:
[{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management", :boost=>5}]
I'm aware of StringScanner and the Syntax gem, but I can't find enough code examples for either.
Any pointers?
Here is a non-robust example using `StringScanner`. This is code I just adapted from Ruby Quiz: Parsing JSON, which has an excellent explanation.

For a real language, a lexer's the way to go - like Guss said. But if the full language is only as complicated as your example, you can use this quick hack:
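A minimal sketch of such a hack, assuming the input only ever contains bare words, double-quoted phrases, and optional `^number` boosts. The names `KEYWORD_RE` and `parse_keywords` are illustrative and not part of any library:

```ruby
# Illustrative names (KEYWORD_RE, parse_keywords); a sketch, not a robust parser.
# Group 1: bare word, group 2: contents of a quoted phrase, group 3: boost digits.
KEYWORD_RE = /(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/

def parse_keywords(string)
  # String#scan yields one [word, phrase, boost] array per match;
  # groups that did not take part in the match come back as nil.
  string.scan(KEYWORD_RE).map do |word, phrase, boost|
    {
      :keywords => (word || phrase).downcase,
      :boost    => (boost.nil? ? nil : boost.to_i)
    }
  end
end

input = %{Children^10 Health "sanitation management"^5}
p parse_keywords(input)
```

For the string in the question this should give the array of hashes asked for. Anything the regex does not recognise (stray punctuation, an unbalanced quote) is silently skipped by `scan`, which is part of why this is only a quick hack.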
If you're trying to parse a regular language then this method will suffice - though it wouldn't take many more complications to make the language non-regular.
A quick breakdown of the regex:
- `\w+` matches any single-term keyword.
- `(?:\\.|[^\\"])*` uses non-capturing parentheses (`(?:...)`) to match the contents of an escaped double-quoted string - either an escaped symbol (`\n`, `\"`, `\\`, etc.) or any single character that's not an escape symbol or an end quote.
- `"((?:\\.|[^\\"])*)"` captures only the contents of a quoted keyword phrase.
- `(?:(\w+)|"((?:\\.|[^\\"])*)")` matches any keyword - single term or phrase, capturing single terms into `$1` and phrase contents into `$2`.
- `\d+` matches a number.
- `\^(\d+)` captures a number following a caret (`^`). Since this is the third set of capturing parentheses, it will be captured into `$3`.
- `(?:\^(\d+))?` captures a number following a caret if it's there, and matches the empty string otherwise.

`String#scan(regex)`
matches the regex against the string as many times as possible, outputting an array of "matches". If the regex contains capturing parens, a "match" is an array of items captured - so `$1` becomes `match[0]`, `$2` becomes `match[1]`, etc. Any capturing parenthesis that doesn't get matched against part of the string maps to a `nil` entry in the resulting "match".

The
`#map` then takes these matches, uses some block magic to break each captured term into different variables (we could have done `do |match| ; word,phrase,boost = *match`), and then creates your desired hashes. Exactly one of `word` or `phrase` will be `nil`, since both can't be matched against the input, so `(word || phrase)` will return the non-`nil` one, and `#downcase` will convert it to all lowercase. `boost.to_i` will convert a string to an integer while `(boost.nil? ? nil : boost.to_i)` will ensure that `nil` boosts stay `nil`.

What you have here is an arbitrary grammar, and to parse it what you really want is a lexer - you can write a grammar file that describes your syntax and then use the lexer to generate a recursive parser from your grammar.
Writing a lexer (or even a recursive parser) is not really trivial - although it is a useful exercise in programming - but you can find a list of Ruby lexers/parsers in this email message here: http://newsgroups.derkeiler.com/Archive/Comp/comp.lang.ruby/2005-11/msg02233.html
RACC is available as a standard module of Ruby 1.8, so I suggest you concentrate on that even if its manual is not really easy to follow and it requires familiarity with yacc.
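For a grammar as small as the one in the question, a hand-written tokenizer on top of `StringScanner` (from Ruby's standard library, `require 'strscan'`) may be enough before reaching for RACC. The following is a minimal, non-robust sketch under that assumption; `tokenize` is a hypothetical helper name, and the accepted syntax (bare words, double-quoted phrases, optional `^number` boost) is inferred from the example string only:

```ruby
require 'strscan'

# Hypothetical helper: walks the string left to right and emits one hash
# per keyword. Unlike a plain String#scan, it raises on unrecognised input.
def tokenize(string)
  scanner = StringScanner.new(string)
  tokens = []
  until scanner.eos?
    scanner.skip(/\s+/)                        # ignore whitespace between terms
    break if scanner.eos?

    if scanner.scan(/"((?:\\.|[^\\"])*)"/)     # quoted phrase
      keywords = scanner[1]
    elsif scanner.scan(/\w+/)                  # single bare word
      keywords = scanner.matched
    else
      raise ArgumentError,
            "unexpected input at position #{scanner.pos}: #{scanner.rest.inspect}"
    end

    # Optional boost: a caret followed by digits, directly after the keyword.
    boost = scanner.scan(/\^(\d+)/) ? scanner[1].to_i : nil
    tokens << { :keywords => keywords.downcase, :boost => boost }
  end
  tokens
end

p tokenize(%{Children^10 Health "sanitation management"^5})
```

The explicit `else` branch is the main design difference from a single-regex approach: malformed input is reported with its position instead of being skipped.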