I know there are a lot of other posts about parsing comma-separated values, but I couldn't find one that splits key-value pairs and handles quoted commas.
I have strings like this:
age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"
And I want to get this:
{
'age': '12',
'name': 'bob',
'hobbies': 'games,reading',
'phrase': "I'm cool!",
}
I tried using shlex like this:
import shlex
lexer = shlex.shlex('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"''')
lexer.whitespace_split = True
lexer.whitespace = ','
props = dict(pair.split('=', 1) for pair in lexer)
The trouble is that shlex will split the hobbies entry into two tokens, i.e. hobbies="games and reading". Is there a way to make it take the double quotes into account? Or is there another module I can use?
EDIT: Fixed typo for whitespace_split
EDIT 2: I'm not tied to using shlex. Regex is fine too, but I didn't know how to handle the matching quotes.
It's possible to do this with a regular expression. In this case, it might actually be the best option, too. I think this will work with most input, even escaped quotes such as this one:
phrase='I\'m cool'
With the VERBOSE flag, it's possible to make complicated regular expressions quite readable.
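The regex for this answer is not shown here, so the following is only a minimal sketch of the idea, not the original pattern. The names PAIR_RE and parse_pairs are made up for illustration, and it assumes keys consist of word characters and that a backslash inside quotes escapes the next character:

```python
import re

# Hypothetical pattern: one key=value pair, value quoted or bare.
PAIR_RE = re.compile(r'''
    (?P<key>\w+)                        # key made of word characters
    =
    (?:
        "(?P<dq>(?:\\.|[^"\\])*)"       # double-quoted value, backslash escapes allowed
      | '(?P<sq>(?:\\.|[^'\\])*)'       # single-quoted value, backslash escapes allowed
      | (?P<bare>[^,]*)                 # unquoted value, runs up to the next comma
    )
    (?:,|$)                             # pair ends at a comma or at end of input
''', re.VERBOSE)


def parse_pairs(text):
    """Return a dict built from every key=value pair the pattern finds."""
    result = {}
    for match in PAIR_RE.finditer(text):
        quoted = match.group('dq')
        if quoted is None:
            quoted = match.group('sq')
        if quoted is not None:
            # Drop the backslash from escaped characters such as \' or \"
            value = re.sub(r'\\(.)', r'\1', quoted)
        else:
            value = match.group('bare')
        result[match.group('key')] = value
    return result


print(parse_pairs('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''))
print(parse_pairs(r"phrase='I\'m cool'"))
```

Note that finditer silently skips text the pattern cannot match, so this sketch does not validate the input.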
You just needed to use your shlex lexer in POSIX mode. Add posix=True when creating the lexer. (See the shlex parsing rules.)
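A minimal sketch of that change, reusing the code from the question (the wrapper function parse_shlex is just a name chosen here for illustration):

```python
import shlex


def parse_shlex(text):
    # posix=True makes shlex treat quotes as part of a word and strip them,
    # so a comma inside "..." no longer splits the token.
    lexer = shlex.shlex(text, posix=True)
    lexer.whitespace_split = True
    lexer.whitespace = ','
    return dict(pair.split('=', 1) for pair in lexer)


props = parse_shlex('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"''')
print(props)
```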
Outputs :
PS : Regular expressions won't be able to parse key-value pairs as long as the input can contain quoted = or , characters. Even preprocessing the string wouldn't be able to make the input be parsed by a regular expression, because that kind of input cannot be formally defined as a regular language.
Python seems to offer many ways to solve the task. Here is a more C-like implementation that processes each character. It would be interesting to compare the run times.
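The code for the character-by-character approach is not included here; a rough sketch of how it could look follows. The function name parse_chars is hypothetical, and it assumes there are no backslash escapes and that a quote character outside a quoted section always opens one:

```python
def parse_chars(text):
    """Walk the string one character at a time, tracking quote state."""
    result = {}
    key, value = [], []
    target = key      # list that currently receives characters
    quote = None      # the open quote character, or None when outside quotes
    for ch in text:
        if quote:
            if ch == quote:
                quote = None          # closing quote: leave quoted section
            else:
                target.append(ch)     # commas and = are literal inside quotes
        elif ch in '"\'':
            quote = ch                # opening quote
        elif ch == '=' and target is key:
            target = value            # first = switches from key to value
        elif ch == ',':
            result[''.join(key)] = ''.join(value)
            key, value = [], []
            target = key
        else:
            target.append(ch)
    if key:
        result[''.join(key)] = ''.join(value)   # flush the last pair
    return result


print(parse_chars('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''))
```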
demo: http://repl.it/6oC/1
You could abuse Python's tokenizer to parse the key-value list:
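As a sketch of what that might look like with the standard tokenize module (the helper name parse_with_tokenizer is made up here, and it assumes every key and every unquoted value is a single Python token such as a name or a number):

```python
import ast
import io
import tokenize

# Token types that carry no data for this purpose.
SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}


def parse_with_tokenizer(text):
    tokens = [
        tok for tok in tokenize.generate_tokens(io.StringIO(text).readline)
        if tok.type not in SKIP
        and not (tok.type == tokenize.OP and tok.string == ',')
    ]
    result = {}
    # With the commas dropped, the stream should be: key, '=', value, ...
    for i in range(0, len(tokens), 3):
        key, equals, value = tokens[i:i + 3]   # ValueError if a pair is incomplete
        if equals.string != '=':
            raise ValueError('expected "=" after %r' % key.string)
        if value.type == tokenize.STRING:
            # Let Python itself interpret the quoted literal.
            result[key.string] = ast.literal_eval(value.string)
        else:
            result[key.string] = value.string
    return result


print(parse_with_tokenizer('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''))
```

Values that Python would split into several tokens (for example -5 or 12abc) are not handled by this sketch.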
Output
You could use a finite-state machine (FSM) to implement a stricter parser. The parser uses only the current state and the next token to parse input:
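The parser itself is not shown here, so this is only one possible sketch of such a state machine. The token pattern, the state names and the functions lex and parse_fsm are assumptions for illustration; backslash escapes are not supported:

```python
import re

# Hypothetical tokens: quoted string, punctuation (= or ,), or a bare word.
TOKEN_RE = re.compile(r'''
      "(?P<dq>[^"]*)"
    | '(?P<sq>[^']*)'
    | (?P<punct>[=,])
    | (?P<word>[^=,"']+)
''', re.VERBOSE)

KEY, EQUALS, VALUE, COMMA = range(4)   # what the parser expects next


def lex(text):
    """Yield (kind, value) tokens; reject anything the pattern cannot cover."""
    pos = 0
    for m in TOKEN_RE.finditer(text):
        if m.start() != pos:
            raise ValueError('unexpected character at position %d' % pos)
        pos = m.end()
        if m.group('punct') is not None:
            yield 'PUNCT', m.group('punct')
        else:
            yield 'TEXT', m.group(m.lastgroup)
    if pos != len(text):
        raise ValueError('unexpected character at position %d' % pos)


def parse_fsm(text):
    result = {}
    state, key = KEY, None
    for kind, value in lex(text):
        # Each transition depends only on the current state and this token.
        if state == KEY and kind == 'TEXT':
            key, state = value, EQUALS
        elif state == EQUALS and (kind, value) == ('PUNCT', '='):
            state = VALUE
        elif state == VALUE and kind == 'TEXT':
            result[key] = value
            state = COMMA
        elif state == COMMA and (kind, value) == ('PUNCT', ','):
            state = KEY
        else:
            raise ValueError('unexpected token %r' % (value,))
    if state != COMMA:
        raise ValueError('input ended in the middle of a pair')
    return result


print(parse_fsm('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''))
```

Because quoted text is a different token kind from punctuation, input such as age,12=name,bob raises ValueError instead of being accepted silently.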
Ok, I actually figured out a pretty nifty way, which is to split on both comma and equal sign, then take 2 tokens at a time.
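The code for this is not shown here. One way to sketch the idea, assuming shlex in POSIX mode does the splitting (the function name parse_by_twos is made up for illustration):

```python
import shlex


def parse_by_twos(text):
    lexer = shlex.shlex(text, posix=True)
    lexer.whitespace_split = True
    lexer.whitespace = ',='        # split on both comma and equal sign
    # zip over the same iterator twice pairs up consecutive tokens:
    # (token0, token1), (token2, token3), ...
    return dict(zip(lexer, lexer))


print(parse_by_twos('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''))
```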
Then you get:
However, this doesn't check that you don't have weird stuff like age,12=name,bob, but I'm ok with that in my use case.
EDIT: Handle both double-quotes and single-quotes.