Question:
I have a naive "parser" that simply does something like:
[x.split('=') for x in mystring.split(',')]
However mystring can be something like
'foo=bar,breakfast=spam,eggs'
Obviously, the naive splitter will just not do it. I am limited to the Python 2.6 standard library for this, so, for example, pyparsing cannot be used.
Expected output is
[('foo', 'bar'), ('breakfast', 'spam,eggs')]
I'm trying to do this with regex, but am facing the following problems:
My first attempt,
r'([a-z_]+)=(.+),?'
gave me
[('foo', 'bar,breakfast=spam,eggs')]
Obviously, making .+ non-greedy does not solve the problem.
So I'm guessing I have to somehow make the last comma (or $) mandatory.
Doing just that does not really work either:
r'([a-z_]+)=(.+?)(?:,|$)'
With that, whatever follows the comma in a value that contains one is omitted,
e.g. [('foo', 'bar'), ('breakfast', 'spam')]
I think I must use some sort of look-behind(?) operation.
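For reference, the attempts described above can be reproduced with re.findall. This is only a sketch of the behaviour described in the question; the variable names (greedy, lazy, anchored) are illustrative and not from the original post.

```python
import re

mystring = 'foo=bar,breakfast=spam,eggs'

# Greedy .+ runs to the end of the string, so only one pair is found.
greedy = re.findall(r'([a-z_]+)=(.+),?', mystring)

# Lazy .+? stops after a single character, because nothing forces it further.
lazy = re.findall(r'([a-z_]+)=(.+?),?', mystring)

# Requiring a comma or end of string ends each value at its first comma.
anchored = re.findall(r'([a-z_]+)=(.+?)(?:,|$)', mystring)
```

None of the three patterns knows that a comma only ends a value when a new key=... follows it, which is the piece of information the answers below supply.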
The Question(s)
1. Which one do I use? or
2. How do I do that/this?
Edit:
Based on daramarak's answer below, I ended up doing pretty much the same thing as abarnert later suggested, in a slightly more verbose form:
vals = [x.rsplit(',', 1) for x in (data.split('='))]
ret = list()
while vals:
    value = vals.pop()[0]
    key = vals[-1].pop()
    ret.append((key, value))
    if len(vals[-1]) == 0:
        break
EDIT 2:
Just to satisfy my curiosity, is this actually possible with pure regular expressions? I.e. so that re.findall() would return a list of 2-tuples?
Answer 1:
Just for comparison purposes, here's a regex that seems to solve the problem as well:
([^=]+) # key
= # equals is how we tokenise the original string
([^=]+) # value
(?:,|$) # value terminator, either comma or end of string
The trick here is to restrict what you're capturing in your second group. .+ swallows the = sign, which is the character we can use to distinguish keys from values. The full regex uses no lookaround or backreferences (so it should be compatible with something like re2, if that's desirable) and works on abarnert's examples.
Usage as follows:
re.findall(r'([^=]+)=([^=]+)(?:,|$)', 'foo=bar,breakfast=spam,eggs,blt=bacon,lettuce,tomato,spam=spam')
Which returns:
[('foo', 'bar'), ('breakfast', 'spam,eggs'), ('blt', 'bacon,lettuce,tomato'), ('spam', 'spam')]
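The commented pattern above can be used as written if it is compiled with re.VERBOSE, which makes the engine ignore the whitespace and the # comments. A minimal sketch; the names PAIR_RE and parse_pairs are made up for this illustration.

```python
import re

# Same pattern as above, kept in its commented form via re.VERBOSE.
PAIR_RE = re.compile(r"""
    ([^=]+)    # key
    =          # equals is how we tokenise the original string
    ([^=]+)    # value
    (?:,|$)    # value terminator, either comma or end of string
""", re.VERBOSE)

def parse_pairs(text):
    """Return a list of (key, value) tuples found in text."""
    return PAIR_RE.findall(text)

pairs = parse_pairs('foo=bar,breakfast=spam,eggs')
```

Both groups require at least one character, so input with an empty value (for example a=,b=c) is not handled correctly by this pattern.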
Answer 2:
daramarak's answer either very nearly works, or works as-is; it's hard to tell from the way the sample output is formatted and the vague descriptions of the steps. But if it's the very-nearly-works version, it's easy to fix.
Putting it into code:
>>> bits=[x.rsplit(',', 1) for x in s.split('=')]
>>> kv = [(bits[i][-1], bits[i+1][0]) for i in range(len(bits)-1)]
The first line is (I believe) daramarak's answer. By itself, the first line gives you pairs of (value_i, key_i+1) instead of (key_i, value_i). The second line is the most obvious fix for that. With more intermediate steps, and a bit of output, to see how it works:
>>> s = 'foo=bar,breakfast=spam,eggs,blt=bacon,lettuce,tomato,spam=spam'
>>> bits0 = s.split('=')
>>> bits0
['foo', 'bar,breakfast', 'spam,eggs,blt', 'bacon,lettuce,tomato,spam', 'spam']
>>> bits = [x.rsplit(',', 1) for x in bits0]
>>> bits
[['foo'], ['bar', 'breakfast'], ['spam,eggs', 'blt'], ['bacon,lettuce,tomato', 'spam'], ['spam']]
>>> kv = [(bits[i][-1], bits[i+1][0]) for i in range(len(bits)-1)]
>>> kv
[('foo', 'bar'), ('breakfast', 'spam,eggs'), ('blt', 'bacon,lettuce,tomato'), ('spam', 'spam')]
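One edge case worth checking: rsplit is also applied to the final chunk, so if the last value itself contains a comma (as in the question's 'foo=bar,breakfast=spam,eggs'), bits[i+1][0] keeps only the part before that comma. The sketch below is one possible adjustment, not part of the original answer: it leaves the last chunk unsplit. The function name parse_kv is made up for this illustration.

```python
def parse_kv(s):
    """Split 'k1=v1,k2=v2,...' into (key, value) tuples; values may contain commas."""
    chunks = s.split('=')
    # Every chunk except the last has the form 'value,next_key'
    # (the first chunk is just the first key).
    bits = [x.rsplit(',', 1) for x in chunks[:-1]]
    # The last chunk is a complete value, so it is kept in one piece.
    bits.append([chunks[-1]])
    return [(bits[i][-1], bits[i + 1][0]) for i in range(len(bits) - 1)]

kv = parse_kv('foo=bar,breakfast=spam,eggs')
```

For input whose last value has no comma, such as the longer example string above, this should give the same result as the two-line version.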
Answer 3:
Could I suggest that you use the split operations as before? Split at the equals signs first, then split each piece at its rightmost comma, to make a single flat list of alternating keys and values.
text = "bob=whatever,king=kong,banana=herb,good,yellow,thorn=hurts"
The first split gives
first_split = text.split("=")
# first_split = ['bob', 'whatever,king', 'kong,banana', 'herb,good,yellow,thorn', 'hurts']
then splitting at the rightmost comma gives you:
second_split = [item for sublist in first_split for item in sublist.rsplit(",", 1)]
# second_split = ['bob', 'whatever', 'king', 'kong', 'banana', 'herb,good,yellow', 'thorn', 'hurts']
then you just gather the pairs like this:
pairs = dict(zip(second_split[::2], second_split[1::2]))
Answer 4:
Can you try this? It worked for me:
mystring = "foo=bar,breakfast=spam,eggs,e=a"
n = []
i = 0
for x in mystring.split(','):
    if '=' not in x:
        n[i-1] = "{0},{1}".format(n[i-1], x)
    else:
        n.append(x)
        i += 1
print n
You get a result like:
['foo=bar', 'breakfast=spam,eggs', 'e=a']
Then you can simply go over the list and do what you want.
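As one way of going over the list, each merged item can be split at its first = to get (key, value) tuples. A minimal sketch, assuming every item contains at least one =; the variable names are illustrative.

```python
merged = ['foo=bar', 'breakfast=spam,eggs', 'e=a']

# Split at the first '=' only, so a value that itself contains '=' stays in one piece.
pairs = [tuple(item.split('=', 1)) for item in merged]
```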
Answer 5:
Assuming that the key name never contains a comma, you can split at every comma for which the following run of characters other than , and = is immediately followed by =.
re.split(r',(?=[^,=]+=)', inputString)
(This is the same as my original solution. I expect re.split to be used, rather than re.findall or str.split.)
The full solution can be done in a one-liner:
[re.findall('(.*?)=(.*)', token)[0] for token in re.split(r',(?=[^,=]+=)', inputString)]
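A self-contained version of the one-liner, written as a sketch: it needs import re, and the function name split_pairs is made up here. It keeps the answer's assumption that keys contain neither , nor =, and additionally assumes every token contains an =.

```python
import re

def split_pairs(input_string):
    """Return (key, value) tuples from a 'k=v,k=v' string whose values may contain commas."""
    # Split only at commas that are directly followed by 'key='.
    tokens = re.split(r',(?=[^,=]+=)', input_string)
    # Each token is 'key=value'; the lazy first group stops at the first '='.
    return [re.findall(r'(.*?)=(.*)', token)[0] for token in tokens]

result = split_pairs('foo=bar,breakfast=spam,eggs,blt=bacon,lettuce,tomato,spam=spam')
```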