Shortened URL with my current regex in regexpal:
http://bit.ly/1jbOFGd
I have a line of key=value pairs, space-delimited. Some values contain spaces and punctuation, so I do a positive lookahead to check for the existence of another key.
I want to tokenize the keys and values, which I later convert to a dict in Python.
My guess is that I can speed this up by getting rid of .*?, but how? In Python I convert 10,000 of these lines in 4.3 seconds; I'd like to double or triple that speed by making this regex match more efficient.
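For reference, here is a minimal sketch of that kind of pipeline, with a hypothetical sample line and a deliberately simplified stand-in pattern (the real regex is only behind the shortened link, and unlike this stand-in it has to cope with spaces inside values); the findall-then-dict step and the 10,000-conversion timing mirror what is described above:

import re
import timeit

# Simplified stand-in pattern for illustration only: it assumes values
# contain no whitespace, which the real data does not guarantee.
PAIR = re.compile(r'(\w+)=(\S+)')

# Hypothetical sample line in the space-delimited key=value format
line = 'host=web01 port=8080 status=ok'

def to_dict(line):
    # findall yields (key, value) tuples; dict() finishes the tokenizing step
    return dict(PAIR.findall(line))

print(to_dict(line))  # {'host': 'web01', 'port': '8080', 'status': 'ok'}

# Rough timing for 10,000 conversions, comparable to the 4.3 s figure above
print(timeit.timeit(lambda: to_dict(line), number=10000))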
Update:
(?<=\s|\A)([^\s=]+)=(.*?)(?=(?:\s[^\s=]+=|$))
I would think this one is more efficient than yours (even though it still uses the lazy .*? for the value, its lookahead is nowhere near as complex and doesn't use a lazy modifier), but I'll need you to test. This does the same as my original expression, but handles values differently: it uses a lazy .*? match followed by a lookahead for either a space, followed by a key, followed by an =, OR the end of the string. Notice I always define a key as [^\s=]+, so keys cannot contain an equal sign or whitespace (being this specific helps us avoid lazy matches).
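A sketch of how this pattern might be used in Python: the built-in re module only accepts fixed-width lookbehinds, so (?<=\s|\A) needs to be replaced, e.g. with the equivalent (?<!\S) ("not preceded by a non-whitespace character"); alternatively the pattern can be used unchanged with the third-party regex module. With that one adjustment, values containing spaces and punctuation come through intact:

import re

# (?<!\S) is a fixed-width substitute for (?<=\s|\A): match only at the
# start of the string or immediately after whitespace.
PAIR = re.compile(r'(?<!\S)([^\s=]+)=(.*?)(?=(?:\s[^\s=]+=|$))')

# Hypothetical sample line; values may contain spaces and punctuation
line = 'host=web01 msg=disk almost full, check it level=WARN'

print(dict(PAIR.findall(line)))
# {'host': 'web01', 'msg': 'disk almost full, check it', 'level': 'WARN'}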
Source
Original:
Are there some rules I am missing that you need, or would something this simple work?
(?<=\s|\A)([^=]+)=([\S]+)
This starts with a lookbehind of either a space character (\s) or the beginning of the string (\A). Then we match everything except =, followed by a =, and match everything except whitespace (\s).
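A sketch of this in Python (again swapping the lookbehind for the fixed-width equivalent (?<!\S), since re rejects the variable-width (?<=\s|\A)); it works when values contain no whitespace, and the second example shows what happens when they do:

import re

# Fixed-width-lookbehind version of (?<=\s|\A)([^=]+)=([\S]+)
PAIR = re.compile(r'(?<!\S)([^=]+)=([\S]+)')

print(dict(PAIR.findall('host=web01 level=WARN code=503')))
# {'host': 'web01', 'level': 'WARN', 'code': '503'}

# A value containing a space bleeds into the next key, which is why
# the question uses the lookahead-based approach shown above.
print(dict(PAIR.findall('msg=disk full level=WARN')))
# {'msg': 'disk', 'full level': 'WARN'}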
"Lookbehind" (related to 'lookahead' and 'lookaround') is the key 'regular expression' concept to read up on here - it let's you match and skip individual components of the string.
Good examples here: http://www.rexegg.com/regex-lookarounds.html.