optimizing regex to find key=value pairs, space delimited

Posted 2019-08-04 11:57

Question:

Shortened URL with my current regex in RegexPal: http://bit.ly/1jbOFGd

I have a line of key=value pairs, space delimited. Some values contain spaces and punctuation so I do a positive lookahead to check for the existence of another key.

I want to tokenize each key and value, which I later convert to a dict in Python.

My guess is that I can speed this up by getting rid of .*? but how? In Python I convert 10,000 of these lines in 4.3 seconds. I'd like to double or triple that speed by making the regex match more efficiently.
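To compare candidate patterns against that 4.3-second baseline, a small timing harness helps. The following is only a sketch: the sample line, the time_pattern helper, and the candidate pattern are made up for illustration, since the actual regex sits behind the shortened URL and is not reproduced here.

```python
import re
import timeit

def time_pattern(pattern, lines, repeat=3):
    """Return the best wall-clock time in seconds to tokenize every line once."""
    compiled = re.compile(pattern)  # compile once, outside the timed loop

    def run():
        for line in lines:
            # Each match yields a (key, value) tuple; dict() consumes them.
            dict(compiled.findall(line))

    return min(timeit.repeat(run, number=1, repeat=repeat))

# Hypothetical data: 10,000 copies of an invented line.
lines = ["id=42 msg=hello, world! level=warn"] * 10_000

# Placeholder pattern with two capture groups; substitute the regex under test.
candidate = r"(?<!\S)([^\s=]+)=(\S+)"

elapsed = time_pattern(candidate, lines)
print(f"{elapsed:.3f} s for {len(lines)} lines")
```

Taking the minimum over a few repeats reduces noise from other processes, so two patterns can be compared on the same input.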

Answer 1:

Update:

(?<=\s|\A)([^\s=]+)=(.*?)(?=(?:\s[^\s=]+=|$))

I would expect this one to be more efficient than yours, but I'll need you to test it. It still uses .*? for the value, but its lookahead is nowhere near as complex and contains no lazy quantifier. It does the same job as my original expression but handles values differently: a lazy .*? match is followed by a lookahead for either a space, then a key, then an =, OR the end of the string. Notice that I always define a key as [^\s=]+, so keys cannot contain an equals sign or whitespace (being this specific helps us avoid lazy matches).
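A minimal Python sketch of the updated expression follows. One adjustment is an assumption on my part rather than part of the answer: Python's built-in re module only accepts fixed-width lookbehind, so (?<=\s|\A) does not compile there, and the sketch uses the equivalent (?<!\S) ("not preceded by a non-whitespace character") instead. The parse_pairs name and the sample line are invented for illustration.

```python
import re

# The answer's pattern with the lookbehind rewritten as (?<!\S), because
# Python's re rejects (?<=\s|\A): its two alternatives differ in width.
PAIR_RE = re.compile(r"(?<!\S)([^\s=]+)=(.*?)(?=\s[^\s=]+=|$)")

def parse_pairs(line):
    """Return a dict mapping each key to its value for one line."""
    # findall returns a list of (key, value) tuples, one per match.
    return dict(PAIR_RE.findall(line))

# Hypothetical sample; the value of "msg" contains a space and punctuation.
sample = "id=42 msg=hello, world! level=warn"
print(parse_pairs(sample))
# {'id': '42', 'msg': 'hello, world!', 'level': 'warn'}
```

The lazy .*? stops at the first position where the lookahead sees " key=" or the end of the line, which is why "hello, world!" survives intact.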



Original:

Is there some rule I am missing that prevents you from doing something this simple?

(?<=\s|\A)([^=]+)=([\S]+)

This starts with a lookbehind for either a whitespace character (\s) or the beginning of the string (\A). Then we match one or more characters other than = as the key, followed by a literal =, and then one or more non-whitespace characters (\S) as the value.
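To see where the simple expression works and where it falls short, here is a small Python sketch. As an assumption for Python's re module, (?<!\S) stands in for (?<=\s|\A), which re refuses to compile; the sample strings are invented.

```python
import re

# Simple pattern: key is anything but "=", value is a run of non-whitespace.
# (?<!\S) replaces (?<=\s|\A), which Python's re does not accept.
SIMPLE_RE = re.compile(r"(?<!\S)([^=]+)=(\S+)")

# Values without spaces tokenize cleanly.
simple = SIMPLE_RE.findall("a=1 b=2 c=3")
print(simple)
# [('a', '1'), ('b', '2'), ('c', '3')]

# A value containing a space is cut at the space, and the leftover text is
# absorbed into the next key because [^=]+ also matches whitespace.
spaced = SIMPLE_RE.findall("id=42 msg=hello, world! level=warn")
print(spaced)
# [('id', '42'), ('msg', 'hello,'), ('world! level', 'warn')]
```

The second result is the case the question describes, values with spaces and punctuation, and it is why the updated expression adds a lookahead for the next key.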



Answer 2:

"Lookbehind" (related to 'lookahead' and 'lookaround') is the key 'regular expression' concept to read up on here - it let's you match and skip individual components of the string.

Good examples here: http://www.rexegg.com/regex-lookarounds.html.
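As a quick illustration of the idea in Python, with a made-up input string, a lookahead can pick out keys by requiring a following = and a lookbehind can pick out values by requiring a preceding =, and in neither case does the = end up in the result:

```python
import re

text = "a=1 b=2"  # hypothetical input

# Lookahead: words that are followed by "=", without consuming the "=".
keys = re.findall(r"\w+(?==)", text)
print(keys)    # ['a', 'b']

# Lookbehind: words that are preceded by "=", without consuming the "=".
values = re.findall(r"(?<==)\w+", text)
print(values)  # ['1', '2']
```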