Python regex expression to remove hyphens between lowercase letters

Posted 2020-05-10 08:22

Question:

I need to remove the hyphens between lowercase letters only. This is the expression that I have currently:

re.sub('\[a-z]-\[a-z]', "", 'hyphen-ated Asia-Pacific 11-12')

I want it to return:

'hyphenated Asia-Pacific 11-12'

Answer 1:

Two approaches including some timing:

import re, timeit

def a1():
    s = re.sub(r'([a-z])-([a-z])', r'\1\2', "hyphen-ated Asia-Pacific 11-12")

def a2():
    s = re.sub(r'(?<=[a-z])-(?=[a-z])', '', "hyphen-ated Asia-Pacific 11-12")

print(timeit.timeit(a1, number = 10**5))
print(timeit.timeit(a2, number = 10**5))

Yields

0.9709542730015528
0.37731508900105837

So, in this run, the lookaround version was about 2.5 times faster. Results may differ for other inputs.



Answer 2:

TL;DR:

>>> re.sub('([a-z])-(?=[a-z])', r'\1', 'oblig-a-tory hyphen-ated Asia-Pacific 11-12')
'obligatory hyphenated Asia-Pacific 11-12'

or

>>> re.sub('(?<=[a-z])-(?=[a-z])', '', 'oblig-a-tory hyphen-ated Asia-Pacific 11-12')
'obligatory hyphenated Asia-Pacific 11-12'

The main complication of a contextual replacement ("find all hyphens surrounded by lower-case letters") is that the trailing context (the part following the pattern to match) must not be included in the match. If it is, it will not be able to participate in the next leading match.

An example would probably make that clearer.

The naive solution would be

>>> re.sub('([a-z])-([a-z])', r'\1\2', 'hyphen-ated Asia-Pacific 11-12')
'hyphenated Asia-Pacific 11-12'

which differs from the call in the question because it matches the lower-case letters around the hyphen, capturing them so that they can be reinserted into the result. In this case, the only substring matched by the pattern was n-a, and it was correctly replaced with na.

But suppose we had two hyphens closer together, like this:

>>> re.sub('([a-z])-([a-z])', r'\1\2', 'oblig-a-tory hyphen-ated Asia-Pacific 11-12')
'obliga-tory hyphenated Asia-Pacific 11-12'

The a was part of the match g-a and the search resumed at the - following the a. So it never saw the pattern a-t, which would have matched.

To fix this problem, we can use a lookahead assertion:

>>> re.sub('([a-z])-(?=[a-z])', r'\1', 'oblig-a-tory hyphen-ated Asia-Pacific 11-12')
'obligatory hyphenated Asia-Pacific 11-12'

Now the trailing context (the lower-case letter following the hyphen) is not part of the match, and consequently we don't need to reinsert it in the replacement. That means that after matching g- with a trailing a, the search resumes starting at a and the next match will be a- with trailing t.
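
To see which characters each pattern actually consumes, a small sketch using re.finditer can list the span of every match. The helper name match_spans is made up for this illustration and is not part of the original answer.

```python
import re

def match_spans(pattern, text):
    """Return (start, end, matched_text) for each non-overlapping match."""
    return [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, text)]

text = 'oblig-a-tory'

# Naive pattern: the letter after the hyphen is consumed by the match,
# so the scan resumes at the second hyphen and 'a-t' is never seen.
naive = match_spans(r'([a-z])-([a-z])', text)

# Lookahead pattern: the letter after the hyphen is only checked, not consumed,
# so the scan resumes at 'a' and finds the second hyphen as well.
lookahead = match_spans(r'([a-z])-(?=[a-z])', text)

print(naive)      # [(4, 7, 'g-a')]
print(lookahead)  # [(4, 6, 'g-'), (6, 8, 'a-')]
```

The second list contains two matches because the first one ends before the a, which leaves it available as leading context for the next match.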

Python can also do "lookbehinds", in which a pattern only matches if another pattern precedes it. Using both a lookbehind and a lookahead, we could write:

>>> re.sub('(?<=[a-z])-(?=[a-z])', '', 'oblig-a-tory hyphen-ated Asia-Pacific 11-12')
'obligatory hyphenated Asia-Pacific 11-12'

This also produces the correct answer. Now we're just matching the hyphen, but insisting that it be preceded and followed by a lower-case letter. Since the match is just the hyphen, the replacement string can be empty.
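
As a rough sketch, the lookaround version could be wrapped in a small reusable function. The names HYPHEN_BETWEEN_LOWER and remove_inner_hyphens are hypothetical, chosen only for this example, and the extra test strings are not from the question.

```python
import re

# A hyphen that is preceded and followed by a lower-case ASCII letter.
# Only the hyphen itself is part of the match.
HYPHEN_BETWEEN_LOWER = re.compile(r'(?<=[a-z])-(?=[a-z])')

def remove_inner_hyphens(text):
    """Remove hyphens that sit directly between two lower-case letters."""
    return HYPHEN_BETWEEN_LOWER.sub('', text)

print(remove_inner_hyphens('oblig-a-tory hyphen-ated Asia-Pacific 11-12'))
# obligatory hyphenated Asia-Pacific 11-12

print(remove_inner_hyphens('a-b-c-d'))
# abcd

# Hyphens next to digits, other hyphens, spaces, or the string boundary stay.
print(remove_inner_hyphens('x-1 y--z -start end-'))
# x-1 y--z -start end-
```

Note that [a-z] covers only the ASCII letters a to z, so a hyphen next to an accented lower-case letter would be left in place.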

Sometimes using a lookbehind like this speeds up the match. Sometimes it slows it down. It's always worth doing a benchmark with a particular pattern if speed matters to you. But the first task is to get the match right.
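
One possible way to run such a benchmark is sketched below. The longer input string and the repetition count are arbitrary assumptions; timings depend on the machine and Python version, so no numbers are shown.

```python
import re
import timeit

# Hypothetical longer input, built by repeating the example string.
text = 'oblig-a-tory hyphen-ated Asia-Pacific 11-12 ' * 100

lookahead_only = re.compile(r'([a-z])-(?=[a-z])')
lookbehind_and_ahead = re.compile(r'(?<=[a-z])-(?=[a-z])')

def with_lookahead():
    return lookahead_only.sub(r'\1', text)

def with_lookarounds():
    return lookbehind_and_ahead.sub('', text)

# Get the match right first: both variants must agree before timing them.
assert with_lookahead() == with_lookarounds()

for func in (with_lookahead, with_lookarounds):
    seconds = timeit.timeit(func, number=1000)
    print(f'{func.__name__}: {seconds:.4f} s')
```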



Answer 3:

re.sub(r'([a-z])-([a-z])', r'\1\2', "hyphen-ated Asia-Pacific 11-12")

This captures the letters before and after the hyphen and preserves them when stripping the hyphen. \1 and \2 denote the first and second captured groups, which are the letters in this case.

Your current code matches the two letters around the hyphen and removes the whole match. You should preserve the letters when substituting.
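
A short sketch of the difference, assuming the pattern in the question was meant to be [a-z]-[a-z] without the backslashes (they may be a formatting artifact). The variable names are made up for this illustration.

```python
import re

text = 'hyphen-ated Asia-Pacific 11-12'

# Without groups, the whole match 'n-a' is replaced by the empty string,
# so the two letters disappear together with the hyphen.
whole_match_removed = re.sub(r'[a-z]-[a-z]', '', text)

# With groups, the two letters are put back and only the hyphen is dropped.
letters_preserved = re.sub(r'([a-z])-([a-z])', r'\1\2', text)

print(whole_match_removed)  # hypheted Asia-Pacific 11-12
print(letters_preserved)    # hyphenated Asia-Pacific 11-12
```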