Python regex expression to remove hyphens between

I need to remove the hyphens between lowercase letters only. This is the expression that I have currently:

re.sub('\[a-z]-\[a-z]', "", 'hyphen-ated Asia-Pacific 11-12')

I want it to return:

'hyphenated Asia-Pacific 11-12'

Tags: python regex hyphen

3 answers

老娘就宠你

Reply #2 · 2020-05-10 08:22

Two approaches including some timing:

import re, timeit

def a1():
    s = re.sub(r'([a-z])-([a-z])', r'\1\2', "hyphen-ated Asia-Pacific 11-12")

def a2():
    s = re.sub(r'(?<=[a-z])-(?=[a-z])', '', "hyphen-ated Asia-Pacific 11-12")

print(timeit.timeit(a1, number = 10**5))
print(timeit.timeit(a2, number = 10**5))

Yields

0.9709542730015528
0.37731508900105837

So, in this case lookarounds might be faster.


三岁会撩人

Reply #3 · 2020-05-10 08:28

re.sub(r'([a-z])-([a-z])', r'\1\2', "hyphen-ated Asia-Pacific 11-12")

This captures the letters before and after the hyphen and preserves them when stripping the hyphen. \1 and \2 denote the first and second captured groups, which are the letters in this case.

Your current code matches the two letters around the hyphen and removes the whole match. You should preserve the letters when substituting.


戒情不戒烟

Reply #4 · 2020-05-10 08:37

TL;DR:

>>> re.sub('([a-z])-(?=[a-z])', r'\1', 'oblig-a-tory hyphen-ated Asia-Pacific 11-12')
'obligatory hyphenated Asia-Pacific 11-12'

>>> re.sub('(?<=[a-z])-(?=[a-z])', '', 'oblig-a-tory hyphen-ated Asia-Pacific 11-12')
'obligatory hyphenated Asia-Pacific 11-12'

The main complication of a contextual replacement ("find all hyphens surrounded by lower-case letters") is that the trailing context (the part following the pattern to match) must not be included in the match. If it is, it will not be able to participate in the next leading match.

An example would probably make that clearer.

The naive solution would be

>>> re.sub('([a-z])-([a-z])', r'\1\2', 'hyphen-ated Asia-Pacific 11-12')
'hyphenated Asia-Pacific 11-12'

which differs from the call in the question because it matches the lower-case letters around the hyphen, capturing them so that they can be reinserted into the result. In this case, the only substring matched by the pattern was n-a and it was correctly replaced with na.

But suppose we had two hyphens closer together, like this:

>>> re.sub('([a-z])-([a-z])', r'\1\2', 'oblig-a-tory hyphen-ated Asia-Pacific 11-12')
'obliga-tory hyphenated Asia-Pacific 11-12'

The a was part of the match g-a and the search resumed at the - following the a. So it never saw the pattern a-t, which would have matched.
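To see the skipped match directly, one can list what the naive pattern actually matches. The following is only a small sketch using re.finditer; the variable names are made up for illustration and the input is shortened to the one word that causes trouble.

```python
import re

text = 'oblig-a-tory'
naive = re.compile(r'([a-z])-([a-z])')

# Each match consumes the letter after the hyphen, so the scan
# resumes at the second hyphen and 'a-t' is never considered.
matches = [(m.group(0), m.span()) for m in naive.finditer(text)]
print(matches)  # [('g-a', (4, 7))]
```

Only one match is reported, even though the word contains two hyphens that sit between lower-case letters.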

To fix this problem, we can use a lookahead assertion:

>>> re.sub('([a-z])-(?=[a-z])', r'\1', 'oblig-a-tory hyphen-ated Asia-Pacific 11-12')
'obligatory hyphenated Asia-Pacific 11-12'

Now the trailing context (the lower-case letter following the hyphen) is not part of the match, and consequently we don't need to reinsert it in the replacement. That means that after matching g- with a trailing a, the search resumes starting at a and the next match will be a- with trailing t.
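The same kind of inspection shows how the lookahead changes the match boundaries. Again this is just an illustrative sketch with made-up variable names, run on the single word 'oblig-a-tory'.

```python
import re

text = 'oblig-a-tory'
lookahead = re.compile(r'([a-z])-(?=[a-z])')

# The lookahead checks the next letter without consuming it, so the
# 'a' at index 6 is still available to start the second match.
matches = [(m.group(0), m.span()) for m in lookahead.finditer(text)]
print(matches)  # [('g-', (4, 6)), ('a-', (6, 8))]
```

The first match ends at index 6 and the second one begins there, which is exactly what the naive pattern could not do.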

Python can also do "lookbehinds", in which a pattern only matches if another pattern precedes it. Using both a lookbehind and a lookahead, we could write:

>>> re.sub('(?<=[a-z])-(?=[a-z])', '', 'oblig-a-tory hyphen-ated Asia-Pacific 11-12')
'obligatory hyphenated Asia-Pacific 11-12'

This also produces the correct answer. Now we're just matching the hyphen, but insisting that it be preceded and followed by a lower-case letter. Since the match is just the hyphen, the replacement string can be empty.
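If the substitution is needed in more than one place, it could be wrapped in a small helper. This is only a sketch: the function name remove_inner_hyphens and the module-level constant are hypothetical, and [a-z] covers ASCII lower-case letters only, so accented or non-Latin letters would need a different character class.

```python
import re

# Hypothetical name; precompiled once so repeated calls reuse the pattern.
_HYPHEN_BETWEEN_LOWER = re.compile(r'(?<=[a-z])-(?=[a-z])')

def remove_inner_hyphens(text):
    """Remove hyphens that have an ASCII lower-case letter on both sides."""
    return _HYPHEN_BETWEEN_LOWER.sub('', text)

print(remove_inner_hyphens('oblig-a-tory hyphen-ated Asia-Pacific 11-12'))
# obligatory hyphenated Asia-Pacific 11-12
```

Hyphens next to an upper-case letter or a digit are left alone, because neither satisfies the [a-z] assertions.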

Sometimes using a lookbehind like this speeds up the match. Sometimes it slows it down. It's always worth doing a benchmark with a particular pattern if speed matters to you. But the first task is to get the match right.
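One possible way to follow that advice is to check that both patterns give the same result first and only then time them. The sketch below uses timeit with precompiled patterns; the function names and the repeat count are arbitrary choices, and the timings will vary by machine and Python version, so none are shown here.

```python
import re
import timeit

text = 'oblig-a-tory hyphen-ated Asia-Pacific 11-12'
lookahead_only = re.compile(r'([a-z])-(?=[a-z])')
both_lookarounds = re.compile(r'(?<=[a-z])-(?=[a-z])')

def with_lookahead():
    return lookahead_only.sub(r'\1', text)

def with_both():
    return both_lookarounds.sub('', text)

# Get the match right first: both variants must agree on the output.
expected = 'obligatory hyphenated Asia-Pacific 11-12'
assert with_lookahead() == with_both() == expected

# Then compare speed on the input you actually care about.
for fn in (with_lookahead, with_both):
    print(fn.__name__, timeit.timeit(fn, number=10**4))
```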
