I need to remove the hyphens between lowercase letters only. This is the expression that I have currently:
re.sub('[a-z]-[a-z]', "", 'hyphen-ated Asia-Pacific 11-12')
I want it to return:
'hyphenated Asia-Pacific 11-12'
Two approaches were timed against each other, one of them based on lookarounds.
In this case, lookarounds might be faster.
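A timing comparison along these lines could be sketched as follows. The pairing is an assumption: it supposes the two approaches are capture groups with backreferences and a lookbehind/lookahead pair, and the helper names `with_groups` and `with_lookarounds` are made up for illustration. Absolute numbers depend on the machine and Python version, so none are quoted here.

```python
import re
import timeit

text = 'hyphen-ated Asia-Pacific 11-12'

# Approach 1 (assumed): capture the surrounding letters and put them back.
groups = re.compile(r'([a-z])-([a-z])')
# Approach 2 (assumed): match only the hyphen, using lookbehind and lookahead.
lookarounds = re.compile(r'(?<=[a-z])-(?=[a-z])')


def with_groups(s):
    return groups.sub(r'\1\2', s)


def with_lookarounds(s):
    return lookarounds.sub('', s)


if __name__ == '__main__':
    for fn in (with_groups, with_lookarounds):
        # Time 100,000 substitutions of the same short string.
        seconds = timeit.timeit(lambda: fn(text), number=100_000)
        print(f'{fn.__name__}: {fn(text)!r} in {seconds:.3f}s')
```

Both functions return the same string for this input; only the running time differs, and the ordering may change with other patterns or inputs.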
A pattern with two capture groups captures the letters before and after the hyphen and preserves them when stripping the hyphen. `\1` and `\2` denote the first and second captured groups, which are the letters in this case.
Your current code matches the two letters around the hyphen and removes the whole match. You should preserve the letters when substituting.
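As a minimal sketch of that difference, the snippet below runs the question's pattern without groups next to a version that captures the letters and writes them back. The variable names `broken` and `fixed` are only for illustration.

```python
import re

text = 'hyphen-ated Asia-Pacific 11-12'

# Without groups the letters belong to the match, so replacing the match
# with "" deletes "n-a" entirely, not just the hyphen.
broken = re.sub(r'[a-z]-[a-z]', '', text)

# Capture the letters and reinsert them with \1 and \2.
fixed = re.sub(r'([a-z])-([a-z])', r'\1\2', text)

print(broken)  # hypheted Asia-Pacific 11-12
print(fixed)   # hyphenated Asia-Pacific 11-12
```

The replacement is written as a raw string so that `\1` and `\2` reach `re.sub` as backreferences rather than as escape sequences.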
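TL;DR: use a lookahead for the letter after the hyphen, or use both a lookbehind and a lookahead, so that the hyphen's trailing context is not consumed by the match.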
The main complication of a contextual replacement ("find all hyphens surrounded by lower-case letters") is that the trailing context (the part following the pattern to match) must not be included in the match. If it is, it will not be able to participate in the next leading match.
An example would probably make that clearer.
The naive solution would be to match the lower-case letters around the hyphen, capturing them so that they can be reinserted into the result; that capturing is what differs from the call in the question. With the string from the question, the only substring matched by the pattern was `n-a` and it was correctly replaced with `na`.
But suppose we had two hyphens closer together, separated by a single letter, as in `g-a-t`.
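A small sketch of the naive solution on both kinds of input. The second string, `'zig-a-tag'`, is a made-up example chosen only because it contains `g-a-t`; it is not taken from the original answer.

```python
import re

naive = re.compile(r'([a-z])-([a-z])')

# The string from the question: only "n-a" matches, so the result is right.
question_result = naive.sub(r'\1\2', 'hyphen-ated Asia-Pacific 11-12')

# Hypothetical string with two hyphens one letter apart.
overlap_result = naive.sub(r'\1\2', 'zig-a-tag')

print(question_result)  # hyphenated Asia-Pacific 11-12
print(overlap_result)   # ziga-tag (the second hyphen survives)
```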
The `a` was part of the match `g-a` and the search resumed at the `-` following the `a`. So it never saw the pattern `a-t`, which would have matched.
To fix this problem, we can use a lookahead assertion for the trailing letter instead of capturing it.
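One way to write the lookahead version is sketched below: the leading letter is still captured, while the trailing letter is only asserted with `(?=[a-z])`. The test string `'zig-a-tag'` is again a hypothetical example with two hyphens one letter apart.

```python
import re

# Group 1 is the leading letter; the trailing letter is checked by the
# lookahead but is not consumed, so it can start the next match.
lookahead = re.compile(r'([a-z])-(?=[a-z])')

question_result = lookahead.sub(r'\1', 'hyphen-ated Asia-Pacific 11-12')
overlap_result = lookahead.sub(r'\1', 'zig-a-tag')

print(question_result)  # hyphenated Asia-Pacific 11-12
print(overlap_result)   # zigatag
```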
Now the trailing context (the lower-case letter following the hyphen) is not part of the match, and consequently we don't need to reinsert it in the replacement. That means that after matching `g-` with a trailing `a`, the search resumes starting at `a` and the next match will be `a-` with trailing `t`.
Python can also do "lookbehinds", in which a pattern only matches if another pattern precedes it. Using both a lookbehind and a lookahead, we can move the leading letter out of the match as well.
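A sketch of the combined form, using Python's `(?<=...)` lookbehind and `(?=...)` lookahead syntax. As before, `'zig-a-tag'` is a made-up string used only to exercise adjacent hyphens.

```python
import re

# Match a hyphen only when a lower-case letter sits on each side of it.
# Neither letter is consumed, so the replacement can be the empty string.
hyphen_between_lower = re.compile(r'(?<=[a-z])-(?=[a-z])')

question_result = hyphen_between_lower.sub('', 'hyphen-ated Asia-Pacific 11-12')
overlap_result = hyphen_between_lower.sub('', 'zig-a-tag')

print(question_result)  # hyphenated Asia-Pacific 11-12
print(overlap_result)   # zigatag
```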
This also produces the correct answer. Now we're just matching the hyphen, but insisting that it be preceded and followed by a lower-case letter. Since the match is just the hyphen, the replacement string can be empty.
Sometimes using a lookbehind like this speeds up the match. Sometimes it slows it down. It's always worth doing a benchmark with a particular pattern if speed matters to you. But the first task is to get the match right.