Can a regular expression match whitespace or the start of a string?
I'm trying to replace the currency abbreviation GBP with a £ symbol. I could just match anything starting with GBP, but I'd like to be a bit more conservative and look for certain delimiters around it.
>>> import re
>>> text = u'GBP 5 Off when you spend GBP75.00'
>>> re.sub(ur'GBP([\W\d])', ur'£\g<1>', text) # matches GBP with any prefix
u'\xa3 5 Off when you spend \xa375.00'
>>> re.sub(ur'^GBP([\W\d])', ur'£\g<1>', text) # matches at start only
u'\xa3 5 Off when you spend GBP75.00'
>>> re.sub(ur'(\W)GBP([\W\d])', ur'\g<1>£\g<2>', text) # matches whitespace prefix only
u'GBP 5 Off when you spend \xa375.00'
Can I do both of the latter examples at the same time?
This replaces GBP if it's preceded by the start of a string or a word boundary (which the start of a string already is), and after GBP comes a numeric value or a word boundary:
This removes the need for any unnecessary backreferencing by using a lookahead. Inclusive enough?
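A minimal Python 3 sketch of that idea, assuming the pattern is written as a word boundary followed by a lookahead (the names `pattern` and `result` are only illustrative):

```python
import re

text = 'GBP 5 Off when you spend GBP75.00'

# \b         word boundary before GBP (the start of the string counts)
# (?=\d|\b)  lookahead: next comes a digit or a word boundary;
#            it consumes nothing, so no backreference is needed
pattern = r'\bGBP(?=\d|\b)'

result = re.sub(pattern, '£', text)
print(result)
```

Because the lookahead is zero-width, the replacement is just the £ symbol and the character after GBP is left in place.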
I think you're looking for
'(^|\W)GBP([\W\d])'
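Applied to the sample text from the question, that pattern could be used roughly like this (a Python 3 sketch; the variable names are illustrative):

```python
import re

text = 'GBP 5 Off when you spend GBP75.00'

# Group 1: start of string OR a non-word character before GBP
# Group 2: the non-word character or digit that follows GBP
pattern = r'(^|\W)GBP([\W\d])'

# Put both captured delimiters back around the £ symbol
result = re.sub(pattern, r'\g<1>£\g<2>', text)
print(result)
```

Note that group 2 has to consume a character, so a GBP at the very end of the string is not replaced by this pattern.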
Yes, why not?
matches the start of the string, 0 or more whitespaces, then GBP...
edit: Oh, I think you want alternation; use the | operator:
You can always trim leading and trailing whitespace from the token before you search if it's not a matching/grouping situation that requires the full line.
Use the OR "|" operator:
A left-hand whitespace boundary - a position in the string that is either a string start or right after a whitespace character - can be expressed with
See a regex demo. Python 3 demo:
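One common way to write such a left-hand whitespace boundary is the negative lookbehind (?<!\S), which succeeds where the previous character is not a non-whitespace character, that is, at the string start or right after whitespace. A minimal Python 3 sketch, assuming that is the intended pattern:

```python
import re

text = 'GBP 5 Off when you spend GBP75.00'

# (?<!\S)   no non-whitespace character immediately to the left
# ([\W\d])  the delimiter or digit after GBP, captured as group 1
pattern = r'(?<!\S)GBP([\W\d])'

result = re.sub(pattern, r'£\1', text)
print(result)
```

Unlike (^|\W), the lookbehind consumes nothing, so only one group has to be restored in the replacement.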
Note you may use \1 instead of \g<1> in the replacement pattern, since an unambiguous backreference is only needed when it is followed by a digit.
BONUS: A right-hand whitespace boundary can be expressed with the following patterns:
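Two standard ways to write a right-hand whitespace boundary are the negative lookahead (?!\S) and the positive lookahead (?=\s|$). The sketch below assumes these are the patterns meant and checks that both find the same positions; `sample` and `positions` are illustrative names:

```python
import re

sample = 'GBP 5, GBP75.00 and GBP'

# Both lookaheads require whitespace or the end of the string after GBP
patterns = (r'GBP(?!\S)', r'GBP(?=\s|$)')

positions = {}
for pattern in patterns:
    # Record where each pattern matches in the sample
    positions[pattern] = [m.start() for m in re.finditer(pattern, sample)]
    print(pattern, positions[pattern])
```

The GBP in "GBP75.00" is skipped by both patterns because it is followed directly by a digit.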