Regular expression: match start or whitespace

Can a regular expression match whitespace or the start of a string?

I'm trying to replace the currency abbreviation GBP with a £ symbol. I could just match anything starting with GBP, but I'd like to be a bit more conservative and look for certain delimiters around it.

>>> import re
>>> text = u'GBP 5 Off when you spend GBP75.00'

>>> re.sub(ur'GBP([\W\d])', ur'£\g<1>', text) # matches GBP with any prefix
u'\xa3 5 Off when you spend \xa375.00'

>>> re.sub(ur'^GBP([\W\d])', ur'£\g<1>', text) # matches at start only
u'\xa3 5 Off when you spend GBP75.00'

>>> re.sub(ur'(\W)GBP([\W\d])', ur'\g<1>£\g<2>', text) # matches only after a non-word character such as whitespace
u'GBP 5 Off when you spend \xa375.00'

Can I do both of the latter examples at the same time?

Tags: python regex

8 Answers

混吃等死

#2 · 2020-01-31 00:32

This replaces GBP when it is preceded by a word boundary (which the start of the string already is here) and followed by either a digit or a word boundary:

re.sub(ur'\bGBP(?=\b|\d)', u'£', text)  # raw string, otherwise '\b' is a backspace character

This removes the need for any unnecessary backreferencing by using a lookahead. Inclusive enough?
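
To illustrate the word-boundary approach, here is a minimal Python 3 sketch. The sample text, including the `XGBP5` and `GBPX` cases, is made up for illustration and is not from the question:

```python
import re

# Hypothetical sample text; the last two tokens should be left alone.
text = 'GBP 5 Off when you spend GBP75.00, not XGBP5 or GBPX'

# \b requires a word boundary before GBP; the lookahead requires a word
# boundary or a digit after it, without consuming that character.
pattern = re.compile(r'\bGBP(?=\b|\d)')
result = pattern.sub('£', text)
print(result)
```

Note that `\b` also matches after punctuation such as `(` or `-`, so this is looser than a strict "start or whitespace" test.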

傲

#3 · 2020-01-31 00:32

I think you're looking for '(^|\W)GBP([\W\d])'

家丑人穷心不美

#4 · 2020-01-31 00:32

Yes, why not?

re.sub(u'^\W*GBP...

matches the start of the string, 0 or more non-word characters (\W, which includes whitespace), then GBP...

edit: Oh, I think you want alternation, use the |:

re.sub(u'(^|\W)GBP...

劫难

#5 · 2020-01-31 00:38

You can always trim leading and trailing whitespace from the token before you search if it's not a matching/grouping situation that requires the full line.
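
A rough Python 3 sketch of that idea, assuming you are handling one token at a time (the `token` value here is hypothetical):

```python
import re

# Hypothetical single token with stray whitespace around it.
token = '  GBP75.00 '

# After stripping, a plain ^ anchor is enough; no whitespace alternative needed.
cleaned = token.strip()
result = re.sub(r'^GBP(?=[\W\d])', '£', cleaned)
print(result)
```

This only helps when each token is processed separately; it will not find a GBP in the middle of a longer line.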

来，给爷笑一个

#6 · 2020-01-31 00:40

Use the OR "|" operator:

>>> re.sub(r'(^|\W)GBP([\W\d])', u'\g<1>£\g<2>', text)
u'\xa3 5 Off when you spend \xa375.00'
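
One thing to be aware of: `\W` matches any non-word character, not only whitespace. The Python 3 sketch below compares it with `\s` on a made-up string containing parentheses:

```python
import re

# Hypothetical sample text with GBP directly after "(".
text = 'GBP5 (GBP6) GBP7'

# \W accepts any non-word character before GBP, including "(".
non_word_prefix = re.sub(r'(^|\W)GBP([\W\d])', r'\g<1>£\g<2>', text)

# \s accepts only whitespace before GBP, so "(GBP6" is left alone.
whitespace_prefix = re.sub(r'(^|\s)GBP([\W\d])', r'\g<1>£\g<2>', text)

print(non_word_prefix)
print(whitespace_prefix)
```

Which of the two is right depends on how conservative the match should be.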

我只想做你的唯一

#7 · 2020-01-31 00:45

A left-hand whitespace boundary - a position in the string that is either a string start or right after a whitespace character - can be expressed with

(?<!\S)   # A negative lookbehind requiring no non-whitespace char immediately to the left of the current position
(?<=\s|^) # A positive lookbehind requiring a whitespace or start of string immediately to the left of the current position (Python's re module rejects this one because its lookbehinds must be fixed-width)
(?:\s|^)  # A non-capturing group matching either a whitespace or start of string
(\s|^)    # A capturing group matching either a whitespace or start of string

Python 3 demo:

import re
rx = r'(?<!\S)GBP([\W\d])'
text = 'GBP 5 Off when you spend GBP75.00'
print(re.sub(rx, r'£\1', text))
# => £ 5 Off when you spend £75.00

Note that you may use \1 instead of \g<1> in the replacement pattern: the unambiguous \g<1> form is only needed when the backreference is immediately followed by a digit.
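
A small Python 3 sketch of when the difference matters. The "append a literal 0" replacement is contrived and only there to put a digit right after the backreference:

```python
import re

text = 'GBP5'
rx = r'(?<!\S)GBP(\d)'

# No digit follows the backreference, so \1 and \g<1> are interchangeable.
short_form = re.sub(rx, r'£\1', text)
long_form = re.sub(rx, r'£\g<1>', text)

# A literal "0" follows the backreference: \g<1>0 means "group 1, then 0".
padded = re.sub(rx, r'£\g<1>0', text)

# \10 is read as a reference to group 10, which this pattern does not have.
try:
    re.sub(rx, r'£\10', text)
    ambiguous_ok = True
except re.error:
    ambiguous_ok = False

print(short_form, long_form, padded, ambiguous_ok)
```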

BONUS: A right-hand whitespace boundary can be expressed with the following patterns:

(?!\S)   # A negative lookahead requiring no non-whitespace char immediately to the right of the current position
(?=\s|$) # A positive lookahead requiring a whitespace or end of string immediately to the right of the current position
(?:\s|$) # A non-capturing group matching either a whitespace or end of string
(\s|$)   # A capturing group matching either a whitespace or end of string
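
As a sketch of the right-hand boundary in Python 3, assume a hypothetical text where the abbreviation follows the amount instead of preceding it:

```python
import re

# Hypothetical sample text with GBP written after the number.
text = 'Pay 5GBP now, 10GBPs later, or 20GBP'

# (?!\S) succeeds only when the next character is whitespace
# or the string ends, so "10GBPs" is not touched.
rx = r'(\d+)GBP(?!\S)'
result = re.sub(rx, r'£\1', text)
print(result)
```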
