Gruber’s URL Regular Expression in Python

Posted 2020-07-11 10:05

How do I rewrite this new way to recognise addresses to work in Python?

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

3 answers
不美不萌又怎样
#2 · 2020-07-11 10:32

I don't think Python supports this expression:

[:punct:]

According to Wikipedia, [:punct:] is equivalent to

[-!"#$%&'()*+,./:;<=>?@\[\\\]^_`{|}~]
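To see what "not supported" means in practice, here is a small sketch of how Python's re module reads a POSIX-style class. It assumes CPython's standard re module; the variable names are made up for the demo. The pattern compiles, but it is parsed as an ordinary set of the characters `[`, `:`, `p`, `u`, `n`, `c`, `t` followed by a literal `]`, so it does not match punctuation at all.

```python
import re
import warnings

# Some Python versions emit a FutureWarning ("Possible nested set") for this
# pattern, so warnings are silenced for the demo.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    posix_style = re.compile(r'[[:punct:]]')

# Read by Python as: one of '[', ':', 'p', 'u', 'n', 'c', 't', then a literal ']'.
matches_bang = posix_style.search('!') is not None
matches_t_bracket = posix_style.search('t]') is not None

print(matches_bang, matches_t_bracket)  # prints: False True
```

So the pattern fails silently rather than raising an error, which is why the class has to be spelled out by hand.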
不美不萌又怎样
#3 · 2020-07-11 10:42

The original source for that pattern states "This pattern should work in most modern regex implementations" and specifically mentions Perl. Python's regex implementation is modern and similar to Perl's, but it lacks the [:punct:] character class. You can easily build an equivalent like this:

>>> import string, re
>>> pat = r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^%s\s]|/)))'
>>> pat = pat % re.sub(r'([-\\\]])', r'\\\1', string.punctuation)

The re.sub() call backslash-escapes the characters that are special inside a character set (-, \ and ]).

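Edit: Using re.escape() in place of the re.sub() line works just as well. In older Python versions it sticks a backslash in front of every non-alphanumeric character; since Python 3.7 it escapes only characters that can be special in a regular expression. That felt crude to me at first, but it certainly works fine for this case.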

>>> pat = pat % re.escape(string.punctuation)
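Putting the pieces together, a minimal end-to-end sketch might look like the following. The sample text and the names `template`, `url_re`, `sample` and `urls` are made up for illustration; the pattern itself is the one from the question with `%s` standing in for `[:punct:]`.

```python
import re
import string

# Gruber's pattern with a %s placeholder where [:punct:] used to be.
template = r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^%s\s]|/)))'

# Fill the placeholder with an escaped copy of the ASCII punctuation characters.
url_re = re.compile(template % re.escape(string.punctuation))

# Hypothetical sample text, invented for this sketch.
sample = "See http://example.com/path, or www.example.org/wiki/Foo_(bar)."

# group(0) is the whole match; trailing punctuation is left out of the URL.
urls = [m.group(0) for m in url_re.finditer(sample)]
print(urls)
# ['http://example.com/path', 'www.example.org/wiki/Foo_(bar)']
```

Note that the trailing comma and full stop are excluded, while the balanced "(bar)" at the end of the second address is kept, which is the behaviour the pattern was designed for.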
This account has been banned
#4 · 2020-07-11 10:43

Python's re module doesn't support POSIX bracket expressions.

The [:punct:] bracket expression is equivalent in ASCII to

[!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~] 
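As a quick sanity check, the explicit class above can be compared against `string.punctuation` from the standard library. This is only a sketch; `punct_class` and `matched` are names chosen for the demo.

```python
import re
import string

# The explicit ASCII class quoted above, written as a raw string.
punct_class = re.compile(r'''[!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]''')

# Collect every printable ASCII character (codes 32..126) that the class accepts.
matched = ''.join(chr(c) for c in range(32, 127) if punct_class.fullmatch(chr(c)))

print(matched == string.punctuation)  # prints: True
```

The class accepts exactly the characters in `string.punctuation`, so either spelling can be used when building the pattern.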