Extracting URL links using regular expressions (re)

Posted 2019-05-27 02:40

Question:

I've been trying to extract URLs from a text file using the re module: any link that starts with http://, https://, or www.

The file contains plain text as well as HTML source code. The HTML part is easy because I can extract the links using BeautifulSoup, but plain text seems to be more challenging. I found the pattern below online, and it seems to be the best implementation of URL extraction. However, it fails around certain tags: it can't handle them and includes them in the URL. Any help is appreciated, because I'm not familiar with string matching myself.

Here is the code:

sp1=re.findall("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", str(STRING))
sp2=re.findall('www.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(STRING))

Examples of what it returns:

http://www.website.com/science/</span></a><o:p></o:p></span></div><div
www.website.com/library/</span></a></span></i><span
http://awebsite.com/Groups</a><div>

Answer 1:

re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', str(STRING))

The [^\s<>"]+ part matches one or more characters that are not whitespace, a double quote, or an angle bracket, so the match stops before any trailing markup in strings like:

<a href="http://www.example.com/stuff">
http://www.example.com/stuff</br>
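
As a rough check, the sketch below runs the pattern above against the examples from the question. The names URL_PATTERN, extract_urls, and sample are made up for illustration and are not part of the original post. It also shows a likely reason the original pattern swallowed tags: inside the character class [$-_@.&+], the sequence $-_ is a range from "$" to "_", and that range happens to include "<", ">", and "/".

```python
import re

# Pattern from the answer: a URL runs until whitespace, an angle bracket,
# or a double quote is reached.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+|www\.[^\s<>"]+')


def extract_urls(text):
    """Return every substring of text that the pattern treats as a URL."""
    return URL_PATTERN.findall(text)


# Sample input assembled from the strings quoted in the question and answer.
sample = (
    'http://www.website.com/science/</span></a><o:p></o:p></span></div><div\n'
    'www.website.com/library/</span></a></span></i><span\n'
    'http://awebsite.com/Groups</a><div>\n'
    '<a href="http://www.example.com/stuff">\n'
)

urls = extract_urls(sample)
for url in urls:
    print(url)

# In the original pattern, "$-_" is a character range, so the class
# also accepts "<" and ">", which lets tags leak into the match.
original_class = re.compile(r'[$-_@.&+]')
print(bool(original_class.fullmatch('<')))  # True
```

Note that this pattern still keeps trailing punctuation such as a period or comma that directly follows a URL in running text, so some post-processing (for example url.rstrip('.,')) may be needed depending on the input.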
