Using Python I want to replace all URLs in a body of text with links to those URLs, like what Gmail does. Can this be done in a one liner regular expression?
Edit: by body of text I just meant plain text - no HTML
Using Python I want to replace all URLs in a body of text with links to those URLs, like what Gmail does. Can this be done in a one liner regular expression?
Edit: by body of text I just meant plain text - no HTML
You can load the document up with a DOM/HTML parsing library ( see html5lib ), grab all text nodes, match them against a regular expression and replace the text nodes with a regex replacement of the URI with anchors around it using a PCRE such as:
I'm quite sure you can scourge through and find some sort of utility that does this, I can't think of any off the top of my head though.
Edit: Try using the answers here: How do I get python-markdown to additionally "urlify" links when formatting plain text?
call urlify2 on a string and I think that's it if you aren't dealing with a DOM object.
I hunted around a lot, tried these solutions and was not happy with their readability or features, so I rolled the following:
You'll need to get a link out image, but that's pretty easy. This is specifically for user submitted text like comments which I assume is usually what people are dealing with.
Gmail is a lot more open, when it comes to URLs, but it is not always right either. e.g. it will make www.a.b into a hyperlink as well as http://a.b but it often fails because of wrapped text and uncommon (but valid) URL characters.
See appendix A. A. Collected BNF for URI for syntax, and use that to build a reasonable regular expression that will consider what surrounds the URL as well. You'd be well advised to consider a couple of scenarios where URLs might end up.
When you say "body of text" do you mean a plain text file, or body text in an HTML document? If you want the HTML document, you will want to use Beautiful Soup to parse it; then, search through the body text and insert the tags.
Matching the actual URLs is probably best done with the urlparse module. Full discussion here: How do you validate a URL with a regular expression in Python?