Although I know I could use some hugeass regex such as the one posted here, I'm wondering if there is some tweaky as hell way to do this, either with a standard module or perhaps some third-party add-on?
Simple question, but nothing jumped out on Google (or Stack Overflow).
Look forward to seeing how y'all do this!
There is another easy way to extract URLs from text. You can use urlextract to do it for you: just install it via pip (pip install urlextract),
and then create a URLExtract instance and call its find_urls method on your text.
You can find more info on my GitHub page: https://github.com/lipoja/URLExtract
NOTE: It downloads the list of TLDs from iana.org to keep you up to date, so if the program does not have internet access, it's not for you.
This approach is similar to the one in urlextractor (mentioned in another answer), but my code is recent and maintained, and I am open to any suggestions (new features).
You can use this library I wrote:
https://github.com/imranghory/urlextractor
It's extremely hacky, but it doesn't rely upon "http://" like many other techniques; rather, it uses the Mozilla TLD list (via the tldextract library) to search for TLDs (e.g. ".co.uk", ".com", etc.) in the text and then attempts to construct URLs around the TLD.
It doesn't aim to be RFC compliant but rather accurate for how URLs are used in practice in the real world. So, for example, it will reject the technically valid domain "com" (you can actually use a TLD as a domain, although it's rare in practice) and will strip trailing full stops or commas from URLs.
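The TLD-anchored idea described above can be sketched roughly as follows. This is not urlextractor's actual code: the tiny KNOWN_SUFFIXES tuple is a hypothetical stand-in for the Mozilla TLD list, and find_candidate_urls is a made-up helper name used only for illustration.

```python
# Hypothetical stand-in for the Mozilla TLD list; a real tool would load the full list.
KNOWN_SUFFIXES = (".co.uk", ".com", ".org", ".net")


def find_candidate_urls(text):
    """Return whitespace-separated tokens whose host part ends in a known suffix."""
    found = []
    for token in text.split():
        # Strip trailing full stops and commas, as the answer describes.
        candidate = token.rstrip(".,")
        # Ignore an optional scheme, then keep only the host (before any path).
        without_scheme = candidate.split("://", 1)[-1]
        host = without_scheme.split("/", 1)[0].lower()
        for suffix in KNOWN_SUFFIXES:
            # Require a label before the suffix, so a bare "com" or ".com" is rejected.
            if host.endswith(suffix) and len(host) > len(suffix):
                found.append(candidate)
                break
    return found
```

Because it only looks at suffixes, a sketch like this would also pick up things such as email addresses, which is part of why the real library does more work around each match.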
You can use BeautifulSoup.
Note that the solution with regexes is faster, although it will not be as accurate.
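For HTML input, the parser-based idea looks roughly like this. To keep the sketch runnable without extra installs, it deliberately uses the standard library's html.parser rather than BeautifulSoup; LinkCollector and extract_links are hypothetical names. It only sees URLs in the href attribute of a tags, not bare URLs in running text.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all href values found in <a> tags, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```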
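If you know that the URL in the string is followed by a space, you can do something like slicing it out with str.find: find where "http://" starts, find the next space after that position, and take the slice between the two,
where s is the string containing the URL.
Otherwise, you need to check whether find returns -1 or not.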
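A minimal sketch of that find-based slicing, including the -1 checks. The function name url_before_space and the default "http://" prefix are assumptions made for illustration.

```python
def url_before_space(s, prefix="http://"):
    """Return the URL starting at `prefix` and ending at the next space, or None."""
    start = s.find(prefix)
    if start == -1:
        return None  # no URL with this prefix in the string
    end = s.find(" ", start)
    if end == -1:
        return s[start:]  # no space after the URL: it runs to the end of the string
    return s[start:end]
```

This only returns the first match and only handles one scheme prefix per call, so it suits quick one-off parsing rather than general extraction.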
I'm late to the party, but here is a solution someone from #python on freenode suggested to me. It avoids the regex hassle.
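One regex-free approach along these lines, as a minimal sketch: it assumes the idea is to split the text on whitespace and keep the tokens that urllib.parse.urlparse reports as having both a scheme and a host. The function name extract_urls is hypothetical.

```python
from urllib.parse import urlparse


def extract_urls(text):
    """Return whitespace-separated tokens that parse with both a scheme and a host."""
    urls = []
    for token in text.split():
        try:
            parsed = urlparse(token)
        except ValueError:
            # urlparse can raise on malformed input such as a broken IPv6 host.
            continue
        if parsed.scheme and parsed.netloc:
            urls.append(token)
    return urls
```

Note that this misses scheme-less URLs like example.com and keeps any trailing punctuation attached to a token.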
I know that it's exactly what you do not want, but here's a file with a huge regex (cf. http://daringfireball.net/2010/07/improved_regex_for_matching_urls).
I call that file urlmarker.py, and when I need it I just import it.
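Using a pattern kept in its own module amounts to compiling it once and calling findall on the text. A minimal sketch with a deliberately simplified stand-in pattern: SIMPLE_URL_REGEX and find_urls are hypothetical names, and the pattern is not the large regex from the linked post.

```python
import re

# Deliberately simplified pattern, NOT the large regex from the linked post:
# http or https, "://", then everything up to whitespace or a common delimiter.
SIMPLE_URL_REGEX = re.compile(r"""https?://[^\s<>"']+""")


def find_urls(text):
    """Return every substring of text that matches the simplified URL pattern."""
    return SIMPLE_URL_REGEX.findall(text)
```

A pattern this short will happily include trailing punctuation and ignores scheme-less URLs, which is exactly what the much longer regex is designed to handle.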
Also, here is what Django (1.6) uses to validate URLFields (cf. https://github.com/django/django/blob/1.6/django/core/validators.py#L43-50).
Django 1.9 has that logic split across a few classes.