Regarding: Find Hyperlinks in Text using Python (twitter related)
How can I extract just the URL so I can put it into a list/array?
Edit
Let me clarify, I don't want to parse the URL into pieces. I want to extract the URL from the text of the string to put it into an array. Thanks!
Don't forget to check whether the search returns None. I found the posts above helpful but wasted time dealing with a None result. See Python Regex "object has no attribute".
i.e.
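A minimal sketch of that check. The pattern and the helper name `first_url` are assumptions made for illustration and are not taken from the posts above. `re.search` returns `None` when nothing matches, so calling `.group()` on the result without checking raises `AttributeError`.

```python
import re

# Illustrative pattern (an assumption): http:// or https:// followed by
# any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def first_url(text):
    """Return the first URL found in text, or None if there is none."""
    match = URL_PATTERN.search(text)
    if match is None:
        # No match: bail out instead of calling .group() on None.
        return None
    return match.group(0)


print(first_url("check it out http://example.com/blah"))
print(first_url("no links here"))
```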
You can use the following monstrous regex:
Demo regex101
This regex will accept URLs in the following formats:
INPUT:
OUTPUT:
Explanations:
\b is used as a word boundary to delimit the URL from the rest of the text.
(?:https?://)? to match http:// or https:// if present.
(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6}) to match a standard URL that might start with www. (let's call it STANDARD_URL).
(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) to match a standard IPv4 address (let's call it IPv4).
(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])) to match an IPv6 address (let's call it IPv6).
(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5]) to match the port (let's call it PORT), if present.
(?:/[\w\.-]*)*/?) to match the target object part of the URL (HTML file, JPG, ...) (let's call it RESSOURCE_PATH).
This gives the following regex:
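As a rough illustration of how the named pieces fit together, here is a reduced sketch that assembles them with string concatenation. It is not the full regex from this answer. The IPv6 branch is left out for brevity. The port alternatives are wrapped in one group so the leading colon applies to all of them, which is an assumption about the intent. The constant names and the sample text are made up for the example.

```python
import re

# Pieces taken from the explanation above; the constant names are hypothetical.
SCHEME = r"(?:https?://)?"
STANDARD_URL = r"(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})"
IPV4 = (r"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
        r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)")
# Assumption: group the alternatives so ':' prefixes every port form.
PORT = (r":(?:[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}"
        r"|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])")
RESOURCE_PATH = r"(?:/[\w\.-]*)*/?"

# Reduced assembly: scheme, then host (name or IPv4), optional port, path.
URL_RE = re.compile(
    r"\b" + SCHEME
    + "(?:" + STANDARD_URL + "|" + IPV4 + ")"
    + "(?:" + PORT + ")?"
    + RESOURCE_PATH + r"\b"
)

text = "Visit https://www.example.com/index.html or 192.168.0.1:8080/status today"
found = URL_RE.findall(text)
print(found)
```

Because every group is non-capturing, `findall` returns whole matches as a list of strings.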
Sources:
OUTPUT:
Misunderstood question:
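For reference, splitting a URL into its components (what this answer first assumed was wanted) might look like the following in Python 3, using the standard library's `urllib.parse.urlparse`. The example URL is made up.

```python
from urllib.parse import urlparse

# Split a URL into scheme, host, path and query string.
parts = urlparse("http://www.example.com/test?t")
print(parts.scheme, parts.netloc, parts.path, parts.query)
```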
or py2.* version:
ETA: regexes are indeed the best option here:
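A small sketch of the regex route, assuming URLs always start with http:// or https:// and run up to the next whitespace. The pattern and the sample tweet are illustrative. `re.findall` returns every match as a list, which is the list/array the question asks for.

```python
import re

tweet = ("This is my tweet, check it out http://example.com/blah "
         "and https://example.org/page")

# Every run of non-whitespace that starts with http:// or https://.
urls = re.findall(r"https?://\S+", tweet)
print(urls)
```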
[note: Assuming you are using this on Twitter data (as indicated in the question), the simplest way of doing this is to use their API, which returns the URLs extracted from tweets as a field]
If you want to extract URLs from any text you can use my urlextract. It finds URLs based on TLDs found in the text: it expands to both sides from the TLD position and gets the whole URL. It's easy to use. Check it out: https://github.com/lipoja/URLExtract
In response to the OP's edit I hijacked Find Hyperlinks in Text using Python (twitter related) and came up with this:
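A sketch along those lines, with a named group and a `None` check. The exact pattern, the helper name `extract_urls` and the sample string are assumptions made for illustration.

```python
import re

# Named group "url": http:// or https:// followed by non-whitespace.
URL_RE = re.compile(r"(?P<url>https?://[^\s]+)")


def extract_urls(text):
    """Collect every URL in text into a list (empty if none are found)."""
    return [m.group("url") for m in URL_RE.finditer(text)]


my_string = "This is my tweet check it out http://example.com/blah"

match = URL_RE.search(my_string)
if match is not None:  # search() returns None when there is no URL
    print(match.group("url"))

print(extract_urls(my_string))
```

Note that this simple pattern keeps any punctuation that directly follows a URL, so a trailing comma or full stop would need to be stripped separately.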