How can I use Perl regexps to extract all URLs of a specific domain (with possibly variable subdomains) with a specific extension from plain text? I have tried:
    my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';
    while($stuff =~ m/(http\:\/\/.*?homepage.com\/.*?\.gif)/gmsi)
    {
        print $1."\n";
    }
It fails horribly and gives me:
http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif
http://shomepage.com/woot.gif
I thought that wouldn't happen because I am using .*?, which ought to be non-greedy and give me the smallest match. Can anyone tell me what I am doing wrong? (I don't want some uber-complex, canned regexp to validate URLs; I want to know what I am doing wrong so I can learn from it.)
This regex worked for me
Here is a regex to (hopefully) get/extract/obtain all URLs from a string or text file; it seems to be working for me:
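Something along these lines (a minimal reconstruction, so the exact pattern may differ; with a non-greedy body, a lookahead on the plain http: literal already stops cleanly before an adjacent URL, so this sketch skips the ttp: shift discussed below):

    my @urls = $string =~ m{(https?://.+?)(?=\s|"|\)|http:|\z)}g;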
... or in an example:
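For instance, against the question's sample string (with two URLs deliberately glued together to show the lookahead at work):

    my $stuff = 'omg http://fail-o-tron.com/bleh "http://homepage.com/woot.gifhttp://shomepage.com/woot.gif"';
    while ($stuff =~ m{(https?://.+?)(?=\s|"|\)|http:|\z)}g) {
        print "$1\n";
    }
    # prints:
    #   http://fail-o-tron.com/bleh
    #   http://homepage.com/woot.gif
    #   http://shomepage.com/woot.gif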
For my noob reference, here is the debug version of the same command above:
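One way to get such a trace (an assumption about what the original snippet did) is Perl's re pragma, which dumps the compiled pattern and every step of the match to STDERR:

    use re 'debug';    # print the compiled program and trace each match attempt
    my @urls = $stuff =~ m{(https?://.+?)(?=\s|"|\)|http:|\z)}g;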
The regex matches on http(s):// and uses whitespace, " and ) as "exit" characters; it then uses a positive lookahead to, initially, cause an "exit" on the http literal (if a match is already in progress); however, since that also "eats" the last character of the previous match, the lookahead is moved one character forward, to ttp:.

Some useful pages (on $&, @-, ...).

Hope this helps someone,
Cheers!
EDIT: Oops, just found out about URI::Find::Simple on search.cpan.org, which seems to do the same thing (via regex - Getting the website title from a link in a string).
URI::Find is specifically designed to solve this problem. It will find all URIs and then you can filter them. It has a few heuristics to handle things like trailing punctuation.
UPDATE: Recently updated to handle Unicode.
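A sketch of how it might be applied to the question's filtering (the homepage.com / .gif test is the question's; the callback and scalar-ref interface are URI::Find's):

    use strict;
    use warnings;
    use URI::Find;

    my $stuff = 'omg http://fail-o-tron.com/bleh omg http://homepage.com/woot.gif dfgdfg http://sub.homepage.com/woot.gif aaa';

    my $finder = URI::Find->new(sub {
        my ($uri, $original_text) = @_;
        # Keep only homepage.com (any subdomain) URLs ending in .gif
        print "$uri\n"
            if $uri->can('host')
            and $uri->host =~ /(?:^|\.)homepage\.com\z/i
            and $uri->path =~ /\.gif\z/i;
        return $original_text;    # leave the text unchanged
    });
    $finder->find(\$stuff);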
Visit CPAN: Regexp::Common::URI
Edit: Even if you don't want a canned regular expression, it may help you to look at the source of a tested module that works.
If you want to find URLs that match a certain string, you can easily use this module to do that.
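For example, a sketch of that filtering ($RE{URI}{HTTP} and its -scheme and -keep flags are the module's documented interface; the domain/extension filter is an assumption matching the question):

    use strict;
    use warnings;
    use Regexp::Common qw(URI);

    my $stuff = 'omg http://fail-o-tron.com/bleh omg http://homepage.com/woot.gif aaa';

    # -scheme widens the default http-only pattern to https as well;
    # -keep makes the whole matched URI available as $1.
    my $http_re = $RE{URI}{HTTP}{-scheme => 'https?'}{-keep};

    while ($stuff =~ /$http_re/g) {
        my $url = $1;
        print "$url\n"
            if $url =~ m{^https?://(?:[\w-]+\.)*homepage\.com/.*\.gif\z}i;
    }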
I have used the following code to extract links that end with a specific extension, like *.htm, *.html, *.gif, *.jpeg. Note: in this script the extension *.html is written before *.htm, because both have "htm" in common and the first alternative that matches wins. These kinds of changes should be made carefully.
Input: the name of a file containing links, and the name of an output file where results will be saved.
Output: saved to the output file.
Code goes here:
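A sketch of such a script (a reconstruction: the file handling and pattern details are assumptions; the alternation order is the point the note above makes):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Usage: perl extract_links.pl input.txt output.txt   (names are illustrative)
    my ($input_file, $output_file) = @ARGV;
    die "Usage: $0 <input file> <output file>\n"
        unless defined $input_file and defined $output_file;

    open my $in,  '<', $input_file  or die "Cannot open $input_file: $!";
    open my $out, '>', $output_file or die "Cannot open $output_file: $!";

    while (my $line = <$in>) {
        # html must precede htm in the alternation: both share "htm", and the
        # first alternative that matches wins, which would truncate ".html".
        while ($line =~ m{(https?://\S+?\.(?:html|htm|gif|jpeg))}gi) {
            print {$out} "$1\n";
        }
    }

    close $in;
    close $out;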
Run against your string, the sketch above would print:
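    http://homepage.com/woot.gif
    http://shomepage.com/woot.gif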
URLs aren't allowed to contain spaces, so instead of .*? you should use \S*?, for zero-or-more non-space characters.
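Applied to the question's code, that single change (with the dots escaped for good measure) is enough:

    while ($stuff =~ m{(http://\S*?homepage\.com/\S*?\.gif)}gi) {
        print $1 . "\n";
    }
    # \S can't cross a space, so fail-o-tron's URL no longer bleeds into the match;
    # note that \S*? still lets prefixes like the "s" in shomepage.com through.
    # prints:
    #   http://homepage.com/woot.gif
    #   http://shomepage.com/woot.gif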