Using regex to extract URLs from plain text with Perl

Published 2019-01-23 20:59

Question:

How can I use Perl regexps to extract all URLs of a specific domain (with possibly variable subdomains) with a specific extension from plain text? I have tried:

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';
while($stuff =~ m/(http\:\/\/.*?homepage.com\/.*?\.gif)/gmsi)
{
print $1."\n";
}

It fails horribly and gives me:

http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif
http://shomepage.com/woot.gif

I thought that wouldn't happen because I am using .*?, which ought to be non-greedy and give me the smallest match. Can anyone tell me what I am doing wrong? (I don't want some uber-complex, canned regexp to validate URLs; I want to know what I am doing wrong so I can learn from it.)

Answer 1:

URI::Find is specifically designed to solve this problem. It will find all URIs and then you can filter them. It has a few heuristics to handle things like trailing punctuation.
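A minimal sketch of that approach (assuming URI::Find is installed; the hostname and extension checks are illustrative, matching the question's example):

```perl
use strict;
use warnings;
use URI::Find;

my $stuff = 'omg http://fail-o-tron.com/bleh omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';

my @found;
# The callback receives a URI object and the original text it was found as;
# returning the original text leaves the input string unchanged.
my $finder = URI::Find->new( sub {
    my ( $uri, $orig_text ) = @_;
    push @found, "$uri" if $uri->host =~ /homepage\.com$/ && $uri->path =~ /\.gif$/;
    return $orig_text;
} );
$finder->find( \$stuff );

print "$_\n" for @found;
```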

UPDATE: Recently updated to handle Unicode.



Answer 2:

Visit CPAN: Regexp::Common::URI

Edit: Even if you don't want a canned regular expression, it may help you to look at the source of a tested module that works.

If you want to find URLs that match a certain string, you can easily use this module to do that.

#!/usr/bin/env perl
use strict;
use warnings;
use Regexp::Common qw/URI/;

while (<>) {
  if (m/$RE{URI}{HTTP}{-keep}/) {
    print $_ if $1 =~ m/what-you-want/;
  }
}
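Applied to the question's data, the filter might look like this (a sketch, assuming Regexp::Common is installed; with -keep, $1 holds the entire matched URI):

```perl
use strict;
use warnings;
use Regexp::Common qw/URI/;

my $stuff = 'omg http://fail-o-tron.com/bleh omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';

my @urls;
while ( $stuff =~ m/$RE{URI}{HTTP}{-keep}/g ) {
    # $1 is the whole URI; keep only *homepage.com URLs ending in .gif
    push @urls, $1 if $1 =~ m/homepage\.com.*\.gif$/;
}
print "$_\n" for @urls;
```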


Answer 3:

I have used the following code to extract links that end with a specific extension,
such as *.htm, *.html, *.gif, or *.jpeg. Note: the pattern effectively tries *.html before *.htm (via html?) because both share "htm"; changes of this kind should be made carefully.

Input: a file containing the links, and an output file name where the results will be saved.
Output: the matching links, written to the output file.

Code goes here:

use strict;
use warnings;

if ( @ARGV != 2 ) {
    die "Incorrect number of arguments.\nArguments: Text_LinkFile, Output_File\n";
}
open my $fh_links,  '<', $ARGV[0] or die $!;
open my $fh_result, '>', $ARGV[1] or die $!;

my @Links;
while (<$fh_links>) {
    # The pattern has three capture groups, so matches arrive in threes;
    # keep only the first group (the full URL) of each triple.
    my @Matches = m/((https?|ftp):\/\/[^\s]+\.(html?|gif|jpe?g))/g;
    for ( my $i = 0 ; $i < @Matches ; $i += 3 ) {
        push @Links, $Matches[$i];
    }
}
print $fh_result join( "\n", @Links );

Output for your example string:

http://homepage.com/woot.gif
http://shomepage.com/woot.gif


Answer 4:

URLs aren't allowed to contain spaces, so instead of .*? you should use \S*?, for zero-or-more non-space characters.
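Applying that fix to the question's code (a sketch; the dots in the hostname are also escaped here, since an unescaped . matches any character):

```perl
use strict;
use warnings;

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';

my @urls;
# \S*? cannot cross a space, so a match can no longer start at fail-o-tron.com
while ( $stuff =~ m{(http://\S*?homepage\.com/\S*?\.gif)}gi ) {
    push @urls, $1;
}
print "$_\n" for @urls;
```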



Answer 5:

This regex worked for me:

https?\:\/\/[^\s]+[\/\w]
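For example (a sketch; the trailing [\/\w] makes the match end on a slash or word character, so punctuation stuck to the URL, such as a comma or a closing parenthesis, is dropped):

```perl
use strict;
use warnings;

my $text = 'see http://example.com/page, docs at (https://foo.org/a.gif) end';
my @urls = $text =~ m{(https?://[^\s]+[/\w])}g;
print "$_\n" for @urls;
```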



Answer 6:

I thought that wouldn't happen because I am using .*?, which ought to be non-greedy and give me the smallest match

It does, but it gives you the smallest match going right. Starting from the first http and going right, that's the smallest match.
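A small demonstration of that behaviour, using the question's data: the engine anchors at the leftmost possible starting point, and .*? then expands rightward only as far as needed, happily crossing spaces along the way.

```perl
use strict;
use warnings;

my $stuff = 'omg http://fail-o-tron.com/bleh omg http://homepage.com/woot.gif';

if ( $stuff =~ m{(http://.*?homepage\.com/.*?\.gif)}i ) {
    print "$1\n";    # the whole span from the first http:// through .gif
}
```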

Note for the future: you don't have to escape the slashes, because you don't have to use slashes as your delimiter, and you never have to escape the colon. Next time just write:

m|(http://.*?homepage.com/.*?\.gif)|

or

m#(http://.*?homepage.com/.*?\.gif)#

or

m<(http://.*?homepage.com/.*?\.gif)>

or one of lots of other characters, see the perlre documentation.



Answer 7:

Here is a regex to (hopefully) extract all URLs from a string or text file; it seems to be working for me:

m,(http.*?://([^\s)\"](?!ttp:))+),g

... or in an example:

$ echo -e "\n\na blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah \"https://poi.com/a%20b\"; (http://bbb.comhttp://roch.com/abc) \n" | perl -ne 'while ( my $string = <> ) { print "$string\n"; while ( $string =~ m,(http.*?://([^\s)\"](?!ttp:))+),g ) {print "$&\n"} }'


a blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah "https://poi.com/a%20b"; (http://bbb.comhttp://roch.com/abc) 

http://www.abc.com/dss.htm?a=1&p=2#chk
https://poi.com/a%20b
http://bbb.com
http://roch.com/abc

For my noob reference, here is the debug version of the same command above:

$ echo -e "\n\na blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah \"https://poi.com/a%20b\"; (http://bbb.comhttp://roch.com/abc) \n" | perl -dne 'use re "debug" ; while ( my $string = <> ) { print "$string\n"; while ( $string =~ m,(http.*?://([^\s)\"](?!ttp:))+),g ) {print "$&\n"} }'

The regex matches on http(s):// and uses whitespace, " and ) as "exit" characters; it then uses a negative lookahead to force an "exit" on a literal http group (if a match is already in progress). However, since that would also "eat" the last character of the previous match, the lookahead is instead moved one character forward, to "ttp:".

Some useful pages:

  • perl: multiple matches on a single line? (edited for proper < > format)
  • regular expression negate a word (not character)
  • Perl Regular Expressions
  • Perl Text Patterns for Search and Replace (intro, $&, @- ... )

Hope this helps someone,
Cheers!

EDIT: Oops, just found out about URI::Find::Simple - search.cpan.org, which seems to do the same thing (via regex - Getting the website title from a link in a string)



Tags: regex perl url