Using regex to extract URLs from plain text with Perl

Posted 2019-01-23 20:40

How can I use Perl regexps to extract all URLs of a specific domain (with possibly variable subdomains) with a specific extension from plain text? I have tried:

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';
while($stuff =~ m/(http\:\/\/.*?homepage.com\/.*?\.gif)/gmsi)
{
print $1."\n";
}

It fails horribly and gives me:

http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif
http://shomepage.com/woot.gif

I thought that wouldn't happen because I am using .*?, which ought to be non-greedy and give me the smallest match. Can anyone tell me what I am doing wrong? (I don't want some uber-complex, canned regexp to validate URLs; I want to know what I am doing wrong so I can learn from it.)

Tags: regex perl url
7 answers
姐就是有狂的资本
Answer 2 · 2019-01-23 20:53
https?\:\/\/[^\s]+[\/\w]

This regex worked for me
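A minimal sketch of that pattern applied to the sample string from the question (the variable name and text are copied from there). Note that it grabs every URL, so the domain/extension filtering still has to happen afterwards:

```perl
use strict;
use warnings;

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg '
          . 'http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';

# \S+ is greedy and stops at whitespace; the trailing [/\w] forces the
# match to end on a slash or word character, trimming trailing punctuation.
while ( $stuff =~ m{(https?://\S+[/\w])}g ) {
    print "$1\n";    # prints all three URLs, including fail-o-tron.com/bleh
}
```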

Summer. ? 凉城
Answer 3 · 2019-01-23 20:53

Here is a regex that (hopefully) extracts all URLs from a string or text file; it seems to be working for me:

m,(http.*?://([^\s)\"](?!ttp:))+),g

... or in an example:

$ echo -e "\n\na blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah \"https://poi.com/a%20b\"; (http://bbb.comhttp://roch.com/abc) \n" | perl -ne 'while ( my $string = <> ) { print "$string\n"; while ( $string =~ m,(http.*?://([^\s)\"](?!ttp:))+),g ) {print "$&\n"} }'


a blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah "https://poi.com/a%20b"; (http://bbb.comhttp://roch.com/abc) 

http://www.abc.com/dss.htm?a=1&p=2#chk
https://poi.com/a%20b
http://bbb.com
http://roch.com/abc

For my noob reference, here is the debug version of the same command above:

$ echo -e "\n\na blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah \"https://poi.com/a%20b\"; (http://bbb.comhttp://roch.com/abc) \n" | perl -dne 'use re "debug" ; while ( my $string = <> ) { print "$string\n"; while ( $string =~ m,(http.*?://([^\s)\"](?!ttp:))+),g ) {print "$&\n"} }'

The regex matches on http(s):// and treats whitespace, `"`, and `)` as "exit" characters. It then uses a positive lookahead to end a match in progress when another literal `http` begins; since matching the full `http` would also consume the last character of the previous match, the lookahead is shifted one character forward to `ttp:`.

Hope this helps someone,
Cheers!

EDIT: Oops, I just found URI::Find::Simple on search.cpan.org, which seems to do the same thing (via regex - Getting the website title from a link in a string).

beautiful°
Answer 4 · 2019-01-23 20:55

URI::Find is specifically designed to solve this problem. It will find all URIs and then you can filter them. It has a few heuristics to handle things like trailing punctuation.

UPDATE: Recently updated to handle Unicode.
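For example, a minimal sketch of the collect-then-filter approach (URI::Find's callback API is real; the filtering step and sample string, taken from the question, are my own additions):

```perl
use strict;
use warnings;
use URI::Find;

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg '
          . 'http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';

my @found;
my $finder = URI::Find->new(sub {
    my ( $uri, $orig_text ) = @_;   # callback gets a URI object and the original text
    push @found, $uri;
    return $orig_text;              # leave the scanned text unchanged
});
$finder->find(\$stuff);

# Filter afterwards for the domain/extension the question asks about.
print "$_\n" for grep { m{homepage\.com/.*\.gif$} } @found;
```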

我命由我不由天
Answer 5 · 2019-01-23 21:00

Visit CPAN: Regexp::Common::URI

Edit: Even if you don't want a canned regular expression, it may help you to look at the source of a tested module that works.

If you want to find URLs that match a certain string, you can easily use this module to do that.

#!/usr/bin/env perl
use strict;
use warnings;
use Regexp::Common qw/URI/;

while (<>) {
  if (m/$RE{URI}{HTTP}{-keep}/) {
    print $_ if $1 =~ m/what-you-want/;
  }
}
Deceive 欺骗
Answer 6 · 2019-01-23 21:00

I have used the following code to extract links that end with a specific extension,
like *.htm, *.html, *.gif, or *.jpeg. Note: in this script the extension *.html is listed before *.htm because both contain "htm"; changes of this kind should be made carefully.

Input: a file containing the links, and an output file name where the results will be saved.
Output: saved to the output file.

Code goes here:

use strict;
use warnings;

if ( @ARGV != 2 ) {
    print "Incorrect number of arguments.\nArguments: Text_LinkFile, Output_File\n";
    exit 1;
}
open my $links_fh,  '<', $ARGV[0] or die $!;
open my $result_fh, '>', $ARGV[1] or die $!;

my @links;
while ( my $line = <$links_fh> ) {
    # The pattern has three capture groups, so m//g returns matches in
    # triples; the full URL is the first element of each triple.
    my @matches = ( $line =~ m{((https?|ftp)://\S+\.(html?|gif|jpe?g))}g );
    for ( my $i = 0 ; $i < @matches ; $i += 3 ) {
        push @links, $matches[$i];
    }
}
print {$result_fh} join( "\n", @links );

Output of your string is here:

http://homepage.com/woot.gif
http://shomepage.com/woot.gif
Root(大扎)
Answer 7 · 2019-01-23 21:13

URLs aren't allowed to contain spaces, so instead of .*? you should use \S*?, for zero-or-more non-space characters.
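Applied to the regex from the question (the sample string is copied from there), that one change fixes the output, since `\S*?` can no longer cross the space after "bleh" and swallow everything between fail-o-tron.com and homepage.com:

```perl
use strict;
use warnings;

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg '
          . 'http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';

# Same pattern as the question, with .*? replaced by \S*?; a match can
# no longer span whitespace, so it cannot start at fail-o-tron.com.
while ( $stuff =~ m{(http://\S*?homepage\.com/\S*?\.gif)}gi ) {
    print "$1\n";    # prints only the homepage.com and shomepage.com GIF URLs
}
```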
