How can I extract URLs from plain text in Perl?

Posted 2019-05-13 12:55

I've seen some similar posts, but none of them covers exactly what I want to do.

How can I extract URL links from plain text and then remove them from that text?

Example:

"Hello!!, I love http://www.google.es".

I want to extract "http://www.google.es", save it in a variable, and then remove it from my text.

In the end, the text should look like this:

"Hello!!, I love".

The URLs are usually the last "word" of the text, but not always.

4 answers
我欲成王,谁敢阻挡
#2 · 2019-05-13 13:26

Perhaps you want URI::Find, which can find URIs in arbitrary text. The return value from the code reference you give it produces the replacement string for the URL, so you can just return the empty string if you merely want to get rid of the URIs:

use URI::Find;

my $string = do { local $/; <DATA> };

my $finder = URI::Find->new( sub { '' } );
$finder->find(\$string );

print $string;

__END__
This has a mailto:joe@example.com
Go to http://www.google.com
Pay at https://paypal.com
From ftp://ftp.cpan.org download a file
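
Since the question also asks to keep the URL in a variable, the same callback can record each match before returning the empty string. This is a minimal sketch, assuming the callback receives the URI object and the original matched text as its two arguments, as the URI::Find documentation describes. The names `$text` and `@found` are illustrative and not part of the answer above.

```perl
use strict;
use warnings;
use URI::Find;

my $text = 'Hello!!, I love http://www.google.es';
my @found;    # illustrative name: collects every URI that gets removed

my $finder = URI::Find->new(sub {
    my ($uri, $original) = @_;   # URI object, then the text as it appeared
    push @found, $original;      # save it before it is removed
    return '';                   # replace the URI with nothing
});
$finder->find(\$text);

$text =~ s/\s+\z//;              # drop the space left behind at the end

print "$text\n";
print "$_\n" for @found;
```

Trimming the trailing whitespace is a separate step because URI::Find only replaces the URI itself, not the space in front of it.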
Deceive 欺骗
#3 · 2019-05-13 13:30
  • You can use URI::Find to extract URLs from an arbitrary text document.
  • Or use Regexp::Common::URI, which provides patterns for URIs.

    use strict;
    use warnings;
    use Regexp::Common qw/URI/;
    my $str = "Hello!!, I love http://www.google.es";
    my ($uri) = $str =~ /$RE{URI}{-keep}/;
    print "$uri\n"; #output: http://www.google.es
    
\"骚年 ilove
#4 · 2019-05-13 13:31

If Perl is not a must, awk can do it:

$ cat  file
"Hello!!, I love http://www.google.es".
this is another link http://www.somewhere.com
this if ftp link ftp://www.anywhere.com the end

$ awk '{gsub(/(http|ftp):\/\/.[^" ]*/,"") }1'  file
"Hello!!, I love ".
this is another link
this if ftp link  the end

Of course, you can also adapt the regex to Perl if you like.
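
One possible Perl adaptation of that pattern is sketched below. It uses only core Perl. The helper name `extract_urls` is hypothetical, and the pattern is widened slightly from the awk version: it also accepts `https` and stops at any whitespace rather than only at a space. Both changes are assumptions about what is wanted, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the cleaned text followed by the URLs removed.
sub extract_urls {
    my ($text) = @_;
    my @urls;

    # Scheme, "://", then everything up to whitespace or a double quote.
    # The /e flag runs the replacement as code, so each match can be saved
    # before it is replaced with the empty string.
    $text =~ s{((?:https?|ftp)://[^\s"]+)}{ push @urls, $1; '' }ge;

    $text =~ s/[ \t]{2,}/ /g;   # collapse the gap left in the middle of a line
    $text =~ s/\s+\z//;         # trim whitespace left at the very end

    return ($text, @urls);
}

my ($clean, @urls) = extract_urls('Hello!!, I love http://www.google.es');
print "$clean\n";
print "$_\n" for @urls;
```

Like the awk version, this treats trailing punctuation as part of the URL, so a sentence ending in `http://www.google.es.` would capture the final dot as well.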

疯言疯语
#5 · 2019-05-13 13:37

This works for me in 99% of cases. There are certainly edge cases, but for my needs it's good enough:

/((?<=[^a-zA-Z0-9])(?:https?\:\/\/|[a-zA-Z0-9]{1,}\.{1}|\b)(?:\w{1,}\.{1}){1,5}(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|iq|io|ac|ly|sm){1}(?:\/[a-zA-Z0-9]{1,})*)/mg

https://regex101.com/r/fO6mX3/2
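
To tie this back to the question, the pattern can be dropped into Perl to both collect and strip the URLs. This is a sketch under the assumption that the pattern is used unchanged apart from its flags; the variable names are illustrative.

```perl
use strict;
use warnings;

# The pattern from the answer above, stored as a compiled regex.
my $url_re = qr/((?<=[^a-zA-Z0-9])(?:https?\:\/\/|[a-zA-Z0-9]{1,}\.{1}|\b)(?:\w{1,}\.{1}){1,5}(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|iq|io|ac|ly|sm){1}(?:\/[a-zA-Z0-9]{1,})*)/;

my $text = 'Hello!!, I love http://www.google.es';

# In list context, /g returns every capture of group 1.
my @urls = $text =~ /$url_re/g;

# Remove the matches from a copy and trim the leftover trailing space.
(my $clean = $text) =~ s/$url_re//g;
$clean =~ s/\s+\z//;

print "$clean\n";
print "$_\n" for @urls;
```

One edge case worth knowing: the leading lookbehind requires a non-alphanumeric character before the match, so a URL at the very start of the string is matched without its scheme (only `www.google.es` in this example).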
