How can I extract URLs from plain text with Perl?

I need a Perl regex to parse plain text input and convert all links to valid HTML HREF links. I've tried 10 different versions I found on the web, but none of them seem to work correctly. I also tested other solutions posted on StackOverflow, none of which seem to work either. The correct solution should be able to find any URL in the plain text input and convert it to:

<a href="$1">$1</a>

Some cases other regular expressions I tried didn't handle correctly include:

URLs at the end of a line which are followed by returns
URLs that included question marks
URLs that start with 'https'

I'm hoping that another Perl guy out there will already have a regular expression they are using for this that they can share. Thanks in advance for your help!

Tags: regex perl url plaintext

4 Answers

地球回转人心会变

Reply #2 · 2019-07-13 03:17

Besides URI::Find, also check out the big regular expression database, Regexp::Common. Its Regexp::Common::URI module gives you something as easy as:

my ($uri) = $str =~ /$RE{URI}{-keep}/;

If you want the different pieces of that URI (hostname, query parameters, etc.), see the documentation of Regexp::Common::URI::http for what the $RE{URI}{HTTP} pattern captures.
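As a rough sketch of pulling those pieces apart, something like the following could work. It assumes Regexp::Common is installed and that the capture groups are numbered as described in the Regexp::Common::URI::http documentation ($1 the whole URI, $2 the scheme, $3 the host, $8 the query). The helper name uri_parts is made up for illustration, and the -scheme => 'https?' option is there so that https links match as well.

```perl
use strict;
use warnings;
use Regexp::Common qw(URI);

# HTTP/HTTPS pattern; {-keep} turns on the capture groups.
my $http_re = $RE{URI}{HTTP}{-scheme => 'https?'}{-keep};

# Hypothetical helper: returns a hash ref describing the first HTTP(S)
# URL found in $text, or nothing if the text contains no such URL.
sub uri_parts {
    my ($text) = @_;
    return unless $text =~ /$http_re/;
    return {
        uri    => $1,    # the complete URI
        scheme => $2,    # http or https
        host   => $3,    # host name or address
        query  => $8,    # query string without the "?", if any
    };
}

my $parts = uri_parts('see http://example.org/?test=one&another=2 here');
print "host: $parts->{host}\n" if $parts;
```

Only the first URL in the string is reported; to collect every URL, the same pattern could be applied in a while loop with the /g modifier.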

祖国的老花朵

Reply #3 · 2019-07-13 03:25

When I tried URI::Find::Schemeless with the following text:

Here is a URL  and one bare URL with 
https: https://www.example.com and another with a query
http://example.org/?test=one&another=2 and another with parentheses
http://example.org/(9.3)

Another one that appears in quotation marks "http://www.example.net/s=1;q=5"
etc. A link to an ftp site: ftp://user@example.org/test/me
How about one without a protocol www.example.com?

it messed up http://example.org/(9.3). So, I came up with the following with the help of Regexp::Common:

#!/usr/bin/perl

use strict; use warnings;
use CGI 'escapeHTML';
use Regexp::Common qw/URI/;
use URI::Find::Schemeless;

my $heuristic = URI::Find::Schemeless->schemeless_uri_re;

my $pattern = qr{
    $RE{URI}{HTTP}{-scheme=>'https?'} |
    $RE{URI}{FTP} |
    $heuristic
}x;

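# Paragraph mode: an empty $/ makes <DATA> return one blank-line-separated
# paragraph at a time. The sample text goes after a __DATA__ line at the
# end of the script.
local $/ = '';

while ( my $par = <DATA> ) {
    chomp $par;
    $par =~ s/</&lt;/g;    # neutralize stray tags before adding our own markup
    $par =~ s/( $pattern ) / linkify($1) /gex;
    print "<p>$par</p>\n";
}

# Wrap one matched URL in an anchor tag, adding a scheme if it has none.
sub linkify {
    my ($str) = @_;
    $str = "http://$str" unless $str =~ /^(?:ftp|https?):/;
    $str = escapeHTML($str);
    sprintf q|<a href="%s">%s</a>|, ($str) x 2;
}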

This worked for the input shown. Of course, life is never that easy as you can see by trying (http://example.org/(9.3)).
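One possible way to cope with that last case is to post-process each match and peel off closing parentheses that have no opening partner inside the URL. The following is only a sketch under that assumption, using core Perl; trim_unbalanced_parens is a hypothetical helper, not part of URI::Find or Regexp::Common.

```perl
use strict;
use warnings;

# Hypothetical helper: strips trailing ")" characters that are not balanced
# by a "(" inside the URL. Returns the trimmed URL and the removed suffix,
# so the caller can put the suffix back after the generated link.
sub trim_unbalanced_parens {
    my ($url) = @_;
    my $trailing = '';
    while ( $url =~ /\)\z/ ) {
        my $open  = () = $url =~ /\(/g;    # number of "(" in the URL
        my $close = () = $url =~ /\)/g;    # number of ")" in the URL
        last if $close <= $open;           # balanced: keep the final ")"
        chop $url;
        $trailing .= ')';
    }
    return ( $url, $trailing );
}

my ( $url, $rest ) = trim_unbalanced_parens('http://example.org/(9.3))');
print "$url\n";    # prints http://example.org/(9.3)
```

This is a heuristic: a URL that legitimately ends in an unmatched ")" would be shortened too.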

Summer. ? 凉城

Reply #4 · 2019-07-13 03:34

Here is sample code that extracts URLs. It reads lines from stdin, checks whether each line contains a valid HTTP URL, and prints the URL if it does.

use strict;
use warnings;

use Regexp::Common qw /URI/;

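while (1)
{
        # Prompt for and read one line from stdin.
        print "Enter the line: \n";
        my $line = <STDIN>;
        last unless defined $line; # stop at end of input instead of looping forever
        chomp ($line); # remove the trailing newline
        # With {-keep}, the whole matched URI is captured in $1.
        my ($uri) = $line =~ /$RE{URI}{HTTP}{-keep}/;
        next unless defined $uri;
        print "Contains an HTTP URI.\n";
        print "URL : $uri\n";
}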

The sample output I get is as follows:

Enter the line:
http://stackoverflow.com/posts/2565350/
Contains an HTTP URI.
URL : http://stackoverflow.com/posts/2565350/
Enter the line:
this is not valid url line
Enter the line:
www.google.com
Enter the line:
http://
Enter the line:
http://www.google.com
Contains an HTTP URI.
URL : http://www.google.com

成全新的幸福

Reply #5 · 2019-07-13 03:36

You want URI::Find. Once you extract the links, you should be able to handle the rest of the problem just fine.

This is answered in perlfaq9 under "How do I extract URLs?", by the way. There is a lot of good stuff in the perlfaq documents. :)
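For the conversion asked about in the question, a minimal sketch with URI::Find might look like this. It assumes URI::Find is installed; escape_html and linkify_text are hypothetical helper names. The callback passed to new() receives the URI object and the original matched text, and the optional second argument to find() is applied to the text between URLs.

```perl
use strict;
use warnings;
use URI::Find;

# Hypothetical helper: minimal HTML escaping for text and attribute values.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Hypothetical helper: returns a copy of $text with every URL that
# URI::Find recognizes wrapped in an anchor tag.
sub linkify_text {
    my ($text) = @_;
    my $finder = URI::Find->new(sub {
        my ( $uri, $orig_text ) = @_;
        return sprintf '<a href="%s">%s</a>',
            escape_html("$uri"), escape_html($orig_text);
    });
    # find() edits the string in place; the second argument escapes
    # everything that is not part of a URL.
    $finder->find( \$text, \&escape_html );
    return $text;
}

print linkify_text("R&D see https://www.example.com/?x=1&y=2\nend"), "\n";
```

URI::Find only looks for URLs that carry a scheme; for bare hosts such as www.example.com, URI::Find::Schemeless can be constructed the same way.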
