Using regex to extract URLs from plain text with P

Posted 2019-01-23 20:40

How can I use Perl regexps to extract all URLs of a specific domain (with possibly variable subdomains) with a specific extension from plain text? I have tried:

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';
while ($stuff =~ m/(http\:\/\/.*?homepage.com\/.*?\.gif)/gmsi)
{
    print $1."\n";
}

It fails horribly and gives me:

http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif
http://shomepage.com/woot.gif

I thought that wouldn't happen because I am using .*?, which ought to be non-greedy and give me the smallest match. Can anyone tell me what I am doing wrong? (I don't want some uber-complex, canned regexp to validate URLs; I want to know what I am doing wrong so I can learn from it.)

Tags: regex perl url

7 answers

走好不送

#2 · 2019-01-23 21:16

"I thought that shouldn't happen because I am using .*?, which ought to be non-greedy and give me the smallest match."

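It does, but only relative to where the match starts. The regex engine scans left to right and settles on the first starting position from which the whole pattern can succeed; a non-greedy quantifier then keeps the match as short as possible going right from that position. Starting from the first http://, .*? is free to expand across the spaces and into the next URL until homepage.com/ matches, so that long string is the smallest match from that starting point.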
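To make the point concrete, here is a minimal sketch that runs the original pattern next to a tighter one. The tighter pattern is only one possible fix and rests on assumptions not stated in the question: that a subdomain is a run of word characters or hyphens followed by a dot, and that a URL never contains whitespace. The variable names @loose and @strict are made up for the illustration.

```perl
use strict;
use warnings;

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg '
          . 'http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';

# Original pattern: the match starts at the first "http://" the engine
# reaches, and .*? then grows rightward until "homepage.com/" can match.
my @loose = $stuff =~ m{(http://.*?homepage.com/.*?\.gif)}gsi;

# Tighter pattern (assumptions about what a URL looks like):
#   (?:[\w-]+\.)*  zero or more subdomain labels, each ending in a dot
#   \S*?           path characters, which cannot cross whitespace
my @strict = $stuff =~ m{(http://(?:[\w-]+\.)*homepage\.com/\S*?\.gif)}gi;

print "loose:  $_\n" for @loose;
print "strict: $_\n" for @strict;
```

Because neither [\w-] nor \S can match a space, the tighter pattern cannot start in one URL and finish in another. It also skips shomepage.com, since "shomepage." is not a subdomain label followed by homepage.com.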

For the future, note that you don't have to escape the slashes, because you don't have to use slashes as the delimiter. You don't have to escape the colon either. Next time just do this:

m|(http://.*?homepage\.com/.*?\.gif)|

m#(http://.*?homepage\.com/.*?\.gif)#

m<(http://.*?homepage\.com/.*?\.gif)>

or use one of many other delimiter characters; see the Perl documentation (perlre, and "Quote and Quote-like Operators" in perlop).
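As a quick check that the delimiter is purely a matter of notation, the sketch below compiles one pattern under four delimiters and matches each against the same string. The sample text and the names @patterns and @found are invented for the example, and the pattern is a simplified one rather than the exact regex from the question.

```perl
use strict;
use warnings;

my $text = 'see http://homepage.com/woot.gif here';

# The same pattern under four delimiters; only the slash-delimited
# form needs its slashes escaped.
my @patterns = (
    qr/http:\/\/homepage\.com\/\S*?\.gif/,
    qr|http://homepage\.com/\S*?\.gif|,
    qr#http://homepage\.com/\S*?\.gif#,
    qr{http://homepage\.com/\S*?\.gif},
);

# Collect what each compiled pattern matches.
my @found;
for my $re (@patterns) {
    push @found, $1 if $text =~ /($re)/;
}

print "$_\n" for @found;
```

All four entries match the same substring, so the choice comes down to which delimiter needs the fewest escapes for the pattern at hand.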
