Extract all URLs that start with http or https and

Posted 2019-06-01 00:32

I would like to extract every link that starts with http:// (I'm not sure whether the file also contains https:// links) and ends with .html from a text file using the grep command.

The problem is that the file is very big and contains a lot of links...

I tried this:

grep "/http:\/\/.*?\.html/"  filename.txt > newFile.txt

but I get an empty file, just like with this:

grep -Eo "(http|https)://[a-zA-Z0-9]./(html)" filename.txt > newFile.txt

Can anyone help me?

Just to be sure we are on the same page: I want to extract all the links to a new file, one per line.

Thank you.

Best regards

Tags: html regex http url grep

2 answers

冷血范

#2 · 2019-06-01 01:10

You can use:

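grep -oP "https?://\S+?\.html" filename.txt > newFile.txt

This matches http:// or https://, then one or more non-space characters, as few as possible, up to the first .html. The lazy quantifier +? requires -P (Perl-compatible regular expressions, available in GNU grep); POSIX extended regular expressions (-E) have no lazy quantifiers.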
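
To see how a lazy match behaves, the pattern can be tried on a small sample first. This is a minimal sketch, assuming GNU grep (its -P option enables Perl-compatible regular expressions, where +? is lazy). The function name extract_html_links_pcre and the sample text are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print every http(s) link ending in .html, one per line.
# Assumes GNU grep built with PCRE support; -P makes +? a lazy quantifier,
# so each match stops at the first ".html" it can reach.
extract_html_links_pcre() {
    grep -oP 'https?://\S+?\.html' "$@"
}

# Made-up sample input: two links on the first line, none on the second.
printf '%s\n' \
    'see http://example.com/a.html and https://example.org/b/c.html here' \
    'no link on this line' |
    extract_html_links_pcre
```

With a file instead of stdin, the call would look like extract_html_links_pcre filename.txt > newFile.txt.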


放我归山

#3 · 2019-06-01 01:22

This works for me:

grep -oE '(http|https)://(.*)\.html' filename.txt > newFile.txt

However, if a line contains two links, the greedy .* runs from the start of the first link to the end of the last one, so both links come out as a single match:

http://site1.com/1.html</a>tralala<a href="http://site2.com/2.html
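
One way around this without lazy quantifiers is to exclude the characters that cannot appear inside a link, so a single match can never run from one URL into the next. The sketch below assumes links are delimited by whitespace, quotes or angle brackets. That holds for the line above, but it is an assumption about the rest of the file. The function name extract_html_links is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print each http(s) link ending in .html on its own line.
# The bracket expression excludes whitespace, quotes and angle brackets
# (an assumption about what delimits links in the file), so one match
# cannot span two links even though + is greedy.
extract_html_links() {
    grep -oE "https?://[^[:space:]\"'<>]+\.html" "$@"
}

# The problematic line from above, fed on stdin.
printf '%s\n' 'http://site1.com/1.html</a>tralala<a href="http://site2.com/2.html' |
    extract_html_links
```

For the original task the call would be extract_html_links filename.txt > newFile.txt. Because only -o and -E are used, this variant does not depend on PCRE support.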
