Extract all URLs that start with http or https and

Posted 2019-06-01 00:32

I would like to extract every link that starts with http:// (there may also be https:// links in the file) and ends with .html from a text file using the grep command.

The problem is that the file is very large and contains a lot of links.

I tried this:

grep "/http:\/\/.*?\.html/"  filename.txt > newFile.txt

but I get an empty file, just like with this:

grep -Eo "(http|https)://[a-zA-Z0-9]./(html)" filename.txt > newFile.txt

Can anyone help me?

Just to be clear: I want to extract all the links into a new file, one per line.

Thank you.

Best regards

2 answers
冷血范
#2 · 2019-06-01 01:10

You can use:

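grep -oP 'https?://\S+?\.html' filename.txt > newFile.txt

This matches one or more non-whitespace characters after http:// or https://, stopping at the first .html. The lazy quantifier +? is Perl syntax, so it needs -P (GNU grep built with PCRE support); POSIX extended regexes (-E) do not define lazy quantifiers.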
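
A portable variant that does not depend on lazy quantifiers is to use a bracket expression that simply cannot cross a URL boundary. The following is only a minimal sketch, assuming URLs in the file never contain whitespace, double quotes, or angle brackets; sample.txt and links.txt are hypothetical file names used for illustration:

```shell
# Build a small sample input (hypothetical content, for illustration only).
printf '%s\n' \
  'see <a href="http://example.com/a.html">A</a> and https://example.org/b/c.html now' \
  'no link here' > sample.txt

# -o prints each match on its own line; -E enables extended regexes.
# [^[:space:]"<>]+ stops at whitespace, quotes and angle brackets,
# so two links on the same line are never merged into one match.
grep -Eo 'https?://[^[:space:]"<>]+\.html' sample.txt > links.txt

cat links.txt
```

Because the bracket expression excludes the characters that usually surround a URL in HTML or plain text, greedy matching is harmless here, and the same command should also work with grep implementations that lack -P.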

放我归山
#3 · 2019-06-01 01:22

This works for me:

grep -oE '(http|https)://(.*)\.html' filename.txt > newFile.txt

However, if a line contains two links, the greedy .* matches from the first http:// to the last .html, so both links end up in a single match on one line:

http://site1.com/1.html</a>tralala<a href="http://site2.com/2.html
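
One way around this limitation is to break the text apart before matching, so that each candidate URL already sits on its own line. This is a rough sketch rather than a definitive fix: it assumes links are delimited by double quotes, angle brackets, or whitespace, and two_links.txt and split_links.txt are hypothetical file names:

```shell
# The line from the example above, saved to a hypothetical file.
printf '%s\n' 'http://site1.com/1.html</a>tralala<a href="http://site2.com/2.html' > two_links.txt

# Step 1: turn quotes, angle brackets and whitespace into newlines,
#         so each URL ends up alone on its own line (-s squeezes repeats).
# Step 2: keep only lines that are entirely an http(s) URL ending in .html.
tr -s '"<>[:space:]' '\n' < two_links.txt \
  | grep -E '^https?://.*\.html$' > split_links.txt

cat split_links.txt
```

Since the input is split first, the greedy .* can no longer span two links, and the output has one link per line as the question asks.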