Non greedy (reluctant) regex matching in sed?

Posted 2018-12-31 05:02

I'm trying to use sed to clean up lines of URLs to extract just the domain.

So from:

http://www.suepearson.co.uk/product/174/71/3816/

I want:

http://www.suepearson.co.uk/

(either with or without the trailing slash, it doesn't matter)

I have tried:

 sed 's|\(http:\/\/.*?\/\).*|\1|'

and (escaping the non greedy quantifier)

sed 's|\(http:\/\/.*\?\/\).*|\1|'

but I cannot seem to get the non greedy quantifier to work, so it always ends up matching the whole string.

20 answers
宁负流年不负卿
#2 · 2018-12-31 05:19

Another way, not using regex, is to use the field/delimiter method, e.g.:

string="http://www.suepearson.co.uk/product/174/71/3816/"
echo "$string" | awk -F"/" '{print $1,$2,$3}' OFS="/"
临风纵饮
#3 · 2018-12-31 05:20
sed -E 's|(http://[^/]+/).*|\1|'
旧时光的记忆
#4 · 2018-12-31 05:21

sed certainly has its place, but this is not one of them!

As Dee has pointed out: just use cut. It is far simpler and much safer in this case. Here's an example where we extract various components from the URL using Bash syntax:

url="http://www.suepearson.co.uk/product/174/71/3816/"

protocol=$(echo "$url" | cut -d':' -f1)
host=$(echo "$url" | cut -d'/' -f3)
urlhost=$(echo "$url" | cut -d'/' -f1-3)
urlpath=$(echo "$url" | cut -d'/' -f4-)

gives you:

protocol = "http"
host = "www.suepearson.co.uk"
urlhost = "http://www.suepearson.co.uk"
urlpath = "product/174/71/3816/"

As you can see, this is a much more flexible approach.

(all credit to Dee)
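Since the question is about cleaning up whole lines of URLs, the same cut call can be applied to a stream rather than one variable at a time. The following is only a sketch: it assumes one URL per line, and the second sample URL is made up for illustration (only the first comes from the question).

```shell
# Hypothetical input: one URL per line (only the first is from the question).
urls='http://www.suepearson.co.uk/product/174/71/3816/
http://example.com/a/b/c'

# Split on "/": field 1 is "http:", field 2 is empty, field 3 is the host.
# Keeping fields 1-3 rejoins them as scheme plus host, without a trailing slash.
domains=$(printf '%s\n' "$urls" | cut -d'/' -f1-3)

printf '%s\n' "$domains"
```

With a file of URLs, the same idea would be cut -d'/' -f1-3 on that file directly.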

伤终究还是伤i
#5 · 2018-12-31 05:22

sed 's|\(http:\/\/www\.[a-z.0-9]*\/\).*|\1|' works too

其实,你不懂
#6 · 2018-12-31 05:23

Try [^/]* instead of .*?:

sed 's|\(http://[^/]*/\).*|\1|g'
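To see why this helps: POSIX regular expressions as used by sed have no lazy quantifiers, so .* always grabs as much as it can, whereas [^/]* simply cannot cross a slash. A minimal sketch comparing the two on the URL from the question (the variable names are only for illustration):

```shell
url='http://www.suepearson.co.uk/product/174/71/3816/'

# Greedy: .* runs up to the LAST "/", so the capture group holds the whole URL.
greedy=$(printf '%s\n' "$url" | sed 's|\(http://.*/\).*|\1|')

# Negated class: [^/]* stops before the first "/" after the host,
# so the capture group holds only scheme plus host plus that slash.
short=$(printf '%s\n' "$url" | sed 's|\(http://[^/]*/\).*|\1|')

printf '%s\n%s\n' "$greedy" "$short"
```

The first line printed is the unchanged URL, the second is http://www.suepearson.co.uk/.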
明月照影归
#7 · 2018-12-31 05:23

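Here is something you can do with a two-step approach and awk (gensub requires GNU awk):

A=http://www.suepearson.co.uk/product/174/71/3816/
echo "$A" | awk '
{
  var = gensub(/\//, "||", 3, $0)   # replace the third "/" with a "||" marker
  sub(/\|\|.*/, "", var)            # drop the marker and everything after it
  print var
}'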

Output: http://www.suepearson.co.uk

Hope that helps!
