Non greedy (reluctant) regex matching in sed?

2018-12-31 05:02发布

I'm trying to use sed to clean up lines of URLs to extract just the domain..

So from:

http://www.suepearson.co.uk/product/174/71/3816/

I want:

http://www.suepearson.co.uk/

(either with or without the trainling slash, it doesn't matter)

I have tried:

 sed 's|\(http:\/\/.*?\/\).*|\1|'

and (escaping the non greedy quantifier)

sed 's|\(http:\/\/.*\?\/\).*|\1|'

but I can not seem to get the non greedy quantifier to work, so it always ends up matching the whole string.

20条回答
永恒的永恒
2楼-- · 2018-12-31 05:25

There is still hope to solve this using pure (GNU) sed. Despite this is not a generic solution in some cases you can use "loops" to eliminate all the unnecessary parts of the string like this:

sed -r -e ":loop" -e 's|(http://.+)/.*|\1|' -e "t loop"
  • -r: Use extended regex (for + and unescaped parenthesis)
  • ":loop": Define a new label named "loop"
  • -e: add commands to sed
  • "t loop": Jump back to label "loop" if there was a successful substitution

The only problem here is it will also cut the last separator character ('/'), but if you really need it you can still simply put it back after the "loop" finished, just append this additional command at the end of the previous command line:

-e "s,$,/,"
查看更多
谁念西风独自凉
3楼-- · 2018-12-31 05:25

Another sed version:

sed 's|/[:alphanum:].*||' file.txt

It matches / followed by an alphanumeric character (so not another forward slash) as well as the rest of characters till the end of the line. Afterwards it replaces it with nothing (ie. deletes it.)

查看更多
有味是清欢
4楼-- · 2018-12-31 05:26

Because you specifically stated you're trying to use sed (instead of perl, cut, etc.), try grouping. This circumvents the non-greedy identifier potentially not being recognized. The first group is the protocol (i.e. 'http://', 'https://', 'tcp://', etc). The second group is the domain:

echo "http://www.suon.co.uk/product/1/7/3/" | sed "s|^\(.*//\)\([^/]*\).*$|\1\2|"

If you're not familiar with grouping, start here.

查看更多
墨雨无痕
5楼-- · 2018-12-31 05:28

This is how to robustly do non-greedy matching of multi-character strings using sed. Lets say you want to change every foo...bar to <foo...bar> so for example this input:

$ cat file
ABC foo DEF bar GHI foo KLM bar NOP foo QRS bar TUV

should become this output:

ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV

To do that you convert foo and bar to individual characters and then use the negation of those characters between them:

$ sed 's/@/@A/g; s/{/@B/g; s/}/@C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/g; s/}/bar/g; s/{/foo/g; s/@C/}/g; s/@B/{/g; s/@A/@/g' file
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV

In the above:

  1. s/@/@A/g; s/{/@B/g; s/}/@C/g is converting { and } to placeholder strings that cannot exist in the input so those chars then are available to convert foo and bar to.
  2. s/foo/{/g; s/bar/}/g is converting foo and bar to { and } respectively
  3. s/{[^{}]*}/<&>/g is performing the op we want - converting foo...bar to <foo...bar>
  4. s/}/bar/g; s/{/foo/g is converting { and } back to foo and bar.
  5. s/@C/}/g; s/@B/{/g; s/@A/@/g is converting the placeholder strings back to their original characters.

Note that the above does not rely on any particular string not being present in the input as it manufactures such strings in the first step, nor does it care which occurrence of any particular regexp you want to match since you can use {[^{}]*} as many times as necessary in the expression to isolate the actual match you want and/or with seds numeric match operator, e.g. to only replace the 2nd occurrence:

$ sed 's/@/@A/g; s/{/@B/g; s/}/@C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/2; s/}/bar/g; s/{/foo/g; s/@C/}/g; s/@B/{/g; s/@A/@/g' file
ABC foo DEF bar GHI <foo KLM bar> NOP foo QRS bar TUV
查看更多
时光乱了年华
6楼-- · 2018-12-31 05:29

With sed, I usually implement non-greedy search by searching for anything except the separator until the separator :

echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*\)/.*;\1;p'

Output:

http://www.suon.co.uk

this is:

  • don't output -n
  • search, match pattern, replace and print s/<pattern>/<replace>/p
  • use ; search command separator instead of / to make it easier to type so s;<pattern>;<replace>;p
  • remember match between brackets \( ... \), later accessible with \1,\2...
  • match http://
  • followed by anything in brackets [], [ab/] would mean either a or b or /
  • first ^ in [] means not, so followed by anything but the thing in the []
  • so [^/] means anything except / character
  • * is to repeat previous group so [^/]* means characters except /.
  • so far sed -n 's;\(http://[^/]*\) means search and remember http://followed by any characters except / and remember what you've found
  • we want to search untill the end of domain so stop on the next / so add another / at the end: sed -n 's;\(http://[^/]*\)/' but we want to match the rest of the line after the domain so add .*
  • now the match remembered in group 1 (\1) is the domain so replace matched line with stuff saved in group \1 and print: sed -n 's;\(http://[^/]*\)/.*;\1;p'

If you want to include backslash after the domain as well, then add one more backslash in the group to remember:

echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*/\).*;\1;p'

output:

http://www.suon.co.uk/
查看更多
牵手、夕阳
7楼-- · 2018-12-31 05:30

sed does not support "non greedy" operator.

You have to use "[]" operator to exclude "/" from match.

sed 's,\(http://[^/]*\)/.*,\1,'

P.S. there is no need to backslash "/".

查看更多
登录 后发表回答