Non greedy (reluctant) regex matching in sed?

Posted 2018-12-31 05:02

I'm trying to use sed to clean up lines of URLs to extract just the domain.

So from:

http://www.suepearson.co.uk/product/174/71/3816/

I want:

http://www.suepearson.co.uk/

(either with or without the trailing slash, it doesn't matter)

I have tried:

 sed 's|\(http:\/\/.*?\/\).*|\1|'

and (escaping the non greedy quantifier)

sed 's|\(http:\/\/.*\?\/\).*|\1|'

but I cannot seem to get the non-greedy quantifier to work, so it always ends up matching the whole string.

20 answers
无与为乐者.
Reply #2 · 2018-12-31 05:30

sed - non greedy matching by Christoph Sieghart

The trick to get non greedy matching in sed is to match all characters excluding the one that terminates the match. I know, a no-brainer, but I wasted precious minutes on it and shell scripts should be, after all, quick and easy. So in case somebody else might need it:

Greedy matching

% echo "<b>foo</b>bar" | sed 's/<.*>//g'
bar

Non greedy matching

% echo "<b>foo</b>bar" | sed 's/<[^>]*>//g'
foobar
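
Applied to the URL from the question, the same idea might look like the sketch below: instead of `.*?`, use `[^/]*`, so the match cannot run past the first slash after the host name. The variable names `url` and `domain` are only illustrative.

```shell
# Negated character class in place of a lazy quantifier:
# [^/]* can only consume characters up to the next slash.
url='http://www.suepearson.co.uk/product/174/71/3816/'
domain=$(printf '%s\n' "$url" | sed 's|\(http://[^/]*/\).*|\1|')
printf '%s\n' "$domain"   # prints http://www.suepearson.co.uk/
```

Using `|` as the delimiter also avoids having to escape the slashes in the pattern.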
梦醉为红颜
Reply #3 · 2018-12-31 05:31

Non-greedy solution for more than a single character

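This thread is really old, but I assume people still need it. Let's say you want to kill everything up to the very first occurrence of HELLO. You cannot say [^HELLO], because that is a bracket expression matching any single character other than H, E, L or O, not "anything but the word HELLO".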

So a nice solution involves two steps, assuming that you can spare a unique word that you are not expecting in the input, say top_sekrit.

In this case we can:

s/HELLO/top_sekrit/     #will only replace the very first occurrence
s/.*top_sekrit//        #kill everything till end of the first HELLO

Of course, with a simpler input you could use a smaller word, or maybe even a single character.
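
As a rough sketch, the two steps can be run as one sed script on a sample line. The input string and the variable names `input` and `rest` are made up for illustration, and the sketch assumes top_sekrit never appears in the real input.

```shell
# Step 1 marks only the first HELLO (no g flag); step 2 can then be greedy,
# because the placeholder occurs exactly once on the line.
# Assumption: "top_sekrit" never appears in the input.
input='foo HELLO bar HELLO baz'
rest=$(printf '%s\n' "$input" | sed 's/HELLO/top_sekrit/; s/.*top_sekrit//')
printf '%s\n' "$rest"   # prints " bar HELLO baz" (with the leading space)
```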

HTH!

情到深处是孤独
Reply #4 · 2018-12-31 05:35

Neither basic nor extended POSIX/GNU regex recognizes the non-greedy quantifier; you need a later regex flavor. Fortunately, a Perl regex for this context is pretty easy to get:

perl -pe 's|(http://.*?/).*|\1|'
看淡一切
Reply #5 · 2018-12-31 05:38
echo "/home/one/two/three/myfile.txt" | sed 's|\(.*\)/.*|\1|'

Don't bother, I got it on another forum :)

妖精总统
Reply #6 · 2018-12-31 05:39

I realize this is an old entry, but someone may find it useful. As the full domain name may not exceed a total length of 253 characters, replace .* with .\{1,255\} (no space inside the braces, or sed rejects the interval)
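
One caveat worth checking before relying on this: an interval such as \{1,255\} only caps the length of the match, it is still greedy. A small sketch on the URL from the question (variable names are illustrative) suggests the bounded pattern still runs to the last slash within the limit:

```shell
# The interval limits how many characters may match, but the match is still
# the longest one possible, so it ends at the last slash, not the first.
url='http://www.suepearson.co.uk/product/174/71/3816/'
bounded=$(printf '%s\n' "$url" | sed 's|\(http://.\{1,255\}/\).*|\1|')
printf '%s\n' "$bounded"   # prints the whole URL unchanged
```

So the interval helps mainly on very long lines; for short URLs like this one the negated character class is the more direct fix.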

妖精总统
Reply #7 · 2018-12-31 05:42

Simulating lazy (un-greedy) quantifier in sed

And all other regex flavors!

  1. Finding first occurrence of an expression:

    • POSIX ERE (using -r option)

      Regex:

      (EXPRESSION).*|.
      

      Sed:

      sed -r "s/(EXPRESSION).*|./\1/g" # Global `g` modifier should be on
      

      Example (finding the first sequence of digits):

      $ sed -r "s/([0-9]+).*|./\1/g" <<< "foo 12 bar 34"
      
      12
      

      How does it work?

      This regex relies on the alternation |. At each position the engine looks for the left side of the alternation (our target); if that does not match, the right side, a single dot ., matches the next character.

      Since the global flag is set, the engine keeps matching character by character until it reaches our target or the end of the input string. As soon as the first and only capturing group on the left side of the alternation, (EXPRESSION), is matched, the rest of the line is consumed immediately by .* as well. The value is now held in the first capturing group.

    • POSIX BRE

      Regex:

      \(\(\(EXPRESSION\).*\)*.\)*
      

      Sed:

      sed "s/\(\(\(EXPRESSION\).*\)*.\)*/\3/"
      

      Example (finding first sequence of digits):

      $ sed "s/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/" <<< "foo 12 bar 34"
      
      12
      

      This one is like the ERE version but with no alternation involved. At each position the engine tries to match a digit.

      If one is found, the following digits are consumed and captured, and the rest of the line is matched immediately. Otherwise, since * means zero or more, the engine skips the second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at the dot . to match a single character, and the process continues.

  2. Finding first occurrence of a delimited expression:

    This approach will match the very first occurrence of a string that is delimited. We can call it a block of string.

    sed "s/\(END-DELIMITER-EXPRESSION\).*/\1/; \
         s/\(\(START-DELIMITER-EXPRESSION.*\)*.\)*/\1/g"
    

    Input string:

    foobar start block #1 end barfoo start block #2 end
    

    -EDE: end

    -SDE: start

    $ sed "s/\(end\).*/\1/; s/\(\(start.*\)*.\)*/\1/g" <<< "foobar start block #1 end barfoo start block #2 end"
    

    Output:

    start block #1 end
    

    The first regex, \(end\).*, matches and captures the first end delimiter, end, and substitutes the whole match with the captured characters, i.e. the end delimiter. At this stage our output is: foobar start block #1 end.

    The result is then passed to the second regex, \(\(start.*\)*.\)*, which works like the POSIX BRE version above. It matches a single character if the start delimiter start is not matched; otherwise it matches and captures the start delimiter and the rest of the characters.
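
The two approaches above can be sketched in plain sh without the bash-only <<< here-string. This is an illustration under a few assumptions: it uses -E (accepted by GNU and BSD sed) where the answer uses -r, the variable names are made up, and the second step of the delimited-block example swaps in the ERE alternation form rather than the BRE form shown above.

```shell
# 1. First occurrence of a digit run, using the ERE alternation.
first_digits=$(printf '%s\n' 'foo 12 bar 34' | sed -E 's/([0-9]+).*|./\1/g')
printf '%s\n' "$first_digits"   # prints 12

# A line with no match comes out empty, because every character is
# replaced by group 1, which never took part in a match.
no_match=$(printf '%s\n' 'no digits here' | sed -E 's/([0-9]+).*|./\1/g')
printf '[%s]\n' "$no_match"     # prints []

# 2. First delimited block: cut everything after the first "end",
#    then drop characters one by one until "start" is reached.
block=$(printf '%s\n' 'foobar start block #1 end barfoo start block #2 end' |
  sed -E 's/(end).*/\1/; s/(start.*)|./\1/g')
printf '%s\n' "$block"          # prints start block #1 end
```

Note the empty result for non-matching lines: if those lines should pass through unchanged, this pattern needs a guard in front of it.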


Directly answering your question

Using approach #2 (delimited expression) you should select two appropriate expressions:

  • EDE: [^:/]\/

  • SDE: http:

Usage:

$ sed "s/\([^:/]\/\).*/\1/g; s/\(\(http:.*\)*.\)*/\1/" <<< "http://www.suepearson.co.uk/product/174/71/3816/"

Output:

http://www.suepearson.co.uk/