Non greedy (reluctant) regex matching in sed? - Answers, page 3

I'm trying to use sed to clean up lines of URLs to extract just the domain.

So from:

http://www.suepearson.co.uk/product/174/71/3816/

I want:

http://www.suepearson.co.uk/

(either with or without the trailing slash, it doesn't matter)

I have tried:

 sed 's|\(http:\/\/.*?\/\).*|\1|'

and (escaping the non greedy quantifier)

sed 's|\(http:\/\/.*\?\/\).*|\1|'

but I cannot seem to get the non greedy quantifier to work, so it always ends up matching the whole string.

Tags: regex sed pcre greedy regex-greedy

20 answers

无与为乐者.

Reply #2 · 2018-12-31 05:30

sed - non greedy matching by Christoph Sieghart

The trick to get non greedy matching in sed is to match all characters excluding the one that terminates the match. I know, a no-brainer, but I wasted precious minutes on it and shell scripts should be, after all, quick and easy. So in case somebody else might need it:

Greedy matching

% echo "<b>foo</b>bar" | sed 's/<.*>//g'
bar

Non greedy matching

% echo "<b>foo</b>bar" | sed 's/<[^>]*>//g'
foobar
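
Applied to the URL from the question, the same idea might look like the sketch below. The function name `strip_path` is made up for illustration, and the pattern assumes every input line starts with `http://`.

```shell
# strip_path is a hypothetical helper name, not part of sed.
# [^/]* cannot cross a slash, so the match stops at the first "/" after the host.
strip_path() {
  sed 's|\(http://[^/]*/\).*|\1|'
}

echo "http://www.suepearson.co.uk/product/174/71/3816/" | strip_path
# prints: http://www.suepearson.co.uk/
```

The negated class plays the role of the lazy quantifier here: it can only consume characters that are not the terminating slash.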


梦醉为红颜

Reply #3 · 2018-12-31 05:31

Non-greedy solution for more than a single character

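This thread is really old, but I assume people still need it. Let's say you want to kill everything up to the very first occurrence of HELLO. You cannot say [^HELLO], because a bracket expression excludes the single characters H, E, L and O, not the word as a whole.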

So a nice solution involves two steps, assuming that you can spare a unique word that you are not expecting in the input, say top_sekrit.

In this case we can:

s/HELLO/top_sekrit/     #will only replace the very first occurrence
s/.*top_sekrit//        #kill everything till end of the first HELLO

Of course, with a simpler input you could use a smaller word, or maybe even a single character.
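
Put together as one command, the two steps might look like this sketch. The function name `first_hello_cut` and the sample input are invented for illustration, and it assumes top_sekrit never appears in the real input.

```shell
# first_hello_cut is a hypothetical name; top_sekrit is assumed absent from the input.
first_hello_cut() {
  # Step 1: without the g flag, only the first HELLO on the line is replaced.
  # Step 2: the greedy .* can now only reach the single placeholder.
  sed -e 's/HELLO/top_sekrit/' -e 's/.*top_sekrit//'
}

echo "abc HELLO def HELLO ghi" | first_hello_cut
# prints " def HELLO ghi" (with the leading space): the second HELLO survives
```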

HTH!


情到深处是孤独

Reply #4 · 2018-12-31 05:35

Neither basic nor extended POSIX/GNU regex recognizes the non-greedy quantifier; you need a newer regex flavor. Fortunately, Perl regex for this context is pretty easy to get:

perl -pe 's|(http://.*?/).*|\1|'


看淡一切

Reply #5 · 2018-12-31 05:38

echo "/home/one/two/three/myfile.txt" | sed 's|\(.*\)/.*|\1|'

Don't bother, I got it on another forum :)


妖精总统

Reply #6 · 2018-12-31 05:39

I realize this is an old entry, but someone may find it useful. As the full domain name may not exceed a total length of 253 characters, replace .* with .\{1,255\}


妖精总统

Reply #7 · 2018-12-31 05:42

Simulating lazy (un-greedy) quantifier in `sed`

And all other regex flavors!

Finding first occurrence of an expression:
- POSIX ERE (using -r option)
  
  Regex:
```
(EXPRESSION).*|.
```
  Sed:
```
sed -r "s/(EXPRESSION).*|./\1/g" # Global `g` modifier should be on
```
  Example (finding first sequence of digits):
```
$ sed -r "s/([0-9]+).*|./\1/g" <<< "foo 12 bar 34"
```
```
12
```
  How does it work?
  
  This regex relies on the alternation |. At each position the engine looks for the left side of the alternation (our target); if the target does not match there, the right side, a single dot ., matches the next character instead.
  
  Since the global flag is set, the engine keeps matching character by character until it reaches either the end of the input or our target. As soon as the capturing group on the left side (EXPRESSION) matches, .* consumes the rest of the line as well. Every match is replaced with \1, which is empty for the single-character matches, so only the captured value remains.
- POSIX BRE
  
  Regex:
```
\(\(\(EXPRESSION\).*\)*.\)*
```
  Sed:
```
sed "s/\(\(\(EXPRESSION\).*\)*.\)*/\3/"
```
  Example (finding first sequence of digits):
```
$ sed "s/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/" <<< "foo 12 bar 34"
```
```
12
```
  This one works like the ERE version but with no alternation involved. At each position the engine tries to match a digit.
  
  If one is found, the following digits are consumed and captured and the rest of the line is matched immediately. Otherwise, since * means zero or more, the engine skips over the second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at the dot ., which matches a single character, and the process continues.
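
To see the ERE alternation trick from this list in isolation, here is a small sketch. The function name `first_number` is invented for illustration; `-E` is used to select ERE, which GNU sed accepts alongside `-r`.

```shell
# first_number is a hypothetical helper name; -E selects ERE (GNU sed also accepts -r).
first_number() {
  # Left side: capture the first run of digits and swallow the rest of the line.
  # Right side: any other single character, replaced by an empty \1.
  sed -E 's/([0-9]+).*|./\1/g'
}

echo "foo 12 bar 34" | first_number     # prints: 12
echo "no digits here" | first_number    # prints an empty line
```

Note the second call: when the target never matches, every character is deleted and the line comes out empty rather than unchanged.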
Finding first occurrence of a delimited expression:

This approach will match the very first occurrence of a string that is delimited. We can call it a block of string.
```
sed "s/\(END-DELIMITER-EXPRESSION\).*/\1/; \
     s/\(\(START-DELIMITER-EXPRESSION.*\)*.\)*/\1/g"
```
Input string:
```
foobar start block #1 end barfoo start block #2 end
```
EDE (end delimiter expression): end

SDE (start delimiter expression): start
```
$ sed "s/\(end\).*/\1/; s/\(\(start.*\)*.\)*/\1/g"
```
Output:
```
start block #1 end
```
The first regex \(end\).* matches and captures the first end delimiter end, then substitutes the whole match with the captured text, which is the end delimiter itself. At this stage our output is: foobar start block #1 end.

The result is then passed to the second regex \(\(start.*\)*.\)*, which works like the POSIX BRE version above. It matches a single character if the start delimiter start is not matched; otherwise it matches and captures the start delimiter together with the rest of the characters.
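
The same two-step idea can also be written with ERE alternation, which may be easier to read. This is a variant sketch, not the BRE command above; the function name `first_block` is hypothetical and the delimiters are the literal words start and end from the example.

```shell
# first_block is a hypothetical helper name. This is an ERE rewrite of the
# two-step idea, not the BRE command from the answer.
first_block() {
  # Step 1: cut everything after the first "end".
  # Step 2: drop single characters until "start", then keep the rest.
  sed -E 's/(end).*/\1/; s/(start.*)|./\1/g'
}

echo "foobar start block #1 end barfoo start block #2 end" | first_block
# prints: start block #1 end
```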

Directly answering your question

Using approach #2 (delimited expression) you should select two appropriate expressions:

EDE: [^:/]\/
SDE: http:

Usage:

$ sed "s/\([^:/]\/\).*/\1/g; s/\(\(http:.*\)*.\)*/\1/" <<< "http://www.suepearson.co.uk/product/174/71/3816/"

Output:

http://www.suepearson.co.uk/
