I'm trying to use sed to clean up lines of URLs to extract just the domain.
So from:
http://www.suepearson.co.uk/product/174/71/3816/
I want:
http://www.suepearson.co.uk/
(either with or without the trailing slash, it doesn't matter)
I have tried:
sed 's|\(http:\/\/.*?\/\).*|\1|'
and (escaping the non greedy quantifier)
sed 's|\(http:\/\/.*\?\/\).*|\1|'
but I cannot seem to get the non-greedy quantifier to work, so it always ends up matching the whole string.
sed - non greedy matching by Christoph Sieghart
The trick to get non greedy matching in sed is to match all characters excluding the one that terminates the match. I know, a no-brainer, but I wasted precious minutes on it and shell scripts should be, after all, quick and easy. So in case somebody else might need it:
Greedy matching
Non greedy matching
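To make the difference concrete, here is a minimal sketch using the URL from the question (the variable names are just for illustration). The first command uses a greedy .* and the second swaps it for a negated bracket expression, which is the trick described above:

```shell
url='http://www.suepearson.co.uk/product/174/71/3816/'

# Greedy: ".*" runs as far as it can, so the captured part ends at the LAST slash.
greedy=$(printf '%s\n' "$url" | sed 's|\(http://.*/\).*|\1|')

# "Non-greedy": "[^/]*" cannot cross a slash, so the captured part ends at the
# FIRST slash after the host name.
lazy=$(printf '%s\n' "$url" | sed 's|\(http://[^/]*/\).*|\1|')

printf 'greedy: %s\n' "$greedy"
printf 'lazy:   %s\n' "$lazy"
```

The greedy version hands back the whole URL unchanged, while the negated-class version stops at http://www.suepearson.co.uk/.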
Non-greedy solution for more than a single character
This thread is really old but I assume people still need it. Let's say you want to kill everything till the very first occurrence of HELLO. You cannot say [^HELLO], because a bracket expression excludes individual characters (H, E, L, O), not the whole word.
So a nice solution involves two steps, assuming that you can spare a unique word that you are not expecting in the input, say top_sekrit. In this case we can:
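A minimal sketch of the two steps, assuming a made-up input line and that top_sekrit never appears in it:

```shell
line='foo HELLO bar HELLO baz'

# Step 1: without the g flag, only the FIRST "HELLO" is replaced by the placeholder.
# Step 2: the greedy ".*" is now harmless, because the placeholder occurs only once.
result=$(printf '%s\n' "$line" | sed -e 's/HELLO/top_sekrit/' -e 's/.*top_sekrit//')

printf '[%s]\n' "$result"
```

Everything up to and including the first HELLO is removed; the second HELLO is left alone.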
Of course, with a simpler input you could use a smaller word, or maybe even a single character.
HTH!
Neither basic nor extended POSIX/GNU regex recognizes the non-greedy quantifier; you need a later regex. Fortunately, Perl regex for this context is pretty easy to get:
Don't bother, I got it on another forum :)
I realize this is an old entry, but someone may find it useful. As the full domain name may not exceed a total length of 253 characters, replace .* with .\{1,255\}
Simulating lazy (un-greedy) quantifier in sed
And all other regex flavors!
Finding first occurrence of an expression:
POSIX ERE (using -r option)
Regex: (EXPRESSION).*|.
Sed: sed -r 's/(EXPRESSION).*|./\1/g'
Example (finding first sequence of digits):
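As a sketch of the ERE form, with a made-up input line and [0-9]+ standing in for the target expression. I've used -E here, which both GNU and BSD sed accept; on GNU sed it is the same as -r:

```shell
# Left side of the alternation: the target plus the rest of the line.
# Right side: a single character, which gets replaced by the (empty) group 1.
# The g flag is needed so the substitution is retried at every position.
first_digits=$(printf '%s\n' 'foo 12 bar 34' | sed -E 's/([0-9]+).*|./\1/g')

printf '%s\n' "$first_digits"
```

On GNU sed this leaves only 12: the characters before the first digit run are deleted one at a time, and the rest of the line is swallowed together with the match.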
How does it work?
This regex benefits from an alternation |. At each position the engine tries the first side of the alternation (our target), and if that does not match, the second side of the alternation, which has a dot . that matches the next immediate character.
Since the global flag is set, the engine continues matching character by character up to the end of the input string or our target. As soon as the first and only capturing group on the left side of the alternation, (EXPRESSION), is matched, the rest of the line is consumed immediately as well by .*. We now hold our value in the first capturing group.
POSIX BRE
Regex: \(\(\(EXPRESSION\).*\)*.\)*
Sed: sed 's/\(\(\(EXPRESSION\).*\)*.\)*/\3/'
Example (finding first sequence of digits):
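A sketch of the BRE form on a made-up input line, with [0-9]\{1,\} as the target. I have only reasoned this through for GNU sed, so treat the behaviour on other implementations as an assumption:

```shell
# No alternation in BRE, so the "target plus rest of line" part is made
# optional with * and followed by a dot that eats one character at a time.
# The target is the third (innermost) group, hence \3.
first_digits=$(printf '%s\n' 'foo 12 bar 34' | sed 's/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/')

printf '%s\n' "$first_digits"
```

No g flag is needed here, because the outer * already lets a single match span the whole line.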
This one is like the ERE version but with no alternation involved. That's all. At each position the engine tries to match a digit. If one is found, the following digits are consumed and captured, and the rest of the line is matched immediately. Otherwise, since * means zero or more, it skips over the second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at a dot . to match a single character, and this process continues.
Finding first occurrence of a delimited expression:
This approach will match the very first occurrence of a string that is delimited. We can call it a block of string.
Input string:
-EDE: end
-SDE: start
Output:
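Here is a sketch of the two-pass idea on a hypothetical input line, with end as the end delimiter and start as the start delimiter. Note this is a variant: the second pass uses the ERE alternation form from earlier rather than the BRE form, so it is not necessarily the exact command from the original post:

```shell
line='foobar start block #1 end barfoo start block #2 end'

# Pass 1: keep everything up to and including the first "end".
# Pass 2: delete characters one by one until "start" begins, then keep the rest.
block=$(printf '%s\n' "$line" | sed -E -e 's/(end).*/\1/' -e 's/(start.*)|./\1/g')

printf '%s\n' "$block"
```

After the first pass the line is "foobar start block #1 end"; the second pass trims the leading "foobar " and leaves "start block #1 end".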
First regex \(end\).* matches and captures the first end delimiter end and substitutes the whole match with the captured characters, which is the end delimiter. At this stage our output is: foobar start block #1 end.
Then the result is passed to the second regex \(\(start.*\)*.\)*, which is the same as the POSIX BRE version above. It matches a single character if the start delimiter start is not matched; otherwise it matches and captures the start delimiter and matches the rest of the characters.
Directly answering your question
Using approach #2 (delimited expression) you should select two appropriate expressions:
EDE: [^:/]\/
SDE: http:
Usage:
Output:
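A sketch of how those two expressions might be plugged in for the URL from the question. Again this is the ERE variant of the second pass (an assumption on my side, not necessarily the exact command from the original post), and | is used as the delimiter in the first pass so the slash needs no escaping:

```shell
url='http://www.suepearson.co.uk/product/174/71/3816/'

# Pass 1: "[^:/]/" is the first slash NOT preceded by ":" or "/", i.e. the one
# right after the host name; everything after it is dropped.
# Pass 2: anything before "http:" is deleted, and the rest is kept.
domain=$(printf '%s\n' "$url" | sed -E -e 's|([^:/]/).*|\1|' -e 's/(http:.*)|./\1/g')

printf '%s\n' "$domain"
```

For the question's input this yields http://www.suepearson.co.uk/ with the trailing slash.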