I'm trying to use sed to clean up lines of URLs to extract just the domain.
So from:
http://www.suepearson.co.uk/product/174/71/3816/
I want:
http://www.suepearson.co.uk/
(either with or without the trailing slash, it doesn't matter)
I have tried:
sed 's|\(http:\/\/.*?\/\).*|\1|'
and (escaping the non-greedy quantifier)
sed 's|\(http:\/\/.*\?\/\).*|\1|'
but I cannot seem to get the non-greedy quantifier to work, so it always ends up matching the whole string.
There is still hope to solve this using pure (GNU) sed. Although this is not a generic solution, in some cases you can use "loops" to eliminate all the unnecessary parts of the string, like this:
The only problem here is that it will also cut the last separator character ('/'), but if you really need it you can simply put it back after the "loop" has finished; just append this additional command at the end of the previous command line:
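The commands themselves are not shown above, so here is a minimal sketch of what such a loop could look like. It assumes a label plus the t (branch if substituted) command, which may differ from the exact command this answer used. The function names strip_to_domain and strip_to_domain_slash are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: each pass cuts the last "/..." segment off the line;
# "t loop" jumps back to the label as long as a substitution happened.
strip_to_domain() {
    sed -E -e ':loop' -e 's|(http://.+)/.*|\1|' -e 't loop'
}

# Same loop, followed by a command that puts the trailing "/" back
# once the loop has finished.
strip_to_domain_slash() {
    sed -E -e ':loop' -e 's|(http://.+)/.*|\1|' -e 't loop' -e 's|$|/|'
}

url='http://www.suepearson.co.uk/product/174/71/3816/'
printf '%s\n' "$url" | strip_to_domain        # http://www.suepearson.co.uk
printf '%s\n' "$url" | strip_to_domain_slash  # http://www.suepearson.co.uk/
```

The loop stops by itself because, once only the host is left, there is no further "/" after "http://" for the pattern to match.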
Another sed version:
It matches / followed by an alphanumeric character (so not another forward slash) as well as the rest of the characters until the end of the line. Afterwards it replaces the match with nothing (i.e. deletes it).

Because you specifically stated you're trying to use sed (instead of perl, cut, etc.), try grouping. This circumvents the non-greedy quantifier potentially not being recognized. The first group is the protocol (e.g. 'http://', 'https://', 'tcp://', etc.). The second group is the domain:
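A minimal sketch of the two-group idea follows. The exact pattern is an assumption (the answer's own command is not shown), and extract_domain is a hypothetical name. Group 1 captures the protocol, group 2 captures everything up to the next slash, and the rest of the line is discarded.

```shell
#!/bin/sh
# Hypothetical helper: \1 is the protocol ("http://", "https://", ...),
# \2 is the domain, i.e. the longest run of characters that are not "/".
extract_domain() {
    sed 's|^\([a-z]*://\)\([^/]*\).*|\1\2|'
}

printf '%s\n' 'http://www.suepearson.co.uk/product/174/71/3816/' | extract_domain
# http://www.suepearson.co.uk
printf '%s\n' 'https://example.org/a/b' | extract_domain
# https://example.org
```

Because [^/]* cannot cross a slash, the greedy quantifier stops at the end of the domain without any non-greedy operator.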
If you're not familiar with grouping, start here.
This is how to robustly do non-greedy matching of multi-character strings using sed. Let's say you want to change every foo...bar to <foo...bar>, wrapping each shortest run from a foo to the next bar in angle brackets.
To do that you convert foo and bar to individual characters and then use the negation of those characters between them:
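The full command can be assembled from the individual substitutions this answer goes on to explain. The sketch below does that. The sample input line and the function names wrap_all and wrap_second are assumptions for illustration. The second function uses the numeric occurrence flag that the answer mentions further down.

```shell
#!/bin/sh
# Wrap every shortest foo...bar span in angle brackets.
wrap_all() {
    sed -e 's/@/@A/g; s/{/@B/g; s/}/@C/g' \
        -e 's/foo/{/g; s/bar/}/g' \
        -e 's/{[^{}]*}/<&>/g' \
        -e 's/}/bar/g; s/{/foo/g' \
        -e 's/@C/}/g; s/@B/{/g; s/@A/@/g'
}

# Same pipeline, but the numeric flag 2 wraps only the second span.
wrap_second() {
    sed -e 's/@/@A/g; s/{/@B/g; s/}/@C/g' \
        -e 's/foo/{/g; s/bar/}/g' \
        -e 's/{[^{}]*}/<&>/2' \
        -e 's/}/bar/g; s/{/foo/g' \
        -e 's/@C/}/g; s/@B/{/g; s/@A/@/g'
}

# Assumed sample input, not taken from the original post.
line='ABC foo DEF bar GHI foo KLM bar NOP foo QRS bar TUV'
printf '%s\n' "$line" | wrap_all
# ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV
printf '%s\n' "$line" | wrap_second
# ABC foo DEF bar GHI <foo KLM bar> NOP foo QRS bar TUV
```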
In the above:
- s/@/@A/g; s/{/@B/g; s/}/@C/g is converting { and } to placeholder strings that cannot exist in the input, so those chars are then available to convert foo and bar to.
- s/foo/{/g; s/bar/}/g is converting foo and bar to { and } respectively.
- s/{[^{}]*}/<&>/g is performing the op we want: converting foo...bar to <foo...bar>.
- s/}/bar/g; s/{/foo/g is converting { and } back to foo and bar.
- s/@C/}/g; s/@B/{/g; s/@A/@/g is converting the placeholder strings back to their original characters.

Note that the above does not rely on any particular string not being present in the input, as it manufactures such strings in the first step. Nor does it care which occurrence of any particular regexp you want to match, since you can use {[^{}]*} as many times as necessary in the expression to isolate the actual match you want and/or use sed's numeric match operator, e.g. s/{[^{}]*}/<&>/2 to replace only the 2nd occurrence.

With sed, I usually implement non-greedy search by searching for anything except the separator until the separator:
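sed -n 's;\(http://[^/]*\)/.*;\1;p'

This works as follows:
- -n: don't print lines automatically.
- s/<pattern>/<replace>/p: search for the pattern, replace it, and print the result.
- ; is used as the search command separator instead of / to make it easier to type, so s;<pattern>;<replace>;p
- \( ... \) remembers the match between the brackets, later accessible with \1, \2, ...
- http:// is matched literally.
- [] matches anything listed in the brackets; [ab/] would mean either a or b or /
- A leading ^ in [] means "not", so it matches anything but the things in the []
- So [^/] means any character except /
- * repeats the previous item, so [^/]* means a run of characters other than /
- So far, sed -n 's;\(http://[^/]*\) means: search for http:// followed by any characters except /, and remember what you've found.
- We want to search until the end of the domain, so stop on the next / by adding another / at the end: sed -n 's;\(http://[^/]*\)/' but we also want to match the rest of the line after the domain, so add .*
- Now the match remembered in group 1 (\1) is the domain, so replace the matched line with the stuff saved in group \1 and print: sed -n 's;\(http://[^/]*\)/.*;\1;p'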
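To try the finished command, here is a small sketch that feeds it the URL from the question plus a non-URL line. The function name domain_only and the second input line are assumptions added for illustration.

```shell
#!/bin/sh
# Wrapper around the command built up above.
domain_only() {
    sed -n 's;\(http://[^/]*\)/.*;\1;p'
}

# With -n and the p flag, only lines where the substitution
# succeeded are printed; the second line is silently dropped.
printf '%s\n' 'http://www.suepearson.co.uk/product/174/71/3816/' 'not a url' | domain_only
# http://www.suepearson.co.uk
```

Note that the pattern requires a / after the domain, so a bare line such as http://example.com would not match and would not be printed.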
If you want to include the slash after the domain as well, put it inside the group to remember, i.e. sed -n 's;\(http://[^/]*/\).*;\1;p', which keeps the trailing / in the output.
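A sketch of that trailing-slash variant, again using the URL from the question. The function name domain_with_slash is hypothetical.

```shell
#!/bin/sh
# The "/" now sits inside \( ... \), so it is part of \1.
domain_with_slash() {
    sed -n 's;\(http://[^/]*/\).*;\1;p'
}

printf '%s\n' 'http://www.suepearson.co.uk/product/174/71/3816/' | domain_with_slash
# http://www.suepearson.co.uk/
```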
sed does not support a "non-greedy" operator.
You have to use the "[]" operator to exclude "/" from the match.
P.S. There is no need to backslash "/" when a different character is used as the s delimiter.
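To illustrate the P.S., the sketch below runs the same substitution twice with | as the delimiter, once with the slashes escaped as in the question and once without. The variable names are assumptions, and the behaviour of \/ is as observed with GNU sed.

```shell
#!/bin/sh
url='http://www.suepearson.co.uk/product/174/71/3816/'

# Slashes escaped, as in the question's attempts.
with_escapes=$(printf '%s\n' "$url" | sed 's|\(http:\/\/[^/]*\).*|\1|')

# Slashes left alone; "|" is the delimiter, so "/" is an ordinary character.
without_escapes=$(printf '%s\n' "$url" | sed 's|\(http://[^/]*\).*|\1|')

printf '%s\n' "$with_escapes" "$without_escapes"
# http://www.suepearson.co.uk
# http://www.suepearson.co.uk
```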