I'm trying to use sed to clean up lines of URLs to extract just the domain.
So from:
http://www.suepearson.co.uk/product/174/71/3816/
I want:
http://www.suepearson.co.uk/
(either with or without the trailing slash, it doesn't matter)
I have tried:
sed 's|\(http:\/\/.*?\/\).*|\1|'
and (escaping the non-greedy quantifier)
sed 's|\(http:\/\/.*\?\/\).*|\1|'
but I cannot seem to get the non-greedy quantifier to work, so it always ends up matching the whole string.
Another way, not using regex, is to split the line on a delimiter and pick out fields.
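A minimal sketch of the field/delimiter idea, assuming cut is the tool meant and that the URL always has the scheme://host/path shape. The variable names url and host are only illustrative.

```shell
url="http://www.suepearson.co.uk/product/174/71/3816/"

# Split on "/": field 1 is "http:", field 2 is empty (between the two
# slashes), and field 3 is the host name.
host=$(printf '%s\n' "$url" | cut -d'/' -f3)

echo "$host"
```

This prints only the host; the scheme is dropped because it sits in field 1.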
sed certainly has its place, but this is not one of them! As Dee has pointed out: just use cut. It is far simpler and much safer in this case. Here's an example where we extract various components from the URL using Bash syntax:
gives you:
As you can see, this is a much more flexible approach.
(all credit to Dee)
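The kind of component extraction described above might look like the following. This is a sketch, not necessarily the original poster's code; the variable names protocol, host, urlhost and urlpath are assumptions chosen for illustration.

```shell
url="http://www.suepearson.co.uk/product/174/71/3816/"

# Everything before the first ":" is the scheme.
protocol=$(printf '%s\n' "$url" | cut -d':' -f1)

# Splitting on "/": field 3 is the host, fields 1-3 are scheme plus host,
# and fields 4 onward are the path.
host=$(printf '%s\n' "$url" | cut -d'/' -f3)
urlhost=$(printf '%s\n' "$url" | cut -d'/' -f1-3)
urlpath=$(printf '%s\n' "$url" | cut -d'/' -f4-)

printf 'protocol = %s\nhost = %s\nurlhost = %s\nurlpath = %s\n' \
    "$protocol" "$host" "$urlhost" "$urlpath"
```

Note that urlhost comes out without a trailing slash, which the question says is acceptable.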
sed 's|\(http:\/\/www\.[a-z.0-9]*\/\).*|\1|'
works too.

Try [^/]* instead of .*?
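The reason this helps: POSIX basic and extended regular expressions, which sed uses, have no lazy quantifiers, so .*? is not read as "as few as possible". A negated bracket expression gets the same effect by refusing to cross a slash. A sketch applying it to the URL from the question (the variable names are illustrative):

```shell
url="http://www.suepearson.co.uk/product/174/71/3816/"

# [^/]* matches any run of non-slash characters, so the captured group
# stops at the first "/" after the host instead of the last one.
domain=$(printf '%s\n' "$url" | sed 's|\(http://[^/]*/\).*|\1|')

echo "$domain"
```

Because | is the delimiter of the s command, the slashes inside the pattern do not need escaping.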
Here is something you can do with a two-step approach and awk:
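One possible two-step awk version, offered as a sketch rather than the original poster's code: first locate the scheme://host/ prefix with match(), then print just that part with substr(). The variable names are assumptions.

```shell
url="http://www.suepearson.co.uk/product/174/71/3816/"

domain=$(printf '%s\n' "$url" | awk '{
    # Step 1: find the leading scheme://host/ part; match() sets RSTART and RLENGTH.
    if (match($0, /^[a-z]+:\/\/[^\/]+\//))
        # Step 2: print only the matched prefix.
        print substr($0, RSTART, RLENGTH)
}')

echo "$domain"
```

Lines that do not start with a scheme and host produce no output, which makes malformed input easy to spot.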
Hope that helps!