I have a list of URLs like this:
http://noto.zrobimystrone.pl/pucenter/images/NGdocs/
http://visionwebmkt.com/unsubscribe.php?M=879552&C=b744d324e38f5f3b0bcf549f1d57a3ab&L=20&N=497
http://www.meguiatramandai.com.br/unsubscribe.php?M=722&C=8410431be55bf12faac13d18982d71cd&L=1&N=3
http://www.contatoruy.in/link.php?M=86457&N=4&L=1&F=H
http://www.maxxivrimoveis.com.br/
http://www.meguiatramandai.com.br/unsubscribe.php?M=722&C=8410431be55bf12faac13d18982d71cd&L=1&N=2
http://arm.smilecire.com/ch+urch38146263923bpa.stor/imp-roved258021029his+health212149011
http://hurl.zonalrems.com/ge.tyo-ur584372780599hea+lth247408058un/der+control21211901
http://harp.doomyjupe.com/see.this-better/life+58291551346csexdrive663295668+better/how.981692016
http://beefy.toneyvaws.com/no+tice/how/35306640b+see/app=5429204last/attempt=457943182
http://kirk.yournjuju.com/shop/sam.sclub-win=ter/58387369768esame+673844946.bett.er-loo.k981686408
http://idly.theirpoem.com/veri-fy/notice-7853508818b2glob/al=who.43639603inc.lusion-610549278
http://wva188.suleacatan.com/credit-score/review/-551694841511001sfdghsfdgsdfg63887839
http://cop.forterins.com/app.lyto=face962540097dtolo+oko.ung268570307yo.un-ger8752507
http://vni116.gaelsyaray.com/qertqetert//-dghjghjghd5531864856415612229498430
http://ticket.prategama.com/shop/sam.sclub-win=ter/752490935same+226373195.bett.er-loo.k212801
http://cbu125.quetxviii.com/cvbnvbn7551116db537203--swrtytry664896546
http://c5a.dicadodia.com.br/pass4sp09/NetAffProTeste-1.html
http://snub.woadsbevy.com/ama/zing-753773417oppe-tun/ity+217801.is-here/now=236922473
http://mkt.livrariacultura.com.br/pub/cc?_ri_=X0Gzc2X%3DWQpglLjHJlYQGgzfB7tPi0PuyyJ71ES
I want to extract only the parent domain names. For example, I want to turn:
http://noto.zrobimystrone.pl/pucenter/images/NGdocs/
http://visionwebmkt.com/unsubscribe.php?M=879552&C=b744d324e38f5f3b0bcf549f1d57a3ab&L=20&N=497
http://www.meguiatramandai.com.br/unsubscribe.php?M=722&C=8410431be55bf12faac13d18
into:
zrobimystrone.pl
visionwebmkt.com
meguiatramandai.com.br
I have tried
awk '{gsub("http://|/.*","")}1' list.txt
and got the following results:
noto.zrobimystrone.pl
visionwebmkt.com
www.meguiatramandai.com.br
www.contatoruy.in
www.maxxivrimoveis.com.br
www.meguiatramandai.com.br
arm.smilecire.com
hurl.zonalrems.com
harp.doomyjupe.com
beefy.toneyvaws.com
but I don't know how to get only the parent name from noto.zrobimystrone.pl, for instance.
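Using awk (this keeps the last two labels of the host name, or the last three when the second-to-last label is "com", as in com.br):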
awk -F \/ '{l=split($3,a,"."); print (a[l-1]=="com"?a[l-2] OFS:X) a[l-1] OFS a[l]}' OFS="." file|sort -u
contatoruy.in
dicadodia.com.br
doomyjupe.com
forterins.com
gaelsyaray.com
livrariacultura.com.br
maxxivrimoveis.com.br
meguiatramandai.com.br
prategama.com
quetxviii.com
smilecire.com
suleacatan.com
theirpoem.com
toneyvaws.com
visionwebmkt.com
woadsbevy.com
yournjuju.com
zonalrems.com
zrobimystrone.pl
You can use this awk command, which drops the first label whenever the host name has more than two:
awk -F'.' '{gsub("http://|/.*","")} NF>2{$1="";$0=substr($0, 2)}1' OFS='.' list.txt
zrobimystrone.pl
visionwebmkt.com
meguiatramandai.com.br
contatoruy.in
maxxivrimoveis.com.br
meguiatramandai.com.br
smilecire.com
zonalrems.com
doomyjupe.com
toneyvaws.com
yournjuju.com
theirpoem.com
suleacatan.com
forterins.com
gaelsyaray.com
prategama.com
quetxviii.com
dicadodia.com.br
woadsbevy.com
livrariacultura.com.br
A "simple" bash solution. Tested in bash shell on Solaris 11.2 x86.
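#!/bin/bash
# Split each URL on "/": "http:", an empty field, the host name, the rest.
while IFS=/ read -r HTTP NULL FQDN PAGE
do
    # Strip the leftmost label (everything up to and including the first dot).
    PARENT=${FQDN#*.}
    if [[ $PARENT != *"."* ]]
    then echo "$FQDN"
    else echo "$PARENT"
    fi
done < fileOfURLs.txt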
Without the test for a remaining dot, too much of the domain could be stripped away (visionwebmkt.com would become com). The if statement can be shortened, so the whole script now looks like this:
#!/bin/bash
while IFS=/ read -r HTTP NULL FQDN PAGE
do
    PARENT=${FQDN#*.}
    [[ $PARENT != *"."* ]] && echo "$FQDN" || echo "$PARENT"
done < fileOfURLs.txt
The bash parameter expansion ${FQDN#*.} takes the contents of the variable FQDN and strips from the left everything up to and including the first dot.
The test condition asks whether the contents of the PARENT variable contain no dot. If there is no dot anywhere in the value, the test evaluates to true and the original FQDN contents are displayed. If the test evaluates to false (there is still a dot in the value), the contents of PARENT are displayed.
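To make the expansion and the test easier to try out on single host names, here is a minimal sketch that wraps the same logic in a function. The function name parent_of is my own and not part of the script above.

```shell
#!/bin/bash
# parent_of is a hypothetical helper name used only for this illustration.
parent_of() {
    local fqdn=$1
    # Remove the shortest leading match of "*." (the leftmost label and its dot).
    local parent=${fqdn#*.}
    if [[ $parent != *"."* ]]; then
        # No dot left: the input was already a bare domain, so keep it whole.
        printf '%s\n' "$fqdn"
    else
        printf '%s\n' "$parent"
    fi
}

parent_of noto.zrobimystrone.pl        # zrobimystrone.pl
parent_of visionwebmkt.com             # visionwebmkt.com
parent_of www.meguiatramandai.com.br   # meguiatramandai.com.br
```

Note that a name such as meguiatramandai.com.br without the leading www would come out as com.br, because exactly one label is always removed when more than two are present.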
I guess it depends on what you mean by "parent". If you mean the zone apex in DNS (e.g., zrobimystrone.pl), then the right way to do this is to look it up in DNS. There's a trick with DNS: if you ask for the SOA record of any name, you get back the SOA record of the zone that contains it. So, try this:
for i in $(awk '{gsub("http://|/.*","")}1' list.txt); do dig soa $i | grep -v ^\; | grep SOA | awk '{print $1}'; done
This will give you a much more accurate list, but it runs much slower because it makes one DNS query per URL. The other answers don't take into account all the possible second-level suffixes used under country-code TLDs, e.g., www.somecompany.org.uk, so it all depends on how accurate you need this to be.
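If DNS lookups are too slow, a middle ground is to check each host name against a list of known multi-label suffixes. The sketch below is only an illustration: the function name registrable_domain and the three-entry suffix list are my own assumptions, and a real script would need the complete Public Suffix List rather than this hard-coded sample.

```shell
#!/bin/bash
# Hypothetical, deliberately tiny sample of multi-label suffixes.
# A real script would consult the full Public Suffix List instead.
suffixes=(com.br org.uk co.uk)

# registrable_domain is an illustrative name, not an existing tool.
registrable_domain() {
    local host=$1 suffix rest
    # A name without any dot has no parent to extract.
    if [[ $host != *.* ]]; then
        printf '%s\n' "$host"
        return
    fi
    for suffix in "${suffixes[@]}"; do
        if [[ $host == *".$suffix" ]]; then
            rest=${host%".$suffix"}    # labels to the left of the suffix
            printf '%s\n' "${rest##*.}.$suffix"
            return
        fi
    done
    # Default: assume a single-label TLD and keep the last two labels.
    rest=${host%.*}
    printf '%s\n' "${rest##*.}.${host##*.}"
}

registrable_domain www.somecompany.org.uk   # somecompany.org.uk
registrable_domain noto.zrobimystrone.pl    # zrobimystrone.pl
```

Unlike the label-counting approaches, this also gives meguiatramandai.com.br for a host name that has no www prefix, as long as its suffix is in the list.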
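An easy way to get the host name out of a URL (note that this prints the full host name, including any www prefix, rather than the parent domain):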
echo http://www.humkinar.pk | awk -F '/' '{print $3}'
www.humkinar.pk