Extract parent domain name from a list of url thro

2020-07-23 04:49发布

问题:

I have a list of urls like this:

http://noto.zrobimystrone.pl/pucenter/images/NGdocs/
http://visionwebmkt.com/unsubscribe.php?M=879552&C=b744d324e38f5f3b0bcf549f1d57a3ab&L=20&N=497
http://www.meguiatramandai.com.br/unsubscribe.php?M=722&C=8410431be55bf12faac13d18982d71cd&L=1&N=3
http://www.contatoruy.in/link.php?M=86457&N=4&L=1&F=H
http://www.maxxivrimoveis.com.br/
http://www.meguiatramandai.com.br/unsubscribe.php?M=722&C=8410431be55bf12faac13d18982d71cd&L=1&N=2
http://arm.smilecire.com/ch+urch38146263923bpa.stor/imp-roved258021029his+health212149011
http://hurl.zonalrems.com/ge.tyo-ur584372780599hea+lth247408058un/der+control21211901
http://harp.doomyjupe.com/see.this-better/life+58291551346csexdrive663295668+better/how.981692016
http://beefy.toneyvaws.com/no+tice/how/35306640b+see/app=5429204last/attempt=457943182
http://kirk.yournjuju.com/shop/sam.sclub-win=ter/58387369768esame+673844946.bett.er-loo.k981686408
http://idly.theirpoem.com/veri-fy/notice-7853508818b2glob/al=who.43639603inc.lusion-610549278
http://wva188.suleacatan.com/credit-score/review/-551694841511001sfdghsfdgsdfg63887839
http://cop.forterins.com/app.lyto=face962540097dtolo+oko.ung268570307yo.un-ger8752507
http://vni116.gaelsyaray.com/qertqetert//-dghjghjghd5531864856415612229498430
http://ticket.prategama.com/shop/sam.sclub-win=ter/752490935same+226373195.bett.er-loo.k212801
http://cbu125.quetxviii.com/cvbnvbn7551116db537203--swrtytry664896546
http://c5a.dicadodia.com.br/pass4sp09/NetAffProTeste-1.html
http://snub.woadsbevy.com/ama/zing-753773417oppe-tun/ity+217801.is-here/now=236922473
http://mkt.livrariacultura.com.br/pub/cc?_ri_=X0Gzc2X%3DWQpglLjHJlYQGgzfB7tPi0PuyyJ71ES

I wanna extract only the parents domain names, for example:

http://noto.zrobimystrone.pl/pucenter/images/NGdocs/
http://visionwebmkt.com/unsubscribe.php?M=879552&C=b744d324e38f5f3b0bcf549f1d57a3ab&L=20&N=497
http://www.meguiatramandai.com.br/unsubscribe.php?M=722&C=8410431be55bf12faac13d18

Into

zrobimystrone.pl
visionwebmkt.com
meguiatramandai.com.br

I have tried

awk '{gsub("http://|/.*","")}1' list.txt

and got the following results:

noto.zrobimystrone.pl
visionwebmkt.com
www.meguiatramandai.com.br
www.contatoruy.in
www.maxxivrimoveis.com.br
www.meguiatramandai.com.br
arm.smilecire.com
hurl.zonalrems.com
harp.doomyjupe.com
beefy.toneyvaws.com

but dont know how to get only the parent name from noto.zrobimystrone.pl for instance.

回答1:

Using awk

awk -F \/ '{l=split($3,a,"."); print (a[l-1]=="com"?a[l-2] OFS:X) a[l-1] OFS a[l]}' OFS="." file|sort -u

contatoruy.in
dicadodia.com.br
doomyjupe.com
forterins.com
gaelsyaray.com
livrariacultura.com.br
maxxivrimoveis.com.br
meguiatramandai.com.br
prategama.com
quetxviii.com
smilecire.com
suleacatan.com
theirpoem.com
toneyvaws.com
visionwebmkt.com
woadsbevy.com
yournjuju.com
zonalrems.com
zrobimystrone.pl


回答2:

You can use this awk:

awk -F'.' '{gsub("http://|/.*","")} NF>2{$1="";$0=substr($0, 2)}1' OFS='.' list.txt
zrobimystrone.pl
visionwebmkt.com
meguiatramandai.com.br
contatoruy.in
maxxivrimoveis.com.br
meguiatramandai.com.br
smilecire.com
zonalrems.com
doomyjupe.com
toneyvaws.com
yournjuju.com
theirpoem.com
suleacatan.com
forterins.com
gaelsyaray.com
prategama.com
quetxviii.com
dicadodia.com.br
woadsbevy.com
livrariacultura.com.br


回答3:

A "simple" bash solution. Tested in bash shell on Solaris 11.2 x86.

#!/bin/bash
while IFS=/ read HTTP NULL FQDN PAGE
do
    PARENT=${FQDN#*.}
    if [[ $PARENT != *"."* ]]
        then echo $FQDN
        else echo $PARENT
    fi
done < fileOfURLs.txt

Without the string contains pattern test, too much of the domain could be stripped away. The if paragraph can be reduced,so the whole script now looks like this:

#!/bin/bash
while IFS=/ read HTTP NULL FQDN PAGE
do
    PARENT=${FQDN#*.}
    [[ $PARENT != *"."* ]] && echo $FQDN || echo $PARENT
done < fileOfURLs.txt

The bash variable substitution is taking the contents of the variable FQDN and stripping from the left any character up to and including the first dot.

The test condition is asking if the contents of the PARENT variable does not contain a dot. If it does not hold a dot somewhere in the value, the test evaluates to true and will display the original FQDN contents. If the test evaluates to false, (there is still a dot in the value) the contents of PARENT are displayed.



回答4:

I guess it depends on what you mean by parent. If by "parent", you mean the top of the zone apex in DNS (e.g., zrobimystrone.pl ), then the right way to do this is to look that up in DNS. There's a trick with DNS where you get back the parent zone SOA record if you ask for the SOA for any name.. So, try this:

for i in $(awk '{gsub("http://|/.*","")}1' list.txt); do dig soa $i | grep -v ^\; | grep SOA | awk '{print $1}'; done

This will give you a much more accurate list, but it runs way slower and is sub-optimal. The other answers don't take into account all the possible variations of TLD names used within TLDs, e.g., www.somecompany.org.uk, so it all depends on how accurate you need this to be.



回答5:

An easy solution to get parent domain name

echo http://www.humkinar.pk | awk -F '/' '{print $3}'
www.humkinar.pk