Get random site names in bash

Posted 2020-05-05 17:26

Question:

I'm making a script that calculates the distribution of words on the web. What I have to do is check as many random web sites as I can, count the words on those sites, list them, and order them so that the word that occurs most often is at the top of the list. What I'm doing is generating random IP numbers:

a=$(( RANDOM % 255 + 1 ))
b=$(( RANDOM % 256 ))
c=$(( RANDOM % 256 ))
d=$(( RANDOM % 256 ))
ip=$a.$b.$c.$d

After that, I use nmap to check whether port 80 or 8080 is open on those hosts, so that there is a chance it's a web site.

If I'm sure the IP doesn't belong to a web site, I add the address to a blacklist file so that it doesn't get checked again.
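A minimal sketch of the probe-and-blacklist step, assuming nmap is installed; the file name blacklist.txt and the function names are my own choices:

```shell
#!/bin/bash
# Sketch of the probe-and-blacklist step; "blacklist.txt" is an assumed name.
blacklist=blacklist.txt

is_blacklisted() {           # exact full-line match against the blacklist
    grep -qxF "$1" "$blacklist" 2>/dev/null
}

has_web_port() {             # true if nmap reports 80 or 8080 open
    nmap -p 80,8080 --open -oG - "$1" 2>/dev/null | grep -q '/open/'
}

check_host() {
    is_blacklisted "$1" && return 1
    if has_web_port "$1"; then
        echo "possible web server: $1"
    else
        echo "$1" >> "$blacklist"   # remember dead IPs, never probe them twice
        return 1
    fi
}
```

The `-oG -` flag asks nmap for its greppable output on stdout, so a simple grep for `/open/` is enough to detect an open port.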

If port 80 or port 8080 is open, then I have to resolve the IP with a reverse lookup and get all the domain names that belong to that IP.

The problem is that if I run one of these commands, the output is only a single PTR record, while there can be multiple:

dig -x ipaddress +short
nslookup ipaddress
host ipaddress
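For what it's worth, `dig +short -x` does print one line per PTR record when the reverse zone actually publishes several; if you only see one, the zone likely only has one. A sketch that loops over whatever comes back (strip_dot and reverse_names are made-up helper names):

```shell
#!/bin/bash
# Sketch: gather every PTR record for an IP (assumes dig is installed).

strip_dot() {                 # dig appends a trailing dot to each name
    printf '%s\n' "${1%.}"
}

reverse_names() {             # prints one candidate domain name per line
    dig +short -x "$1" | while read -r ptr; do
        strip_dot "$ptr"
    done
}
```

Keep in mind that reverse DNS cannot enumerate every name pointing at an IP: only the PTR records the owner chose to publish come back, so virtual hosts will still be invisible.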

I'd prefer this to be solved in bash, but if there is a solution in C, it could help as well.

After that, I copy the web site page to a file using w3m and count the word occurrences.
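The counting step might look like the sketch below; count_words and page_word_freq are names I made up, and treating everything non-alphabetic as a word separator is a simplification:

```shell
#!/bin/bash
# Sketch of the w3m dump + word frequency step (assumes w3m is installed).

count_words() {               # stdin -> "count word" lines, most frequent first
    tr -cs '[:alpha:]' '\n' \
      | tr '[:upper:]' '[:lower:]' \
      | grep -v '^$' \
      | sort | uniq -c | sort -rn
}

page_word_freq() {
    w3m -dump "$1" | count_words
}
```

For example, `printf 'The cat and the dog' | count_words` puts `2 the` at the top of the list.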

Here I have another problem as well: is there a way to check all the available public pages that belong to the site, and not only the index one?
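There is no general way to enumerate every public page, but following the site's own links gets the crawlable ones. A sketch with wget; the depth of 2 and the pages/ output directory are arbitrary choices:

```shell
#!/bin/bash
# Sketch: mirror the pages reachable from the index so more than one page
# gets counted (assumes wget is installed).

mirror_site() {
    wget --recursive --level=2 --no-parent \
         --directory-prefix=pages --quiet "http://$1/"
}
```

`--no-parent` keeps the crawl inside the site, and the downloaded files under pages/ can then be fed through the same word-counting pipeline.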

Any help is appreciated.

Answer 1:

A lot of websites are not accessible purely by IP address, due to virtual hosts and such, so I'm not sure you'd get a uniform distribution of words on the web by doing this. Moreover, the IP addresses that host websites are not evenly covered by randomly generated 32-bit numbers: hosting companies with the majority of real websites are concentrated in small ranges, and many other IPs are endpoints of ISPs with probably nothing hosted.

Given the above, and the problem you are trying to solve, I would actually recommend getting a distribution of URLs to crawl and computing the word frequency on those. A good tool for doing that would be something like WWW::Mechanize in Perl, or its equivalents in Python, Ruby, etc. As your limiting factor is going to be your internet connection and not your processing speed, there's no advantage to doing this in a low-level language. This way, you'll also have a higher chance of hitting multiple sites at the same IP.
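Staying in shell, this approach might be sketched as below, with curl standing in for WWW::Mechanize; urls.txt (one URL per line) and frequency.txt are assumed names:

```shell
#!/bin/bash
# Sketch: aggregate word frequencies over a list of URLs instead of random IPs
# (assumes curl and w3m are installed).

crawl_and_count() {
    while read -r url; do
        curl -fsL "$url" | w3m -dump -T text/html
    done < "$1" \
      | tr -cs '[:alpha:]' '\n' \
      | tr '[:upper:]' '[:lower:]' \
      | grep -v '^$' \
      | sort | uniq -c | sort -rn
}

# Usage: crawl_and_count urls.txt > frequency.txt
```

Piping through `w3m -dump -T text/html` strips markup so only the rendered text is counted, matching what the original script did for single pages.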



Tags: linux bash dns