I'm making a script that calculates the distribution of words on the web. What I have to do is visit as many random web sites as I can, count the occurrences of each word on those sites, list them, and order them so that the word that occurs most often is at the top of the list. What I'm doing is generating random IP addresses:
# first octet: 1-255 (avoid 0), remaining octets: 0-255
a=$(( RANDOM % 255 + 1 ))
b=$(( RANDOM % 256 ))
c=$(( RANDOM % 256 ))
d=$(( RANDOM % 256 ))
ip=$a.$b.$c.$d
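(The +1 on the first octet just keeps the address out of the reserved 0.0.0.0/8 range.)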
After that I use nmap to check whether port 80 or 8080 is open on that address, so that there is a chance it hosts a web site.
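This is roughly how that step looks in the script (a simplified sketch; --open and -oG just make the output easy to grep, and a host that passes may still not serve HTTP):

# check whether port 80 or 8080 is open; -oG - prints grepable output on stdout
if nmap -p 80,8080 --open -oG - "$ip" | grep -q '/open/'; then
    echo "$ip might host a web server"
fi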
If I'm sure the IP doesn't belong to a web site, I add the address to a blacklist file so that it doesn't get checked again.
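The bookkeeping is nothing fancy; inside the generation loop it's something like this (a sketch; blacklist.txt is just the name I use):

# skip addresses we already know are dead ends
grep -qxF "$ip" blacklist.txt 2>/dev/null && continue
# ... port check goes here; when it fails:
echo "$ip" >> blacklist.txt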
If port 80 or port 8080 is open, then I have to resolve the IP with a reverse lookup and get all the domain names that belong to that IP.
The problem is that when I run one of these commands, the output is only a single PTR record, while there can be multiple:
dig -x ipaddress +short
nslookup ipaddress
host ipaddress
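For reference, this is roughly how the lookup sits in my script (simplified; dig is the variant I'm using at the moment):

# reverse lookup; in my tests I only ever get one PTR back
ptr=$(dig -x "$ip" +short)
echo "PTR for $ip: $ptr"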
I'd prefer this to be solved in bash, but if there is a solution in C, that could help as well.
After that I dump the web site's page to a file using w3m and count the word occurrences.
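The counting itself is a standard pipeline (a sketch; page.txt and wordlist.txt are arbitrary names, and I split on anything non-alphabetic):

# dump the rendered page to plain text
w3m -dump "http://$ip/" > page.txt
# one word per line, lowercased, then count and sort by frequency
tr -cs '[:alpha:]' '\n' < page.txt | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn > wordlist.txt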
Here I have another problem as well: is there a way to retrieve all the publicly available pages that belong to the site, and not only the index page?
Any help is appreciated.