I often write crawlers to compile information, and whenever I come across a website with info I need, I write a new crawler specific to that site, using shell scripts most of the time and sometimes PHP.
The way I do it is with a simple `for` loop to iterate over the page list, `wget` to download each page, and `sed`, `tr`, `awk`, or other utilities to clean up the page and grab the specific info I need.
The whole process takes quite some time, depending on the site, and most of it goes into downloading all the pages. And I often run into an AJAX-driven site, which complicates everything.
I was wondering if there are better or faster ways to do this, or even applications or languages designed to help with this kind of work.