I am using Ruby on Rails with the Mechanize library to scrape store websites. The problem is that I often can't scrape certain elements, even though I can see them when I 'view source' on the site.
For example, Walmart's category (in the example below it is "Health") is unscrapeable. I believe this is because the HTML is produced dynamically (e.g., by JavaScript). To scrape it, I would need a browser to process the web request.
http://www.walmart.com/ip/Replacement-Sensor-Module-for-AlcoMate-Prestige-Breathalyzer/10167376
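Here's a stripped-down sketch of what I'm doing with Mechanize (the selector below is just a placeholder, not the real one):

```ruby
require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://www.walmart.com/ip/Replacement-Sensor-Module-for-AlcoMate-Prestige-Breathalyzer/10167376')

# This comes back empty even though the category shows up in the browser,
# presumably because Mechanize only fetches the raw HTML and never runs
# the page's JavaScript.
category = page.search('.category-breadcrumb')  # placeholder selector
puts category.text
```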
I am also on a Linux machine on Amazon EC2, where it would be tough to install a browser for UI scraping. Is there a Rails gem/plugin that can help me?
Thanks, all!!
Your question, rephrased, is: what's an easy way to parse an HTML document's DOM the same way a web browser would, and then execute the JavaScript in the document against the parsed DOM, all without running an actual web browser?
That's a little tricky.
However, all is not lost. Take a look at Capybara. Though created for acceptance testing, you can also use it for general grokking of documents. To execute JavaScript you'll need a driver that supports it, and since you want it to be "headless" (no browser GUI) that probably means using capybara-webkit, Akephalos, or capybara-envjs.
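For instance, here's a minimal sketch of using Capybara with capybara-webkit outside of a test suite (the selector is a placeholder; adapt it to the element you're after):

```ruby
require 'capybara'
require 'capybara/dsl'
require 'capybara/webkit'

Capybara.default_driver = :webkit  # headless WebKit; executes the page's JavaScript
Capybara.run_server     = false    # we're scraping a remote site, not a Rack app

include Capybara::DSL

visit 'http://www.walmart.com/ip/Replacement-Sensor-Module-for-AlcoMate-Prestige-Breathalyzer/10167376'

# By the time we query the DOM here, JavaScript-inserted elements exist.
puts find('.category').text  # placeholder selector

# If you'd rather do the heavy parsing elsewhere, hand the rendered DOM
# to Nokogiri: doc = Nokogiri::HTML(page.body)
```

Note that capybara-webkit still needs the Qt WebKit libraries on the EC2 box, and on a truly headless server you may have to run it under Xvfb, but that's far lighter than a full browser install.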
Another option might be Harmony, which I know nothing about except that it appears to do what you want, but it also appears to be unmaintained, so YMMV.