I have recently been learning Python and am trying my hand at building a web scraper. It's nothing fancy at all; its only purpose is to get the data off a betting website and put that data into Excel.
Most of the issues are solvable and I'm having a good little mess around. However, I'm hitting a massive hurdle over one issue. If a site loads a table of horses and lists current betting prices, this information is not in any source file. The clue is that this data is sometimes live, with the numbers obviously being updated from some remote server. The HTML on my PC simply has a hole where their servers are pushing through all the interesting data that I need.
Now, my experience with dynamic web content is low, so this is something I'm having trouble getting my head around.
I think Java or JavaScript is the key; this pops up often.
The scraper is simply an odds comparison engine. Some sites have APIs, but I need this for those that don't. I'm using the scrapy library with Python 2.7.
I do apologize if this question is too open-ended. In short, my question is: how can scrapy be used to scrape this dynamic data so that I can use it? So that I can scrape this betting odds data in real time?
Many times when crawling we run into problems where content that is rendered on the page is generated with JavaScript, and therefore Scrapy is unable to crawl it (e.g. AJAX requests, jQuery craziness).
However, if you use Scrapy along with the web testing framework Selenium, then you are able to crawl anything displayed in a normal web browser.
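Something along these lines should work. This is only a minimal sketch, built on the old Scrapy and Selenium RC APIs: it assumes a Selenium server already running on localhost:4444, and the domain, XPaths, and browser string are placeholders you'd adapt to your target site.
```python
import time

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from selenium import selenium  # the Selenium RC Python client


class SeleniumSpider(CrawlSpider):
    name = "SeleniumSpider"
    start_urls = ["http://www.example.com"]  # placeholder domain

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'\.html',)), callback='parse_page', follow=True),
    )

    def __init__(self):
        CrawlSpider.__init__(self)
        # Assumes a Selenium RC server is already running on localhost:4444
        self.selenium = selenium("localhost", 4444, "*firefox", "http://www.example.com")
        self.selenium.start()

    def __del__(self):
        self.selenium.stop()

    def parse_page(self, response):
        # First pass: whatever sits in the raw HTML is still available to Scrapy
        hxs = HtmlXPathSelector(response)
        static_content = hxs.select('//div').extract()

        # Second pass: load the same URL in a real browser via Selenium
        sel = self.selenium
        sel.open(response.url)
        time.sleep(2.5)  # crude wait for the JavaScript to finish rendering

        # Now the JavaScript-generated content is reachable too
        rendered_html = sel.get_html_source()
        # ... build and yield your items from static_content / rendered_html ...
```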
Some things to note:
- You must have the Python version of Selenium RC installed for this to work, and you must have set up Selenium properly. This is also just a template crawler; you could get much crazier and more advanced, but I just wanted to show the basic idea.
- As the code stands now, you will be making two requests for any given URL: one request is made by Scrapy and the other by Selenium. I am sure there are ways around this so that Selenium makes the one and only request, but I did not bother to implement that, and by doing two requests you get to crawl the page with Scrapy too.
This is quite powerful because now you have the entire rendered DOM available to crawl, and you can still use all the nice crawling features in Scrapy. This will make for slower crawling, of course, but depending on how much you need the rendered DOM it might be worth the wait.
Reference: http://snipplr.com/view/66998/
Another solution would be to implement a download handler or a downloader middleware. The following is an example of a middleware using Selenium with a headless PhantomJS webdriver:
I wanted the ability to tell different spiders which middleware to use, so I implemented this wrapper:
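Something like this, for example. It is a sketch rather than exact code: the class name `JsDownload` is just what I call it here, and it assumes the phantomjs binary is on your PATH.
```python
# middlewares.py -- a downloader middleware that renders the page with a
# headless PhantomJS webdriver before handing it back to Scrapy.
from scrapy.http import HtmlResponse
from selenium import webdriver


class JsDownload(object):

    def process_request(self, request, spider):
        driver = webdriver.PhantomJS()  # assumes phantomjs is on your PATH
        try:
            driver.get(request.url)
            body = driver.page_source.encode('utf-8')
        finally:
            driver.quit()
        # Returning a Response from process_request short-circuits the normal
        # download, so the spider receives the rendered HTML directly.
        return HtmlResponse(request.url, body=body, encoding='utf-8')
```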
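Roughly like this (again a sketch of the idea): a decorator that only runs the middleware step when the spider has opted in through a `middleware` set, applied to `process_request` in the middleware above.
```python
# middlewares.py (continued) -- run the middleware step only for spiders
# that list this middleware class in their `middleware` set.
import functools


def check_spider_middleware(method):
    @functools.wraps(method)
    def wrapper(self, request, spider):
        msg = '%%s %s middleware step' % (self.__class__.__name__,)
        if self.__class__ in spider.middleware:
            spider.log(msg % 'executing')
            return method(self, request, spider)
        spider.log(msg % 'skipping')
        return None
    return wrapper
```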
settings.py:
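(assuming the project module is named `scraper`; the priority value is just a typical choice)
```python
DOWNLOADER_MIDDLEWARES = {'scraper.middlewares.JsDownload': 543}
```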
For the wrapper to work, all spiders must have at minimum:
And to include a middleware:
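(i.e. a `middleware` class attribute, even if it is empty; the spider name here is a placeholder)
```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'
    middleware = set([])  # no JS rendering for this spider
```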
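(the import path mirrors the example project layout above)
```python
import scrapy

from scraper.middlewares import JsDownload  # example project path


class MyJsSpider(scrapy.Spider):
    name = 'my_js_spider'
    middleware = set([JsDownload])  # this spider gets the rendering step
```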
The main advantage to implementing it this way rather than in the spider is that you only end up making one request. In A T's solution, for example, the download handler processes the request and then hands the response off to the spider. The spider then makes a brand new request in its parse_page function -- that's two requests for the same content.
Here is a simple example of using scrapy with an AJAX request. Let's look at the site http://www.rubin-kazan.ru/guestbook.html. All messages are loaded with an AJAX request. My goal is to fetch these messages with all their attributes (author, date, ...).
When I analyse the source code of the page, I can't see all these messages, because the web page uses AJAX. But I can use Firebug from Mozilla Firefox (or an equivalent tool in other browsers) to analyse the HTTP request that generates the messages on the web page.
It doesn't reload the whole page, only the part of the page that contains the messages. To see it, I click an arbitrary page number at the bottom and observe the HTTP request that is responsible for the message body.
After that, I analyse the request headers (note that I will extract this URL from the page source, from the var section; see the code below), the form data of the request (the HTTP method is "POST"), and the content of the response, which is a JSON file containing all the information I'm looking for.
Now I have to implement all this knowledge in scrapy. Let's define a spider for this purpose.
In the parse function I have the response to the first request. In RubiGuessItem I have the JSON file with all the information.
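A sketch of such a spider is below. The `url_list_gb_messages` variable name, the form fields, and the JSON handling are illustrative; check them against what you actually see in Firebug for this site.
```python
import json
import re

import scrapy
from scrapy.http import FormRequest


class RubiGuestbookSpider(scrapy.Spider):
    name = 'rubin_guestbook'
    start_urls = ['http://www.rubin-kazan.ru/guestbook.html']

    def parse(self, response):
        # The AJAX endpoint is not hard-coded: it sits in a JavaScript
        # variable in the page source (the "var section" mentioned above).
        url_list_gb_messages = re.search(
            r'url_list_gb_messages="(.*)"', response.text).group(1)
        yield FormRequest(
            'http://www.rubin-kazan.ru' + url_list_gb_messages,
            callback=self.RubiGuessItem,
            # The POST form data observed in the browser; page '1' as an example.
            formdata={'page': '1', 'uid': ''},
        )

    def RubiGuessItem(self, response):
        # The response body is JSON holding every message and its attributes.
        data = json.loads(response.text)
        # ... pull author, date, text, etc. out of `data` here ...
```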
I wonder why no one has posted the solution using Scrapy only.
Check out the blog post from the Scrapy team, SCRAPING INFINITE SCROLLING PAGES. The example scrapes the http://spidyquotes.herokuapp.com/scroll website, which uses infinite scrolling.
The idea is to use the Developer Tools of your browser to notice the AJAX requests, and then, based on that information, create the requests for Scrapy.
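For instance, a spider along these lines could walk the scrolling API page by page. This is a sketch: the `/api/quotes` URL pattern and the JSON field names are assumptions, so verify them in the Network tab before relying on them.
```python
import json

import scrapy


class SpidyQuotesSpider(scrapy.Spider):
    name = 'spidyquotes'
    # The JSON endpoint the page calls while scrolling (found via the
    # Network tab); the exact URL pattern here is an assumption.
    api_url = 'http://spidyquotes.herokuapp.com/api/quotes?page=%s'
    start_urls = [api_url % 1]

    def parse(self, response):
        data = json.loads(response.text)
        for item in data.get('quotes', []):
            yield {
                'text': item.get('text'),
                'author': item.get('author', {}).get('name'),
                'tags': item.get('tags'),
            }
        # Keep requesting pages for as long as the API reports more data.
        if data.get('has_next'):
            yield scrapy.Request(self.api_url % (data['page'] + 1))
```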
Webkit-based browsers (like Google Chrome or Safari) have built-in developer tools. In Chrome you can open them via Menu -> Tools -> Developer Tools. The Network tab allows you to see all the information about every request and response. Filter the requests down to XHR -- these are the requests made by JavaScript code. Tip: the log is cleared every time you load a page; the black dot button will preserve the log.
After analyzing the requests and responses, you can simulate these requests from your web crawler and extract the valuable data. In many cases it will be easier to get your data this way than by parsing the HTML, because that data does not contain presentation logic and is formatted to be accessed by JavaScript code.
Firefox has a similar extension, called Firebug. Some will argue that Firebug is even more powerful, but I like the simplicity of Webkit.
I was using a custom downloader middleware, but wasn't very happy with it, as I didn't manage to make the cache work with it.
A better approach was to implement a custom download handler.
There is a working example here. It looks like this:
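Roughly like the following. This is a heavily simplified sketch of the idea rather than the full working example, assuming PhantomJS and Selenium are installed; treat it as an outline of the handler interface.
```python
# handlers.py -- a minimal download handler that fetches pages with a
# headless PhantomJS webdriver instead of Scrapy's default HTTP downloader.
from scrapy.http import HtmlResponse
from selenium import webdriver
from twisted.internet import threads


class PhantomJSDownloadHandler(object):

    def __init__(self, settings):
        self.settings = settings

    def download_request(self, request, spider):
        # Selenium is blocking, so run it in a thread pool and hand Scrapy
        # back a Deferred that fires with the rendered response.
        return threads.deferToThread(self._render, request)

    def _render(self, request):
        driver = webdriver.PhantomJS()  # assumes phantomjs is on your PATH
        try:
            driver.get(request.url)
            body = driver.page_source.encode('utf-8')
        finally:
            driver.quit()
        return HtmlResponse(request.url, body=body, encoding='utf-8')
```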
Suppose your scraper is called "scraper". If you put the mentioned code inside a file called handlers.py in the root of the "scraper" folder, then you could add this to your settings.py:
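(using the handler class name from the sketch above)
```python
DOWNLOAD_HANDLERS = {
    'http': 'scraper.handlers.PhantomJSDownloadHandler',
    'https': 'scraper.handlers.PhantomJSDownloadHandler',
}
```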
And voilà, the JS-parsed DOM, with scrapy cache, retries, etc.