I'd like to implement some unit tests in a Scrapy project (screen scraper / web crawler). Since a project is run through the `scrapy crawl` command, I can't simply run it through something like nose. Since Scrapy is built on top of Twisted, can I use its unit-testing framework, Trial? If so, how? Otherwise I'd like to get nose working.
Update:
I've been talking on Scrapy-Users, and I gather I am supposed to "build the Response in the test code, and then call the method with the response and assert that [I] get the expected items/requests in the output". I can't seem to get this to work, though.
I can build a unit-test test class and in a test:
- create a response object
- try to call the parse method of my spider with the response object
However, it ends up generating this traceback. Any insight as to why?
The way I've done it is to create fake responses; this way you can test the parse function offline. But you get the real situation by using real HTML.

A problem with this approach is that your local HTML file may not reflect the latest state online. So if the HTML changes online you may have a big bug, but your test cases will still pass. So this may not be the best way to test.

My current workflow is: whenever there is an error, I send an email to admin with the URL. Then for that specific error I create an HTML file with the content that is causing the error. Then I create a unittest for it.
This is the code I use to create sample Scrapy HTTP responses for testing from a local HTML file:
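A minimal sketch of such a helper; the default URL is a placeholder, and the path handling assumes the helper lives inside the tests package:

```python
import os

from scrapy.http import HtmlResponse, Request


def fake_response_from_file(file_name, url=None):
    """Create a fake Scrapy HTTP response from an HTML file on disk.

    file_name: path to the HTML file, absolute or relative to this file.
    url: the URL the fake response should pretend to come from.
    """
    if url is None:
        url = 'http://www.example.com'

    # Resolve relative paths against the directory containing this file.
    if not file_name.startswith('/'):
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(responses_dir, file_name)
    else:
        file_path = file_name

    with open(file_path, 'rb') as f:
        file_content = f.read()

    return HtmlResponse(url=url, request=Request(url=url), body=file_content)
```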
The sample HTML file is located at `scrapyproject/tests/responses/osdir/sample.html`.

Then the test case, located at `scrapyproject/tests/test_osdir.py`, could look as follows:
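A sketch of what that file might contain; the spider class, import paths, and item field are hypothetical, so adjust them to your project:

```python
import unittest

# Hypothetical import paths; point these at your spider and the helper above.
from scrapyproject.spiders.osdir_spider import OsdirSpider
from scrapyproject.tests.utils import fake_response_from_file


class OsdirSpiderTest(unittest.TestCase):

    def setUp(self):
        self.spider = OsdirSpider()

    def test_parse(self):
        # Feed the spider a canned response and collect what it yields.
        results = list(self.spider.parse(
            fake_response_from_file('responses/osdir/sample.html')))
        self.assertTrue(results)
        for item in results:
            # Hypothetical field; assert whatever your items should contain.
            self.assertIsNotNone(item.get('link'))


if __name__ == '__main__':
    unittest.main()
```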
That's basically how I test my parsing methods, but it's not only for parsing methods. If it gets more complex I suggest looking at Mox.
Slightly simpler, by removing the `def fake_response_from_file` from the chosen answer:
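Presumably that means building the `HtmlResponse` inline in the test rather than going through a helper; a sketch, with the file path and URL as placeholders:

```python
from scrapy.http import HtmlResponse, Request

# Build the fake response directly in the test instead of via a helper.
url = 'http://www.example.com'
with open('scrapyproject/tests/responses/osdir/sample.html', 'rb') as f:
    body = f.read()
response = HtmlResponse(url=url, request=Request(url=url), body=body)
```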
I'm using Twisted's `trial` to run tests, similar to Scrapy's own tests. It already starts a reactor, so I make use of the `CrawlerRunner` without worrying about starting and stopping one in the tests.

Stealing some ideas from the `check` and `parse` Scrapy commands, I ended up with the following base `TestCase` class to run assertions against live sites:
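The class itself isn't reproduced here; a sketch of the idea, collecting scraped items through the `item_scraped` signal (the method name `crawl_items` is my own):

```python
from twisted.internet import defer
from twisted.trial import unittest

from scrapy import signals
from scrapy.crawler import CrawlerRunner


class SpiderTestCase(unittest.TestCase):
    """Base class: crawl a spider to completion and collect its output."""

    @defer.inlineCallbacks
    def crawl_items(self, spider_cls, **spider_kwargs):
        items = []

        def collect(item, response, spider):
            items.append(item)

        crawler = CrawlerRunner().create_crawler(spider_cls)
        # Record every item the spider yields during the crawl.
        crawler.signals.connect(collect, signal=signals.item_scraped)
        # trial already runs a reactor, so we only wait for the deferred.
        yield crawler.crawl(**spider_kwargs)
        defer.returnValue(items)
```

Example (`MySpider` is a hypothetical spider):

```python
class MySpiderTest(SpiderTestCase):

    @defer.inlineCallbacks
    def test_scrapes_items(self):
        items = yield self.crawl_items(MySpider)
        self.assertTrue(items)
        # Hypothetical field; assert on whatever your items contain.
        self.assertIn('title', items[0])
```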
Or perform one request in the setup and run multiple tests against the results:
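`trial` also lets `setUp` return a Deferred, so the crawl can finish before the assertions run (note that `setUp` runs before each test method; again, `MySpider` is hypothetical):

```python
class MySpiderResultsTest(SpiderTestCase):

    @defer.inlineCallbacks
    def setUp(self):
        # trial waits for this deferred before running each test method.
        self.items = yield self.crawl_items(MySpider)

    def test_yields_items(self):
        self.assertTrue(self.items)

    def test_every_item_has_a_title(self):
        for item in self.items:
            self.assertIn('title', item)
```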
I use Betamax to run the tests against the real site the first time and keep the HTTP responses locally, so the next runs are super fast.

When you need to get the latest version of the site, just remove what Betamax has recorded and re-run the tests.
Example:
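A sketch of how that can look, pairing Betamax's recorded `requests` session with a hand-built Scrapy response; the spider import and URL are placeholders:

```python
from betamax import Betamax
from betamax.fixtures.unittest import BetamaxTestCase
from scrapy.http import HtmlResponse

from myproject.spiders.example import ExampleSpider  # hypothetical spider

with Betamax.configure() as config:
    # Where Betamax stores recorded responses ("cassettes").
    config.cassette_library_dir = 'cassettes'
    config.preserve_exact_body_bytes = True


class TestExample(BetamaxTestCase):

    def test_parse(self):
        url = 'http://www.example.com/page-to-scrape'
        # self.session is a requests session that Betamax records/replays.
        response = self.session.get(url)
        # Wrap the recorded body in a Scrapy response for the spider.
        scrapy_response = HtmlResponse(body=response.content, url=url)
        results = list(ExampleSpider().parse(scrapy_response))
        self.assertTrue(results)
```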
FYI, I discovered Betamax at PyCon 2015 thanks to Ian Cordasco's talk.
You can follow this snippet from the Scrapy site to run Scrapy from a script. Then you can make whatever assertions you'd like on the returned items.
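The snippet boils down to `CrawlerProcess`; a sketch that collects items through the `item_scraped` signal so they can be asserted on afterwards (`ExampleSpider` is a placeholder):

```python
from scrapy import signals
from scrapy.crawler import CrawlerProcess

from myproject.spiders.example import ExampleSpider  # hypothetical spider

items = []

def collect(item, response, spider):
    items.append(item)

process = CrawlerProcess(settings={'LOG_LEVEL': 'WARNING'})
crawler = process.create_crawler(ExampleSpider)
crawler.signals.connect(collect, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks here until the crawl is finished

assert items, 'the spider should have yielded at least one item'
```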
The newly added Spider Contracts are worth trying. They give you a simple way to add tests without requiring a lot of code.
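Contracts live in the parse method's docstring and follow the pattern in the Scrapy docs; the URL and field names below are placeholders:

```python
def parse(self, response):
    """This function parses a sample response. Some contracts are
    mingled with this docstring.

    @url http://www.example.com/some-listing-page
    @returns items 1 16
    @returns requests 0 0
    @scrapes Title Author Year Price
    """
```

Run them with `scrapy check <spider-name>`.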