Question:
I'd like to implement some unit tests in Scrapy (a screen scraper/web crawler). Since a project is run through the "scrapy crawl" command, I can run it through something like nose. Since Scrapy is built on top of Twisted, can I use its unit testing framework, Trial? If so, how? Otherwise I'd like to get nose working.
Update:
I've been talking on Scrapy-Users and I guess I am supposed to "build the Response in the test code, and then call the method with the response and assert that [I] get the expected items/requests in the output". I can't seem to get this to work though.
I can build a unit test class and, in a test:
- create a response object
- try to call the parse method of my spider with the response object
However, it ends up generating a traceback. Any insight as to why?
Answer 1:
The way I've done it is to create fake responses; this way you can test the parse function offline, while still exercising it against real HTML.
A problem with this approach is that your local HTML file may not reflect the latest state online. If the HTML changes online, you may have a big bug while your test cases still pass, so this may not be the best way to test.
My current workflow is: whenever there is an error, I send an email to the admin with the URL. Then, for that specific error, I create an HTML file with the content that causes the error, and I create a unit test for it.
This is the code I use to create sample Scrapy HTTP responses for testing from a local HTML file:
# scrapyproject/tests/responses/__init__.py
import os

from scrapy.http import Response, Request

def fake_response_from_file(file_name, url=None):
    """
    Create a Scrapy fake HTTP response from an HTML file.

    @param file_name: The relative filename from the responses directory,
        but absolute paths are also accepted.
    @param url: The URL of the response.
    returns: A Scrapy HTTP response which can be used for unit testing.
    """
    if not url:
        url = 'http://www.example.com'

    request = Request(url=url)
    if not file_name[0] == '/':
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(responses_dir, file_name)
    else:
        file_path = file_name

    with open(file_path, 'r') as f:
        file_content = f.read()

    response = Response(url=url,
                        request=request,
                        body=file_content)
    response.encoding = 'utf-8'
    return response
The sample HTML file is located at scrapyproject/tests/responses/osdir/sample.html.
The test case, located at scrapyproject/tests/test_osdir.py, could then look as follows:
import unittest

from scrapyproject.spiders import osdir_spider
from responses import fake_response_from_file

class OsdirSpiderTest(unittest.TestCase):

    def setUp(self):
        self.spider = osdir_spider.DirectorySpider()

    def _test_item_results(self, results, expected_length):
        # parse() returns a generator, so consume it while counting items.
        count = 0
        for item in results:
            count += 1
            self.assertIsNotNone(item['content'])
            self.assertIsNotNone(item['title'])
        self.assertEqual(count, expected_length)

    def test_parse(self):
        results = self.spider.parse(fake_response_from_file('osdir/sample.html'))
        self._test_item_results(results, 10)
That's basically how I test my parsing methods, but it's not only for parsing methods. If it gets more complex, I suggest looking at Mox.
Answer 2:
The newly added Spider Contracts are worth trying. They give you a simple way to add tests without requiring a lot of code.
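A contract is just a few annotations in a callback's docstring, verified with the scrapy check command. A minimal sketch of what that looks like (the URL, selectors, and field names below are assumptions for illustration, not from the question):

from scrapy import Spider

class DirectorySpider(Spider):
    name = 'directory'

    def parse(self, response):
        """Verified by running: scrapy check directory

        @url http://www.example.com/directory
        @returns items 1 10
        @returns requests 0 0
        @scrapes title content
        """
        # Hypothetical markup: one item per div.entry on the page.
        for entry in response.xpath('//div[@class="entry"]'):
            yield {
                'title': entry.xpath('.//h2/text()').extract_first(),
                'content': entry.xpath('.//p/text()').extract_first(),
            }

Here @url gives the page to fetch, @returns bounds how many items and requests the callback may yield, and @scrapes asserts that every returned item has the named fields.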
Answer 3:
I use Betamax to run tests against the real site the first time and keep the HTTP responses locally, so that subsequent tests run very fast:

Betamax intercepts every request you make and attempts to find a matching request that has already been intercepted and recorded.

When you need to get the latest version of the site, just remove what Betamax has recorded and re-run the tests.
Example:

from scrapy import Spider, Request
from scrapy.http import HtmlResponse

class Example(Spider):
    name = 'example'

    url = 'http://doc.scrapy.org/en/latest/_static/selectors-sample1.html'

    def start_requests(self):
        yield Request(self.url, self.parse)

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield {'image_href': href}

# Test part
from betamax import Betamax
from betamax.fixtures.unittest import BetamaxTestCase

with Betamax.configure() as config:
    # where betamax will store cassettes (http responses):
    config.cassette_library_dir = 'cassettes'
    config.preserve_exact_body_bytes = True

class TestExample(BetamaxTestCase):  # superclass provides self.session

    def test_parse(self):
        example = Example()

        # http response is recorded in a betamax cassette:
        response = self.session.get(example.url)

        # forge a scrapy response to test
        scrapy_response = HtmlResponse(body=response.content, url=example.url)

        result = example.parse(scrapy_response)

        self.assertEqual({'image_href': u'image1.html'}, next(result))
        self.assertEqual({'image_href': u'image2.html'}, next(result))
        self.assertEqual({'image_href': u'image3.html'}, next(result))
        self.assertEqual({'image_href': u'image4.html'}, next(result))
        self.assertEqual({'image_href': u'image5.html'}, next(result))

        with self.assertRaises(StopIteration):
            next(result)
FYI, I discovered Betamax at PyCon 2015 thanks to Ian Cordasco's talk.
Answer 4:
I'm using Scrapy 1.3.0, and the function fake_response_from_file raises an error on this line:

response = Response(url=url, request=request, body=file_content)

I get:

raise AttributeError("Response content isn't text")

The solution is to use TextResponse instead, and it works OK. For example:

response = TextResponse(url=url, request=request, body=file_content)
Thanks a lot.
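For completeness, here is a sketch of Answer 1's helper with that change applied, under the same assumptions (default URL, responses directory layout):

import os

from scrapy.http import Request, TextResponse

def fake_response_from_file(file_name, url='http://www.example.com'):
    request = Request(url=url)
    if not file_name.startswith('/'):
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(responses_dir, file_name)
    else:
        file_path = file_name

    with open(file_path, 'rb') as f:
        body = f.read()

    # TextResponse carries an encoding; a plain Response does not,
    # which is what triggers "Response content isn't text" above.
    return TextResponse(url=url, request=request, body=body, encoding='utf-8')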
Answer 5:
You can follow this snippet from the Scrapy site to run it from a script. Then you can make any kind of assertions you'd like on the returned items.
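A minimal sketch of that approach, collecting the scraped items through Scrapy's standard item_scraped signal so a test can assert on them (MySpider, its module path, and the final assertion are placeholders):

from scrapy import signals
from scrapy.crawler import CrawlerProcess

from scrapyproject.spiders.my_spider import MySpider  # hypothetical spider

items = []

def collect_item(item, response, spider):
    items.append(item)

process = CrawlerProcess(settings={'LOG_ENABLED': False})
crawler = process.create_crawler(MySpider)
# item_scraped fires once for every item the spider yields
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

assert len(items) > 0  # placeholder assertion

Note that process.start() runs the Twisted reactor, which can only be started once per process, so this pattern suits one crawl per test run.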
Answer 6:
Slightly simpler than the chosen answer, by removing fake_response_from_file entirely:
import unittest

from scrapy.selector import Selector

from spiders.my_spider import MySpider

class TestParsers(unittest.TestCase):

    def setUp(self):
        self.spider = MySpider(limit=1)
        with open("some.htm", 'r') as f:
            self.html = Selector(text=f.read())

    def test_some_parse(self):
        expected = "some-text"
        result = self.spider.some_parse(self.html)
        self.assertEqual(result, expected)

if __name__ == '__main__':
    unittest.main()
Answer 7:
I'm using Twisted's trial to run tests, similar to Scrapy's own tests. It already starts a reactor, so I make use of the CrawlerRunner without worrying about starting and stopping one in the tests.

Stealing some ideas from the check and parse Scrapy commands, I ended up with the following base TestCase class to run assertions against live sites:
from twisted.internet import defer
from twisted.trial import unittest

from scrapy.crawler import CrawlerRunner
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.spider import iterate_spider_output

class SpiderTestCase(unittest.TestCase):

    def setUp(self):
        self.runner = CrawlerRunner()

    def make_test_class(self, cls, url):
        """
        Make a class that proxies to the original class,
        sets up a URL to be called, and gathers the items
        and requests returned by the parse function.
        """
        class TestSpider(cls):
            # This is a once used class, so writing into
            # the class variables is fine. The framework
            # will instantiate it, not us.
            items = []
            requests = []

            def start_requests(self):
                req = super(TestSpider, self).make_requests_from_url(url)
                req.meta["_callback"] = req.callback or self.parse
                req.callback = self.collect_output
                yield req

            def collect_output(self, response):
                try:
                    cb = response.request.meta["_callback"]
                    for x in iterate_spider_output(cb(response)):
                        if isinstance(x, (BaseItem, dict)):
                            self.items.append(x)
                        elif isinstance(x, Request):
                            self.requests.append(x)
                except Exception as ex:
                    print("ERROR", "Could not execute callback: ", ex)
                    raise

                # Returning any requests here would make the crawler follow them.
                return None

        return TestSpider
Example:
@defer.inlineCallbacks
def test_foo(self):
    tester = self.make_test_class(FooSpider, 'https://foo.com')
    yield self.runner.crawl(tester)
    self.assertEqual(len(tester.items), 1)
    self.assertEqual(len(tester.requests), 2)
Or perform one request in the setup and run multiple tests against the results:
class FooTestCase(SpiderTestCase):
    # Shared across tests, so the crawl runs only once.
    tester = None

    @defer.inlineCallbacks
    def setUp(self):
        super(FooTestCase, self).setUp()
        if FooTestCase.tester is None:
            FooTestCase.tester = self.make_test_class(FooSpider, 'https://foo.com')
            yield self.runner.crawl(self.tester)

    def test_foo(self):
        self.assertEqual(len(self.tester.items), 1)
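These cases need Twisted's trial runner rather than plain unittest, since trial manages the reactor. Assuming a hypothetical module path, the invocation would look something like:

trial scrapyproject.tests.test_foo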