Django-dynamic-scraper unable to scrape the data

Published 2019-08-08 08:23

Question:

I am new to dynamic scraper, and I have been working through the sample project open_news to learn it. I have everything set up, but it keeps showing the same error: dynamic_scraper.models.DoesNotExist: RequestPageType matching query does not exist.

2015-11-20 18:45:11+0000 [article_spider] ERROR: Spider error processing <GET https://en.wikinews.org/wiki/Main_page>
Traceback (most recent call last):
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 825, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/task.py", line 645, in _tick
    taskObj._oneWorkUnit()
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/task.py", line 491, in _oneWorkUnit
    result = next(self._iterator)
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
    yield next(it)
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
    for x in result:
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/dynamic_scraper/spiders/django_spider.py", line 378, in parse
    rpt = self.scraper.get_rpt_for_scraped_obj_attr(url_elem.scraped_obj_attr)
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/dynamic_scraper/models.py", line 98, in get_rpt_for_scraped_obj_attr
    return self.requestpagetype_set.get(scraped_obj_attr=soa)
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Django-1.8.5-py2.7.egg/django/db/models/manager.py", line 127, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Django-1.8.5-py2.7.egg/django/db/models/query.py", line 334, in get
    self.model._meta.object_name
dynamic_scraper.models.DoesNotExist: RequestPageType matching query does not exist.

Answer 1:

This is caused by missing "REQUEST PAGE TYPES". Each entry in "SCRAPER ELEMS" must have its own "REQUEST PAGE TYPE".
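Conceptually, for every scraper element the spider asks the database for a matching RequestPageType row, and Django's .get() raises DoesNotExist when no row matches. A minimal stand-alone sketch of that lookup (plain Python, no Django; the data and function name are illustrative, not the real dynamic_scraper API):

```python
class DoesNotExist(Exception):
    """Stand-in for Django's Model.DoesNotExist."""

# Illustrative stand-ins for the admin-configured rows:
# four scraper elems, but only one request page type created so far.
request_page_types = {"base": "Main Page"}

def get_rpt_for_scraped_obj_attr(attr):
    # dynamic_scraper does roughly:
    #   self.requestpagetype_set.get(scraped_obj_attr=soa)
    try:
        return request_page_types[attr]
    except KeyError:
        raise DoesNotExist("RequestPageType matching query does not exist.")

# "title" has no RequestPageType yet -> same error the spider logs
try:
    get_rpt_for_scraped_obj_attr("title")
except DoesNotExist as e:
    print(e)  # RequestPageType matching query does not exist.
```

Creating one RequestPageType per scraper elem in the admin, as described below, is what makes every lookup succeed.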

To solve this problem, please follow the steps below:

  1. Log in to the admin page (usually http://localhost:8000/admin/)
  2. Go to Home › Dynamic_Scraper › Scrapers › Wikinews Scraper (Article)
  3. Click on "Add another Request page type" under "REQUEST PAGE TYPES"
  4. Create 4 "REQUEST PAGE TYPES" in total, one for each of "(base (Article))", "(title (Article))", "(description (Article))" and "(url (Article))"

"REQUEST PAGE TYPES" Settings

All "Content type" fields are "HTML"

All "Request type" fields are "Request"

All "Method" fields are "Get"

For "Page type", assign them in sequence like this:

(base (Article)) | Main Page

(title (Article)) | Detail Page 1

(description (Article)) | Detail Page 2

(url (Article)) | Detail Page 3
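The invariant behind the error is that every scraper elem must map to exactly one request page type. A small self-contained completeness check mirroring the rows above (plain Python with illustrative data, not the real models):

```python
# Page-type assignment from the table above (illustrative data only).
page_types = {
    "base (Article)": "Main Page",
    "title (Article)": "Detail Page 1",
    "description (Article)": "Detail Page 2",
    "url (Article)": "Detail Page 3",
}

scraper_elems = ["base (Article)", "title (Article)",
                 "description (Article)", "url (Article)"]

# dynamic_scraper raises DoesNotExist for any elem without a page
# type, so every elem must appear as a key in the mapping.
missing = [e for e in scraper_elems if e not in page_types]
assert not missing, "Missing REQUEST PAGE TYPE for: %s" % missing
print("all scraper elems have a request page type")
```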

After the steps above, the "DoesNotExist: RequestPageType" error should be fixed.

However, a new error would come up: "ERROR: Mandatory elem title missing!"

To solve this, I suggest changing the "REQUEST PAGE TYPE" of every entry in "SCRAPER ELEMS" to "Main Page", including "title (Article)".

Then change the XPath values as follows:

(base (Article)) | //td[@class="l_box"]

(title (Article)) | span[@class="l_title"]/a/@title

(description (Article)) | p/span[@class="l_summary"]/text()

(url (Article)) | span[@class="l_title"]/a/@href
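The idea is that the base XPath selects each listing cell, and the other three XPaths are evaluated relative to that cell. The sketch below demonstrates this against a tiny made-up stand-in for the Wikinews markup (the real page structure may differ), using the stdlib ElementTree, which supports only a subset of XPath, so attribute values are read with .get() and text() with .text:

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for one listing cell; structure matches the
# XPaths in this answer, not necessarily the live Wikinews HTML.
html = """
<table><tr>
  <td class="l_box">
    <span class="l_title">
      <a title="Example story" href="/wiki/Example_story">Example story</a>
    </span>
    <p><span class="l_summary">A short summary.</span></p>
  </td>
</tr></table>
"""

root = ET.fromstring(html)
for box in root.findall(".//td[@class='l_box']"):           # base elem
    link = box.find("span[@class='l_title']/a")
    title = link.get("title")                               # title elem
    url = link.get("href")                                  # url elem
    summary = box.find("p/span[@class='l_summary']").text   # description elem
    print(title, url, summary)
```

Because title, description, and url all resolve relative to the base cell, they must share the base elem's page type, which is why the answer sets every elem to "Main Page".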

Finally, run scrapy crawl article_spider -a id=1 -a do_action=yes at the command prompt. You should now be able to crawl the "Article" items. You can check them under Home › Open_News › Articles.

Enjoy~