Portia Spider logs showing ['Partial'] dur

I have created a spider using the Portia web scraper, and the start URL is:

https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs

When I schedule this spider in scrapyd, I get the following log output:

DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (referer: None) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=2> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=21805&CurrentPage=1> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs) ['partial']

What does ['partial'] mean, and why is the content of the page not scraped by the spider?

Tags: python web-scraping scrapy scrapyd portia

1 answer

迷人小祖宗

#2 · 2020-03-06 04:20

A late answer, but hopefully still useful, since this Scrapy behavior doesn't seem to be well documented. Judging by the relevant line in the Scrapy source, the 'partial' flag is set when the download encounters Twisted's PotentialDataLoss error. According to the corresponding Twisted documentation:

This only occurs when making requests to HTTP servers which do not set Content-Length or a Transfer-Encoding in the response

Possible causes include:

The server is misconfigured
There's a proxy involved that's blocking some headers
You get a response that doesn't normally have Content-Length, e.g. redirects (301, 302, 303), but you've set handle_httpstatus_list or handle_httpstatus_all such that the response doesn't get filtered out by HttpErrorMiddleware or fetched by RedirectMiddleware
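
To check which of these causes applies, it may help to look at the flags and headers of the responses you get back. The following is only a rough sketch using the standard library: the helper names lacks_length_framing and is_partial are made up for illustration, and the SimpleNamespace object is a stand-in for a real Scrapy response (which, as far as I know, exposes flags as a list and header names as bytes).

```python
from types import SimpleNamespace


def _normalize(name):
    # Header names may arrive as bytes (as in Scrapy) or as str.
    if isinstance(name, bytes):
        name = name.decode("latin-1")
    return str(name).lower()


def lacks_length_framing(headers):
    """Return True if neither Content-Length nor Transfer-Encoding is set.

    This is the situation the Twisted documentation describes: the end of
    the body is only signalled by the connection being closed.
    """
    names = {_normalize(name) for name in headers}
    return "content-length" not in names and "transfer-encoding" not in names


def is_partial(response):
    """Return True if a Scrapy-style response carries the 'partial' flag."""
    return "partial" in getattr(response, "flags", [])


# Hypothetical stand-in for a response; the values are invented.
fake_response = SimpleNamespace(
    flags=["partial"],
    headers={b"Content-Type": b"text/html"},
)

print(is_partial(fake_response), lacks_length_framing(fake_response.headers))
```

In a spider callback you could pass the real response to the same two helpers and log the result. If the flag is set and both headers are missing, the server (or a proxy in between) is the likely culprit rather than the spider itself.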
