I have created a spider using Portia web scraper and the start URL is
https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs
While scheduling this spider in scrapyd I am getting
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (referer: None) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=2> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=21805&CurrentPage=1> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs) ['partial']`<br><br>
What does the ['partial']
mean and why the content from the page is not scraped by the spdier?
Late answer, but hopefully not useless, since this behavior by scrapy doesn't seem well-documented. Looking at this line of code from the scrapy source, the
partial
flag is set when the request encounters a Twisted PotentialDataLoss error. According to the corresponding Twisted documentation:Possible causes include:
handle_httpstatus_list
orhandle_httpstatus_all
such that the response doesn't get filtered out by HttpErrorMiddleware or fetched by RedirectMiddleware