I noticed that docplayer.net embeds many PDFs. Example: http://docplayer.net/72489212-Excellence-in-prevention-descriptions-of-the-prevention-programs-and-strategies-with-the-greatest-evidence-of-success.html
How can these PDFs be extracted (i.e. downloaded) in an automated workflow?
In the browser's developer tools, under the Network/XHR tab, you can see that the actual document is being requested. In your particular case it's at the URL http://docplayer.net/storage/75/72489212/72489212.pdf. Now you can look into the page source to see whether this URL can be inferred somehow. The XPath //iframe[@id="player_frame"]/@src looks helpful. I haven't checked with other pages, but something like this might work (as part of your parse method):
...
url_template = 'http://docplayer.net/storage/{0}/{1}/{1}.pdf'
# the iframe src contains a path like /docview/<dir>/<doc_id>/...;
# capture both parts and plug them into the storage URL template
ids = response.xpath('//iframe[@id="player_frame"]/@src').re(r'/docview/([^/]+)/([^/]+)/')
if ids:
    file_url = url_template.format(*ids)
    yield scrapy.Request(file_url, callback=self.parse_pdf)
...
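The parse_pdf callback is not shown above; a minimal sketch that simply writes the response body to disk could look like the following (the filename scheme here is my assumption, not anything dictated by the site):

def parse_pdf(self, response):
    # assumption: the last path segment of the URL is '<doc_id>.pdf',
    # so it makes a reasonable local filename
    filename = response.url.split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(response.body)  # raw PDF bytes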
As you pointed out, grabbing the URL alone results in a 403 Forbidden. There are two headers you also need: "s" and "ex".
To get these in Firefox, open the Network tab in the inspector, right-click the PDF request, and select "Copy... Copy as cURL". The resulting curl command is the exact request the browser would have made to fetch the resource. In addition to the "s" and "ex" headers, you will also notice a "Range" header; make sure to remove this one, unless you only want to download part of the file. The remaining headers are not relevant.
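Tying this back to the Scrapy snippet in the other answer, you could pass the copied headers along with the request. A minimal sketch, assuming you substitute the "s" and "ex" values taken from your own "Copy as cURL" output:

headers = {
    's': '<value from your copied curl command>',   # placeholder
    'ex': '<value from your copied curl command>',  # placeholder
    # note: no 'Range' header, so the server returns the whole file
}
yield scrapy.Request(file_url, headers=headers, callback=self.parse_pdf)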
I will not post the resulting direct link to the PDF here, but I did test it and was able to download the entire file with this technique.