I noticed that docplayer.net embeds many PDFs. Example: http://docplayer.net/72489212-Excellence-in-prevention-descriptions-of-the-prevention-programs-and-strategies-with-the-greatest-evidence-of-success.html
How can these PDFs be extracted (i.e. downloaded) in an automated workflow?
In the browser's developer tools, under the Network/XHR tab, you can see that the actual document is being requested. In your particular case it's at the URL http://docplayer.net/storage/75/72489212/72489212.pdf. Now you can look into the page source to see whether this URL can be inferred somehow. The XPath //iframe[@id="player_frame"]/@src looks helpful. I haven't checked with other pages, but something like this might work (as part of your parse method):
...
url_template = 'http://docplayer.net/storage/{0}/{1}/{1}.pdf'
# the iframe src contains a path like /docview/<dir>/<doc_id>/...;
# capture both parts and plug them into the storage URL template
ids = response.xpath('//iframe[@id="player_frame"]/@src').re(r'/docview/([^/]+)/([^/]+)/')
if ids:
    file_url = url_template.format(*ids)
    yield scrapy.Request(file_url, callback=self.parse_pdf)
...
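The parse_pdf callback is not shown above; a minimal sketch that simply writes the response body to disk could look like the following (the filename scheme here is my assumption, not anything dictated by the site):

def parse_pdf(self, response):
    # assumption: the last path segment of the URL is '<doc_id>.pdf',
    # so it makes a reasonable local filename
    filename = response.url.split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(response.body)  # raw PDF bytes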
As you pointed out, grabbing the URL alone results in a 403 Forbidden. There are two headers you also need: "s" and "ex".
To get these in Firefox, open the Network tab in the inspector, right-click the PDF request, and select "Copy... Copy as cURL". The resulting curl command is the exact request the browser would have made to fetch the resource. In addition to the "s" and "ex" headers, you will also notice a "Range" header; make sure to remove this one, unless you only want to download part of the file. The remaining headers are not relevant.
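Tying this back to the Scrapy snippet in the other answer, you could pass the copied headers along with the request. A minimal sketch, assuming you substitute the "s" and "ex" values taken from your own "Copy as cURL" output:

headers = {
    's': '<value from your copied curl command>',   # placeholder
    'ex': '<value from your copied curl command>',  # placeholder
    # note: no 'Range' header, so the server returns the whole file
}
yield scrapy.Request(file_url, headers=headers, callback=self.parse_pdf)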
I will not post the resulting direct link to the PDF here, but I did test it and was able to download the entire file with this technique.