Apologies if this is too ignorant a question or has been asked before. A cursory look did not find anything matching this exactly. The question is: how can I download all Word documents that Google has indexed? It would be a daunting task indeed to do it by hand... Thanks for all pointers.
Tags: web-crawler
I'm afraid there is no legal way to do it. Google formerly offered a SOAP API for its web search, but it has been deprecated and is scheduled to shut down this summer. It was limited to 1,000 queries per day.

Currently Google provides an AJAX Search API, but that is no solution for you either: its largest result set contains only 8 results per request.
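To illustrate the limitation, here is a minimal sketch of what a request to that AJAX Search API looked like. The endpoint and parameter names follow its documented REST interface; the `filetype:doc` operator (standard Google query syntax) restricts matches to Word documents, and `rsz=large` requests the maximum page size of just 8 results:

```python
from urllib.parse import urlencode

# Sketch only: builds a request URL for the (deprecated) Google AJAX
# Search API. Fetching it today will not work; this just shows the shape
# of the interface and why 8-results-per-page makes bulk download hopeless.
def ajax_search_url(query, start=0):
    params = urlencode({
        "v": "1.0",      # required API version
        "q": query,      # e.g. "filetype:doc" to match Word documents
        "rsz": "large",  # the largest allowed page size: 8 results
        "start": start,  # offset of the first result in the page
    })
    return "http://ajax.googleapis.com/ajax/services/search/web?" + params

url = ajax_search_url("filetype:doc", start=0)
```

Even paging through with `start` would not get far, given the per-day and result-depth caps mentioned above.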
And finally, there is the standard web form at google.com, which you are prohibited from querying programmatically. (There is also the limitation that Google only returns the first thousand results for any query; you cannot see more than that.)

If you want to build a service on this, you could contact Google and arrange a partnership with them.