I am trying out the Scrapy framework to extract some information from LinkedIn. I am aware that they are very strict with people trying to crawl their website, so I tried a different user agent in my settings.py. I also specified a high download delay, but it still seems to block me right off the bat.
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 2
REDIRECT_ENABLED = False
RETRY_ENABLED = False
DEPTH_LIMIT = 5
DOWNLOAD_TIMEOUT = 10
REACTOR_THREADPOOL_MAXSIZE = 20
CONCURRENT_REQUESTS_PER_DOMAIN = 2
COOKIES_ENABLED = False
HTTPCACHE_ENABLED = True
This is the error I receive:
2017-03-20 19:11:29 [scrapy.core.engine] INFO: Spider opened
2017-03-20 19:11:29 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-20 19:11:29 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-20 19:11:29 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-1/> (referer: None) ['cached']
2017-03-20 19:11:29 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-1/>: HTTP status code is not handled or not allowed
2017-03-20 19:11:29 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-20 19:11:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2372,
'downloader/response_count': 1,
'downloader/response_status_count/999': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 3, 20, 17, 11, 29, 503000),
'httpcache/hit': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 8,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 3, 20, 17, 11, 29, 378000)}
2017-03-20 19:11:29 [scrapy.core.engine] INFO: Spider closed (finished)
The spider itself just prints the visited URL.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class InfoSpider(CrawlSpider):
    name = "info"
    allowed_domains = ["www.linkedin.com"]
    start_urls = ['https://www.linkedin.com/directory/people-1/']

    rules = [
        Rule(LinkExtractor(allow=[r'.*']),
             callback='parse',
             follow=True)
    ]

    def parse(self, response):
        print(response.url)
Look carefully at the headers in your requests. LinkedIn requires browser-like headers in each request before it will serve a response. You can refer to this documentation for more information.
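As a sketch, this is roughly what that can look like in settings.py using Scrapy's DEFAULT_REQUEST_HEADERS setting; the specific values below are assumptions (typical headers a real browser sends), not a list confirmed by LinkedIn:

# settings.py -- a sketch; the exact headers LinkedIn checks are an assumption,
# these are simply what a real browser would send alongside the User-Agent.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
}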
You have to log in to LinkedIn before crawling any other pages. To log in with Scrapy, you can refer to https://doc.scrapy.org/en/latest/topics/request-response.html#formrequest-objects (a sketch of this is shown after the update below).
UPDATE 1: here is an example of my code.
Do remember to call self.initialized() if you are using InitSpider, or the parse() method won't be called.
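For illustration, here is a minimal sketch of that InitSpider login flow. The login URL and the session_key/session_password form field names are assumptions based on LinkedIn's login form at the time (they may have changed), and the credentials are placeholders:

from scrapy import FormRequest, Request
from scrapy.spiders.init import InitSpider

class LinkedInSpider(InitSpider):
    name = 'linkedin'
    login_page = 'https://www.linkedin.com/uas/login'  # assumed login URL
    start_urls = ['https://www.linkedin.com/directory/people-1/']

    def init_request(self):
        # Fetch the login page before the normal crawl starts.
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Fill in and submit the login form; the field names are assumptions.
        return FormRequest.from_response(
            response,
            formdata={'session_key': 'you@example.com',
                      'session_password': 'your_password'},
            callback=self.check_login)

    def check_login(self, response):
        # Hand control back to the spider; without this call,
        # parse() is never reached.
        return self.initialized()

    def parse(self, response):
        print(response.url)

Note that a logged-in session depends on cookies, so COOKIES_ENABLED = False from the settings above would have to be removed for this to work.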