999 response when trying to crawl LinkedIn with Scrapy

Posted 2020-06-27 06:14

I am trying to use the Scrapy framework to extract some information from LinkedIn. I am aware that they are very strict with people who crawl their website, so I tried a different user agent in my settings.py. I also specified a high download delay, but it still seems to block me right off the bat.

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 2
REDIRECT_ENABLED = False
RETRY_ENABLED = False
DEPTH_LIMIT = 5
DOWNLOAD_TIMEOUT = 10
REACTOR_THREADPOOL_MAXSIZE = 20
CONCURRENT_REQUESTS_PER_DOMAIN = 2
COOKIES_ENABLED = False
HTTPCACHE_ENABLED = True

This is the error I receive:

2017-03-20 19:11:29 [scrapy.core.engine] INFO: Spider opened
2017-03-20 19:11:29 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min)
2017-03-20 19:11:29 [scrapy.extensions.telnet] DEBUG: Telnet console listening on
127.0.0.1:6023
2017-03-20 19:11:29 [scrapy.core.engine] DEBUG: Crawled (999) <GET
https://www.linkedin.com/directory/people-1/> (referer: None) ['cached']
2017-03-20 19:11:29 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response
<999 https://www.linkedin.com/directory/people-1/>: HTTP status code is not handled or 
not allowed
2017-03-20 19:11:29 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-20 19:11:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2372,
'downloader/response_count': 1,
'downloader/response_status_count/999': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 3, 20, 17, 11, 29, 503000),
'httpcache/hit': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 8,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 3, 20, 17, 11, 29, 378000)}
2017-03-20 19:11:29 [scrapy.core.engine] INFO: Spider closed (finished)

The spider itself just prints the visited URL.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class InfoSpider(CrawlSpider):
    name = "info"
    allowed_domains = ["www.linkedin.com"]
    start_urls = ['https://www.linkedin.com/directory/people-1/']
    rules = [
        Rule(LinkExtractor(allow=[r'.*']),
             callback='parse',
             follow=True)
    ]

    def parse(self, response):
        # Just print each URL the crawler visits.
        print(response.url)

2 Answers

Aperson
Answered 2020-06-27 06:29

Look carefully at the headers in your requests. LinkedIn requires the following headers in each request before it will serve a response.

headers = {
    "accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding" : "gzip, deflate, sdch, br",
    "accept-language" : "en-US,en;q=0.8,ms;q=0.6",
    "user-agent" : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
}

You can refer to this documentation for more information.
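One way to apply these from settings.py (as in the question) is Scrapy's built-in DEFAULT_REQUEST_HEADERS setting. The sketch below simply reuses the header values above; the user agent goes in the dedicated USER_AGENT setting, and Accept-Encoding is left out because Scrapy's compression middleware manages it itself.

# settings.py -- a minimal sketch reusing the header values from this answer
DEFAULT_REQUEST_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.8,ms;q=0.6",
}
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"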

一纸荒年 Trace。
Answered 2020-06-27 06:38

You have to log in to LinkedIn before crawling any other pages. To log in with Scrapy, you can refer to https://doc.scrapy.org/en/latest/topics/request-response.html#formrequest-objects

UPDATE 1: Here is an example of my code.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request, FormRequest
from linkedin.items import *

class LinkedinSpider(InitSpider):
    """
    Define the crawler's start URIs, set its follow rules, parse HTML
    and assign values to an item. Processing occurs in ../pipelines.py
    """

    name = "linkedin"
    allowed_domains = ["linkedin.com"]
    user_name = 'my_user_name'
    passwd = 'my_passwd'

    # Uncomment the following lines for full spidering
    # start_urls = ["http://www.linkedin.com/directory/people-%s-%d-%d-%d"
    #               % (alphanum, num_one, num_two, num_three)
    #                 for alphanum in "abcdefghijklmnopqrstuvwxyz"
    #                 for num_one in xrange(1,11)
    #                 for num_two in xrange(1,11)
    #                 for num_three in xrange(1,11)
    #               ]

    # Temporary start_urls for testing; remove and use the above start_urls in production
    # start_urls = ["http://www.linkedin.com/directory/people-a-23-23-2"]
    start_urls = ["https://www.linkedin.com/in/rebecca-liu-93a12a28/"]
    login_page = 'https://www.linkedin.com/uas/login'

    # TODO: allow /in/name urls too?
    # rules = (
    #     Rule(SgmlLinkExtractor(allow=('\/pub\/.+')),
    #          callback='parse_item'),
    # )

    def init_request(self):
        # InitSpider calls this before crawling start_urls; fetch the login page first.
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Fill in and submit LinkedIn's login form.
        return FormRequest.from_response(
            response,
            formdata={'session_key': self.user_name,
                      'session_password': self.passwd},
            callback=self.check_login_response)

    def check_login_response(self, response):
        # Tell InitSpider that initialization is done so crawling can begin.
        return self.initialized()

Do remember to call self.initialized() if you are using InitSpider, or the parse() method won't be called.
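If you would rather not depend on InitSpider, the same login-then-crawl flow can also be written with a plain Spider and start_requests(). The sketch below is only an illustration, not tested against the current LinkedIn login page: the spider name is made up, and it assumes the form still uses the session_key / session_password fields shown above. Also note that COOKIES_ENABLED must stay at its default of True (the question's settings disable it), or the login session will not persist.

import scrapy
from scrapy.http import FormRequest

class LoginSpider(scrapy.Spider):
    # Hypothetical spider name, for illustration only.
    name = "linkedin_login"
    allowed_domains = ["linkedin.com"]
    login_page = "https://www.linkedin.com/uas/login"
    start_urls = ["https://www.linkedin.com/directory/people-1/"]

    def start_requests(self):
        # Fetch the login page before anything else.
        yield scrapy.Request(self.login_page, callback=self.login)

    def login(self, response):
        # Submit the login form; field names taken from the code above.
        yield FormRequest.from_response(
            response,
            formdata={"session_key": "my_user_name",
                      "session_password": "my_passwd"},
            callback=self.after_login)

    def after_login(self, response):
        # Once logged in, schedule the real crawl.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        print(response.url)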
