DNS lookup failed: address 'your.proxy.com'


Question:

This question is an extension of the resolved question Crawling LinkedIn while authenticated with Scrapy, by @Gates.

I keep the base of the script the same, only adding my own session_key and session_password, and changing the start URL for my use case, as below.

class LinkedPySpider(InitSpider):
    name = 'Linkedin'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/nhome/"]

[Also tried with this start URL]
start_urls = ["http://www.linkedin.com/profile/view?id=38210724&trk=nav_responsive_tab_profile"]

I also tried changing start_urls to the second one above, to see if I could start scraping from my own profile page, but I was unable to do so.
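For reference, a minimal sketch of the full login flow from the linked answer by @Gates, assuming the Scrapy 0.14-era import paths; the credentials are placeholders, and the "Sign Out" check is a heuristic rather than an official API:

from scrapy.contrib.spiders.init import InitSpider  # 0.14-era import path
from scrapy.http import Request, FormRequest

class LinkedPySpider(InitSpider):
    name = 'Linkedin'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/nhome/"]

    def init_request(self):
        # Fetch the login page before any of the start_urls are requested.
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Fill and submit the login form; the credentials are placeholders.
        return FormRequest.from_response(
            response,
            formdata={'session_key': 'user@example.com',
                      'session_password': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        # Only hand control back to the crawl once login has succeeded;
        # "Sign Out" appearing in the body is a rough logged-in signal.
        if 'Sign Out' in response.body:
            self.log('Logged in, resuming crawl.')
            return self.initialized()
        self.log('Login failed.')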

**Error that I get** - 
scrapy crawl Linkedin
**2013-07-29 11:37:10+0530 [Linkedin] DEBUG: Retrying <GET http://www.linkedin.com/nhome/> (failed 1 times): DNS lookup failed: address 'your.proxy.com' not found: [Errno -5] No address associated with hostname.**


**To see if the name was resolving, I tried:**
nslookup www.linkedin.com # works
nslookup www.linkedin.com/uas/login # nslookup resolves hostnames, not URL paths, so this failing is normal, right?
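The same distinction can be checked stand-alone from Python, a minimal sketch using only the standard library: the resolver accepts bare hostnames but rejects URL paths and unknown hosts alike.

import socket

for host in ["www.linkedin.com",
             "www.linkedin.com/uas/login",   # path included: not a hostname
             "your.proxy.com"]:              # the placeholder from $http_proxy
    try:
        print("%s -> %s" % (host, socket.gethostbyname(host)))
    except socket.gaierror as e:
        # The last two raise gaierror, matching the DNS failure Scrapy logged.
        print("%s -> %s" % (host, e))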

Then I also tried to rule out a nameserver resolution problem by appending nameservers, as below.
echo $http_proxy # gives http://username:password@your.proxy.com:80
sudo vi /etc/resolv.conf
and appended the IP addresses of free, fast DNS nameservers to this file:
nameserver 208.67.222.222
nameserver 208.67.220.220
nameserver 202.51.5.52

I am not well versed in nameserver conflicts and DNS lookup failures, but could this be because I am in a VM? Other scraping projects seemed to work just fine, though.

My base use case is to extract connections and the list of companies they worked at, along with a bunch of other attributes. So I want to crawl/paginate from "Connections" (All) on the main profile page, which does NOT show up if I use a public profile in the start URL, i.e. scrapy shell http://www.linkedin.com/in/ektagrover. Passing a legitimate XPath via hxs.select seems to work there, but NOT when I use it with the spider, since it does not meet my base use case (as below).

Question: Is there something wrong with my start_url, or am I wrong to assume that, after authenticating at https://www.linkedin.com/uas/login, the spider can be redirected to potentially ANY page on the site?

Work environment: Oracle VM VirtualBox running Ubuntu 12.04 LTS, Python 2.7.3, Scrapy 0.14.4.

What worked / Answer: It looks like my proxy server was set incorrectly. echo $http_proxy gives http://username:password@your.proxy.com:80, so I unset the environment variable by running "http_proxy=", then confirmed with echo $http_proxy, which now prints nothing. After that, scrapy crawl Linkedin worked through the authentication module. I am still getting stuck here and there with Selenium, but that's for another question. Thank you, @warwaruk.
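For reference, a minimal sketch of the same fix done from Python before launching the crawl; Scrapy's HttpProxyMiddleware picks proxies up from the environment at startup, so clearing the variables for the current process has the same effect as "http_proxy=" in the shell:

import os

# Clear proxy variables for this process so Scrapy's HttpProxyMiddleware
# does not route requests through the bogus placeholder proxy.
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    os.environ.pop(var, None)

print(os.environ.get("http_proxy"))  # None confirms the proxy is unset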

Answer 1:

**Error that I get** - 
scrapy crawl Linkedin
**2013-07-29 11:37:10+0530 [Linkedin] DEBUG: Retrying <GET http://www.linkedin.com/nhome/> (failed 1 times): DNS lookup failed: address 'your.proxy.com' not found: [Errno -5] No address associated with hostname.**


**To see if the name was resolving, I tried:**
nslookup www.linkedin.com # works
nslookup www.linkedin.com/uas/login # nslookup resolves hostnames, not URL paths, so this failing is normal, right?

Then I also tried to rule out a nameserver resolution problem by appending nameservers, as below.
echo $http_proxy # gives http://username:password@your.proxy.com:80

You have a proxy set: http://username:password@your.proxy.com:80.

Obviously, it doesn't exist on the Internet:

$ nslookup your.proxy.com
Server:         127.0.1.1
Address:        127.0.1.1#53

** server can't find your.proxy.com: NXDOMAIN

Either unset the environment variable $http_proxy, or set up a real proxy and change the environment variable accordingly.
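If a proxy is actually required, one way to do the second option in Scrapy is the per-request meta['proxy'] key, which the HttpProxyMiddleware honours instead of the environment variable. A minimal sketch, assuming a reachable proxy at proxy.example.com (a placeholder):

from scrapy.http import Request

# Route a single request through an explicit proxy instead of $http_proxy.
# 'proxy.example.com:80' stands in for a real, reachable proxy host.
req = Request('http://www.linkedin.com/nhome/',
              meta={'proxy': 'http://proxy.example.com:80'})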