HTML Link parsing using BeautifulSoup

Posted 2019-06-14 08:46

Question:

Here is the Python code I'm using to extract specific HTML from the page links I pass as a parameter. I'm using BeautifulSoup. The code works fine sometimes, and sometimes it gets stuck!

import urllib
from bs4 import BeautifulSoup

rawHtml = ''
url = r'http://iasexamportal.com/civilservices/tag/voice-notes?page='
for i in range(1, 49):  
    #iterate url and capture content
    sock = urllib.urlopen(url+ str(i))
    html = sock.read()  
    sock.close()
    rawHtml += html
    print i

Here I'm printing the loop variable to find out where it gets stuck. It shows that the script stalls at a random iteration of the loop.
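One way to keep a single hung request from stalling the whole loop is to give each fetch an explicit timeout and a simple retry. This is only a sketch, assuming Python 2.7 as stated below; the 10-second timeout and the retry count of 3 are illustrative values, not anything from the question:

import time
import urllib2

def fetch(page_url, timeout=10, retries=3):
    # Try the request a few times; a stalled request fails fast instead of hanging.
    for attempt in range(retries):
        try:
            return urllib2.urlopen(page_url, timeout=timeout).read()
        except IOError as e:  # covers urllib2.URLError and socket errors in Python 2
            print 'attempt %d failed for %s: %s' % (attempt + 1, page_url, e)
            time.sleep(2)  # brief pause before retrying
    return ''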

soup = BeautifulSoup(rawHtml, 'html.parser')
t=''
for link in soup.find_all('a'):
    t += str(link.get('href')) + "</br>"
    #t += str(link) + "</br>"
f = open("Link.txt", 'w+')
f.write(t)
f.close()

What could the possible issue be? Is it a problem with the socket configuration, or something else?

This is the error I got. I checked these links for a solution: python-gaierror-errno-11004 and ioerror-errno-socket-error-errno-11004-getaddrinfo-failed. But I didn't find them very helpful.

 d:\python>python ext.py
Traceback (most recent call last):
  File "ext.py", line 8, in <module>
    sock = urllib.urlopen(url+ str(i))
  File "d:\python\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "d:\python\lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "d:\python\lib\urllib.py", line 350, in open_http
    h.endheaders(data)
  File "d:\python\lib\httplib.py", line 1049, in endheaders
    self._send_output(message_body)
  File "d:\python\lib\httplib.py", line 893, in _send_output
    self.send(msg)
  File "d:\python\lib\httplib.py", line 855, in send
    self.connect()
  File "d:\python\lib\httplib.py", line 832, in connect
    self.timeout, self.source_address)
  File "d:\python\lib\socket.py", line 557, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed
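The last frame of the traceback shows the failure happens inside getaddrinfo, i.e. host-name resolution. A quick check of whether the name resolves at all on the office machine could look like this (a small sketch, assuming Python 2.7):

import socket

# Errno 11004 means getaddrinfo (the DNS lookup for the host) failed.
try:
    print socket.getaddrinfo('iasexamportal.com', 80, 0, socket.SOCK_STREAM)
except socket.gaierror as e:
    print 'DNS lookup failed:', e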

It runs perfectly fine when I run it on my personal laptop, but it gives this error when I run it on my office desktop. Also, my Python version is 2.7. Hope this information helps.

Answer 1:

Finally, guys... it worked! The same script also worked when I checked it on other PCs, so the problem was probably the firewall or proxy settings of my office desktop, which were blocking this website.
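If the office network does require going through a proxy, urllib in Python 2 can be pointed at it explicitly via its proxies argument. A minimal sketch; the proxy host and port below are placeholders, not real values:

import urllib

# Placeholder proxy address -- replace with the office network's actual proxy.
proxies = {'http': 'http://proxy.example.com:8080'}
sock = urllib.urlopen('http://iasexamportal.com/civilservices/tag/voice-notes?page=1',
                      proxies=proxies)
print len(sock.read())
sock.close()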