Scrapy recursive link crawler

Posted 2019-07-23 21:02

It starts with a URL on the web (e.g. http://python.org), fetches the web page at that URL, and parses all the links on that page into a repository of links. Next, it fetches the content of one of the URLs from the repository just created, parses the links from this new content into the repository, and continues this process for every link in the repository until it is stopped or a given number of links has been fetched.

How can I do that using Python and Scrapy? I am able to scrape all the links on a single web page, but how do I follow them recursively in depth?
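
For reference, the usual Scrapy pattern for this is a single spider whose parse callback follows the links it extracts; here is a minimal sketch of that idea (the spider name, start URL, and the depth and page-count limits below are illustrative assumptions, not requirements):

import scrapy


class LinkSpider(scrapy.Spider):
    """Sketch of a recursive link crawler; run with: scrapy runspider linkspider.py"""
    name = "links"
    start_urls = ["http://python.org"]
    custom_settings = {
        "DEPTH_LIMIT": 2,              # stop recursing past this depth
        "CLOSESPIDER_PAGECOUNT": 100,  # stop after a given number of pages
    }

    def parse(self, response):
        # Record every link on the page, then follow each one; Scrapy's
        # scheduler and duplicate filter act as the "repository" of links.
        for href in response.css("a::attr(href)").extract():
            link = response.urljoin(href)
            if link.startswith("http"):
                yield {"link": link}
                yield response.follow(link, callback=self.parse)
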

2 Answers
We Are One
Answered 2019-07-23 21:40

Here is the main crawl method, written to scrape links recursively from a web page. It crawls one URL and puts all the URLs it finds into a global buffer; multiple worker threads then pop URLs from that buffer and call this same crawl method on them.

def crawl(self,urlObj):
    '''Main method to crawl URLs.

    Excerpted from the Crawler class in the repository linked below: it relies
    on the module-level globals CRAWLED_URLS, VISITED_URLS and CRAWL_BUFFER,
    on urlcon (urllib.request), urlparse/urlunparse, BeautifulSoup, and on the
    class's own scrap(), checkmax() and _lock members.
    '''

    try:
        if ((urlObj.valid) and (urlObj.url not in CRAWLED_URLS.keys())):
            rsp = urlcon.urlopen(urlObj.url,timeout=2)
            hCode = rsp.read()
            soup = BeautifulSoup(hCode, "html.parser")  # explicit parser avoids the bs4 warning
            links = self.scrap(soup)
            boolStatus = self.checkmax()
            if boolStatus:
                CRAWLED_URLS.setdefault(urlObj.url,"True")
            else:
                return
            for eachLink in links:
                if eachLink not in VISITED_URLS:
                    parsedURL = urlparse(eachLink)
                    if parsedURL.scheme and "javascript" in parsedURL.scheme:
                        #print("***************Javascript found in scheme " + str(eachLink) + "**************")
                        continue
                    '''Handle internal URLs '''
                    try:
                        if not parsedURL.scheme and not parsedURL.netloc:
                            #print("No scheme and host found for "  + str(eachLink))
                            newURL = urlunparse(parsedURL._replace(**{"scheme":urlObj.scheme,"netloc":urlObj.netloc}))
                            eachLink = newURL
                        elif not parsedURL.scheme :
                            #print("Scheme not found for " + str(eachLink))
                            newURL = urlunparse(parsedURL._replace(**{"scheme":urlObj.scheme}))
                            eachLink = newURL
                        if eachLink not in VISITED_URLS: #Check again for internal URL's
                            #print(" Found child link " + eachLink)
                            CRAWL_BUFFER.append(eachLink)
                            with self._lock:
                                self.count += 1
                                #print(" Count is =================> " + str(self.count))
                            boolStatus = self.checkmax()
                            if boolStatus:
                                VISITED_URLS.setdefault(eachLink, "True")
                            else:
                                return
                    except TypeError:
                        print("Type error occured ")
        else:
            print("URL already present in visited " + str(urlObj.url))
    except socket.timeout as e:
        print("**************** Socket timeout occured*******************" )
    except URLError as e:
        if isinstance(e.reason, ConnectionRefusedError):
            print("**************** Conn refused error occured*******************")
        elif isinstance(e.reason, socket.timeout):
            print("**************** Socket timed out error occured***************" )
        elif isinstance(e.reason, OSError):
            print("**************** OS error occured*************")
        elif isinstance(e,HTTPError):
            print("**************** HTTP Error occured*************")
        else:
            print("**************** URL Error occured***************")
    except Exception as e:
        print("Unknown exception occured while fetching HTML code" + str(e))
        traceback.print_exc()
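
As a rough illustration of the threading side described above (not the repository's actual code), worker threads can share a thread-safe queue as the crawl buffer. Every name below, the four-thread count, and the 100-URL cap are assumptions made for the sketch; crawl_one_page() is only a stand-in for the crawl() method:

import queue
import threading
import urllib.request
from urllib.parse import urljoin

from bs4 import BeautifulSoup

url_queue = queue.Queue()          # shared buffer of URLs waiting to be crawled
seen = set()                       # URLs already queued, to avoid re-crawling
seen_lock = threading.Lock()

def crawl_one_page(url):
    """Fetch one page and return the absolute links found on it (stand-in for crawl())."""
    html = urllib.request.urlopen(url, timeout=5).read()
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

def worker():
    while True:
        try:
            url = url_queue.get(timeout=5)      # wait for work from the shared buffer
        except queue.Empty:
            return                              # nothing left to do, let the thread exit
        try:
            for link in crawl_one_page(url):
                with seen_lock:
                    if link not in seen and len(seen) < 100:
                        seen.add(link)
                        url_queue.put(link)     # feed newly found links back into the buffer
        except Exception as exc:
            print("Failed to fetch", url, exc)
        finally:
            url_queue.task_done()

if __name__ == "__main__":
    url_queue.put("http://python.org")
    for _ in range(4):
        threading.Thread(target=worker, daemon=True).start()
    url_queue.join()                            # block until every queued URL is processed
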

The complete source code and instructions are available at https://github.com/tarunbansal/crawler

孤傲高冷的网名
Answered 2019-07-23 21:53

Several remarks:

  • You don't need Scrapy for such a simple task: urllib (or Requests) plus an HTML parser (Beautiful Soup, etc.) can do the job.
  • It is usually better to crawl breadth-first (BFS): keeping a queue of pending links and a set of visited links makes it easy to avoid circular references (see the sketch right after this list).
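
To illustrate the BFS point in isolation, here is a minimal sketch (Python 3) using an explicit deque and a visited set; it assumes the requests and Beautiful Soup packages, and the start URL and 50-page cap are arbitrary:

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def bfs_crawl(start_url, max_pages=50):
    to_visit = deque([start_url])              # FIFO queue gives breadth-first order
    visited = set()
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        if url in visited:
            continue                           # the visited set breaks circular references
        visited.add(url)
        try:
            soup = BeautifulSoup(requests.get(url, timeout=5).text, "html.parser")
        except requests.RequestException:
            continue                           # skip pages that fail to load
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in visited:
                to_visit.append(link)
    return visited

if __name__ == "__main__":
    for page in bfs_crawl("http://www.python.org/"):
        print(page)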

Below is a simple implementation: it does not follow internal links (only fully formed absolute hyperlinks), it has no error handling (403, 404, pages with no links, ...), and it is abysmally slow (the multiprocessing module can help a lot here; a sketch of that follows the output).

import BeautifulSoup
import urllib2
import itertools
import random


class Crawler(object):
    """docstring for Crawler"""

    def __init__(self):

        self.soup = None                                        # Beautiful Soup object
        self.current_page   = "http://www.python.org/"          # Current page's address
        self.links          = set()                             # Set of every link fetched so far
        self.visited_links  = set()

        self.counter = 0 # Simple counter for debug purpose

    def open(self):

        # Open url
        print self.counter , ":", self.current_page
        res = urllib2.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page) 

        # Fetch every link on the page
        self.soup = BeautifulSoup.BeautifulSoup(html_code)

        page_links = []
        try :
            # .get('href', '') avoids a TypeError on <a> tags with no href attribute
            page_links = itertools.ifilter(  # Only deal with absolute links
                                            lambda href: 'http://' in href,
                                                ( a.get('href', '') for a in self.soup.findAll('a') )  )
        except Exception: # Magnificent exception handling
            pass



        # Update links
        self.links = self.links.union( set(page_links) )

        # Choose a random url from the non-visited set (if any are left)
        remaining = self.links.difference(self.visited_links)
        if remaining:
            self.current_page = random.sample( remaining, 1 )[0]
        self.counter += 1


    def run(self):

        # Crawl 3 webpages (or stop once every known url has been visited)
        while len(self.visited_links) < 3 and self.visited_links != self.links:
            self.open()

        for link in self.links:
            print link



if __name__ == '__main__':

    C = Crawler()
    C.run()

Output:

In [48]: run BFScrawler.py
0 : http://www.python.org/
1 : http://twistedmatrix.com/trac/
2 : http://www.flowroute.com/
http://www.egenix.com/files/python/mxODBC.html
http://wiki.python.org/moin/PyQt
http://wiki.python.org/moin/DatabaseProgramming/
http://wiki.python.org/moin/CgiScripts
http://wiki.python.org/moin/WebProgramming
http://trac.edgewall.org/
http://www.facebook.com/flowroute
http://www.flowroute.com/
http://www.opensource.org/licenses/mit-license.php
http://roundup.sourceforge.net/
http://www.zope.org/
http://www.linkedin.com/company/flowroute
http://wiki.python.org/moin/TkInter
http://pypi.python.org/pypi
http://pycon.org/#calendar
http://dyn.com/
http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics
http://www.pygame.org/news.html
http://www.turbogears.org/
http://www.openbookproject.net/pybiblio/
http://wiki.python.org/moin/IntegratedDevelopmentEnvironments
http://support.flowroute.com/forums
http://www.pentangle.net/python/handbook/
http://dreamhost.com/?q=twisted
http://www.vrplumber.com/py3d.py
http://sourceforge.net/projects/mysql-python
http://wiki.python.org/moin/GuiProgramming
http://software-carpentry.org/
http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics
http://wiki.python.org/moin/WxPython
http://wiki.python.org/moin/PythonXml
http://www.pytennessee.org/
http://labs.twistedmatrix.com/
http://www.found.no/
http://www.prnewswire.com/news-releases/voip-innovator-flowroute-relocates-to-seattle-190011751.html
http://www.timparkin.co.uk/
http://docs.python.org/howto/sockets.html
http://blog.python.org/
http://docs.python.org/devguide/
http://www.djangoproject.com/
http://buildbot.net/trac
http://docs.python.org/3/
http://www.prnewswire.com/news-releases/flowroute-joins-voxbones-inum-network-for-global-voip-calling-197319371.html
http://www.psfmember.org
http://docs.python.org/2/
http://wiki.python.org/moin/Languages
http://sip-trunking.tmcnet.com/topics/enterprise-voip/articles/341902-grandstream-ip-voice-solutions-receive-flowroute-certification.htm
http://www.twitter.com/flowroute
http://wiki.python.org/moin/NumericAndScientific
http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
http://freecode.com/projects/pykyra
http://www.xs4all.com/
http://blog.flowroute.com
http://wiki.python.org/moin/PyGtk
http://twistedmatrix.com/trac/
http://wiki.python.org/moin/
http://wiki.python.org/moin/Python2orPython3
http://stackoverflow.com/questions/tagged/twisted
http://www.pycon.org/
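
Regarding the speed issue mentioned in the remarks, here is a rough sketch (Python 3) of how the multiprocessing module could fetch a batch of queued URLs in parallel; the pool size and the example URLs are arbitrary, and merging the results back into the crawl loop is left out:

import multiprocessing
import urllib.request

def fetch(url):
    """Download one page; return (url, html) or (url, None) on failure."""
    try:
        return url, urllib.request.urlopen(url, timeout=5).read()
    except Exception:
        return url, None

if __name__ == "__main__":
    urls = ["http://www.python.org/", "http://twistedmatrix.com/trac/"]
    with multiprocessing.Pool(4) as pool:
        for url, html in pool.map(fetch, urls):
            print(url, "->", "failed" if html is None else "%d bytes" % len(html))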