After writing some code in Python, I've gotten stuck. I'm new to writing code that follows OOP design in Python. The XPaths I've used in my code are flawless. My problem is running the "passing_links" method of my "Info_grabber" class through an instance of the "page_crawler" class. Every time I run my code I get the error "'page_crawler' object has no attribute 'passing_links'". Perhaps the way I've written my crawler classes is not how it should be done. I've spent a few hours on it, so I'd appreciate any suggestion as to which lines I should change to make it work. Thanks in advance for taking a look:
from lxml import html
import requests

class page_crawler(object):

    main_link = "https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=San%20Francisco%2C%20CA"
    base_link = "https://www.yellowpages.com"

    def __init__(self):
        self.links = [self.main_link]

    def crawler(self):
        for link in self.links:
            self.get_link(link)

    def get_link(self, link):
        print("Running page "+ link)
        page = requests.get(link)
        tree = html.fromstring(page.text)
        item_links = tree.xpath('//h2[@class="n"]/a[@class="business-name"][not(@itemprop="name")]/@href')
        for item_link in item_links:
            return self.base_link + item_link

        links = tree.xpath('//div[@class="pagination"]//li/a/@href')
        for url in links:
            if not self.base_link + url in self.links:
                self.links += [self.base_link + url]

class Info_grabber(page_crawler):

    def __init__(self, plinks):
        page_crawler.__init__(self)
        self.plinks = [plinks]

    def passing_links(self):
        for nlink in self.plinks:
            print(nlink)
            self.crawling_deep(nlink)

    def crawling_deep(self, uurl):
        page = requests.get(uurl)
        tree = html.fromstring(page.text)
        name = tree.findtext('.//div[@class="sales-info"]/h1')
        phone = tree.findtext('.//p[@class="phone"]')
        try:
            email = tree.xpath('//div[@class="business-card-footer"]/a[@class="email-business"]/@href')[0]
        except IndexError:
            email=""
        print(name, phone, email)

if __name__ == '__main__':
    crawl = Info_grabber(page_crawler)
    crawl.crawler()
    crawl.passing_links()
Now, upon execution, I get a new error, "raise MissingSchema(error)", when it hits the line "self.crawling_deep(nlink)".
Your crawl is an instance of the page_crawler class, not of the Info_grabber class, which is the class that has the method passing_links. I think what you want to do is make crawl an instance of Info_grabber instead.

Then I believe before doing self.crawling_deep you must do:
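As an illustration of the first point only (this is a sketch, not the snippet the answer refers to), the following uses simplified stand-in classes, `PageCrawler` and `InfoGrabber`, with no network access. It shows that a parent instance does not see a method defined on the subclass. It also shows what ends up in `plinks` when the class itself is passed to the constructor, as in the question: a value whose string form has no URL scheme, which is very likely why requests raises `MissingSchema`. The helper `has_scheme` is hypothetical and exists only for this demonstration.

```python
from urllib.parse import urlparse


class PageCrawler:
    """Simplified stand-in for the question's page_crawler (no network)."""

    def __init__(self):
        self.links = ["https://www.yellowpages.com/search"]


class InfoGrabber(PageCrawler):
    """Stand-in for Info_grabber: only this subclass defines passing_links."""

    def __init__(self, plinks):
        super().__init__()
        self.plinks = [plinks]  # same wrapping as in the question

    def passing_links(self):
        return list(self.plinks)


def has_scheme(value):
    """Hypothetical helper: does str(value) start with a scheme like https?"""
    return bool(urlparse(str(value)).scheme)


parent = PageCrawler()
print(hasattr(parent, "passing_links"))      # False: defined on the subclass only

child = InfoGrabber(PageCrawler)             # passes the class itself, as in the question
print(hasattr(child, "passing_links"))       # True
print(has_scheme(child.passing_links()[0]))  # False: a class object is not a URL
```

Under these assumptions, the subclass instance needs to be given real URLs (strings starting with `https://`), not the `page_crawler` class.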
I'm not sure I understand what you're trying to do in page_crawler.get_link, but I think you should have a different method for collecting "pagination" links.

I renamed Info_grabber.plinks to Info_grabber.links so that page_crawler.crawler can access them, and managed to extract info from several pages; however, the code is far from ideal.

You'll notice that I added a pages property and a get_pages method in page_crawler; I'll leave the implementation part to you.

You might need to add more methods to page_crawler later on, as they could be of use if you develop more child classes. Finally, consider looking into composition, as it is also a strong OOP feature.
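To make the suggestions about a separate pagination method and composition more concrete, here is a minimal sketch. It is not the answerer's code. The names `pages` and `get_pages` are borrowed from the answer, but their implementation here is a guess, and `fetch`, `extract_pages`, `extract_items`, `parse_item` and `FAKE_SITE` are all hypothetical. To keep the example runnable offline, a dictionary stands in for the website. In real use, `fetch` would presumably wrap `requests.get(url).text` and the extractors would wrap the XPath queries from the question.

```python
class PageCrawler:
    """Collects result pages and item links; knows nothing about item details."""

    def __init__(self, start_url, fetch, extract_pages, extract_items):
        self.fetch = fetch
        self.extract_pages = extract_pages
        self.extract_items = extract_items
        self.pages = [start_url]

    def get_pages(self):
        """Follow pagination links, recording each page URL once."""
        for page_url in self.pages:  # the list may grow while we iterate
            for url in self.extract_pages(self.fetch(page_url)):
                if url not in self.pages:
                    self.pages.append(url)
        return self.pages

    def item_links(self):
        """Yield every item link; a plain return would stop at the first one."""
        for page_url in self.get_pages():
            yield from self.extract_items(self.fetch(page_url))


class InfoGrabber:
    """Holds a PageCrawler (composition) instead of inheriting from it."""

    def __init__(self, crawler, parse_item):
        self.crawler = crawler
        self.parse_item = parse_item

    def grab(self):
        fetch = self.crawler.fetch
        return [self.parse_item(fetch(link)) for link in self.crawler.item_links()]


# Stand-in for the real site: each "page" is already-parsed data.
FAKE_SITE = {
    "/search?page=1": {"pages": ["/search?page=2"], "items": ["/pizza-a"]},
    "/search?page=2": {"pages": ["/search?page=1"], "items": ["/pizza-b", "/pizza-c"]},
    "/pizza-a": {"name": "Pizza A"},
    "/pizza-b": {"name": "Pizza B"},
    "/pizza-c": {"name": "Pizza C"},
}

crawler = PageCrawler(
    "/search?page=1",
    fetch=FAKE_SITE.__getitem__,
    extract_pages=lambda page: page["pages"],
    extract_items=lambda page: page["items"],
)
grabber = InfoGrabber(crawler, parse_item=lambda page: page["name"])
print(grabber.grab())  # ['Pizza A', 'Pizza B', 'Pizza C']
```

With this split, the grabber never needs to be handed the crawler class; it receives a crawler instance and asks it for links, which avoids both errors from the question.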