I have been trying to create a Python web crawler that fetches a web page, reads its list of links, returns the link at a pre-specified position, and repeats that a certain number of times (defined by the count variable). My issue is that I have not been able to find a way to automate the process: I have to keep manually entering the link that the code finds.
Here is my code:
The first URL is http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Brenae.html
The count_1 is equal to 7
The position is equal to 8
import urllib
from bs4 import BeautifulSoup

count_1 = raw_input('Enter count: ')
position = raw_input('Enter position: ')
count = int(count_1)
while count > 0:
    list_of_tags = list()
    url = raw_input("Enter URL: ")
    fhand = urllib.urlopen(url).read()
    soup = BeautifulSoup(fhand, "lxml")
    tags = soup("a")
    for tag in tags:
        list_of_tags.append(tag.get("href", None))
    print list_of_tags[int(position)]
    count -= 1
All help is appreciated.
I've prepared some code with comments. Let me know if you have any doubts or further questions.
Here you go:
import requests
from lxml import html

def searchRecordInSpecificPosition(url, position):
    ## Making a request to the specified URL
    response = requests.get(url)

    ## Parsing the DOM into a tree
    tree = html.fromstring(response.content)

    ## Creating a dict of links.
    links_dict = dict()

    ## Format of the dictionary:
    ##
    ## {
    ##     1: {
    ##         'href': "http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Medina.html",
    ##         'text': "Medina"
    ##     },
    ##
    ##     2: {
    ##         'href': "http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Chiara.html",
    ##         'text': "Chiara"
    ##     },
    ##
    ##     ... and so on ...
    ## }

    counter = 1

    ## For each <a> tag found, extract its text and link (href) and insert them into links_dict
    for link in tree.xpath('//ul/li/a'):
        href = link.xpath('.//@href')[0]
        text = link.xpath('.//text()')[0]
        links_dict[counter] = dict(href=href, text=text)
        counter += 1

    return links_dict[position]['text'], links_dict[position]['href']

times_to_search = int(raw_input("Enter the amount of times to search: "))
position = int(raw_input('Enter position: '))
count = 0

print ""
while count < times_to_search:
    if count == 0:
        name, url = searchRecordInSpecificPosition("http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Brenae.html", position)
    else:
        name, url = searchRecordInSpecificPosition(url, position)

    print "[*] Name: {}".format(name)
    print "[*] URL: {}".format(url)
    print ""

    count += 1
Sample output:
➜ python scraper.py
Enter the amount of times to search: 4
Enter position: 1
[*] Name: Medina
[*] URL: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Medina.html
[*] Name: Darrius
[*] URL: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Darrius.html
[*] Name: Caydence
[*] URL: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Caydence.html
[*] Name: Peaches
[*] URL: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Peaches.html
➜
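For what it's worth, here is a minimal Python 3 sketch of the same idea using only the standard library. It factors the link extraction into a `link_at_position` helper (a name I made up for illustration) built on `html.parser.HTMLParser` instead of `lxml` or BeautifulSoup, and it uses the same 1-based position convention as the answer above. The commented-out loop at the bottom shows how the returned href becomes the next URL to fetch, which is the automation the question is after.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def link_at_position(html_text, position):
    """Return the href of the link at the given 1-based position."""
    parser = LinkCollector()
    parser.feed(html_text)
    return parser.links[position - 1]

## In the crawler loop, the returned href simply becomes the next URL
## to fetch -- no manual input needed:
##
## url = START_URL
## for _ in range(count):
##     html_text = urllib.request.urlopen(url).read().decode()
##     url = link_at_position(html_text, position)
##     print(url)
```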