I get the error "urllib.error.HTTPError: HTTP Error 403: Forbidden" when scraping certain pages, and I understand that adding something like hdr = {'User-Agent': 'Mozilla/5.0'}
to the request headers is the solution for this.
However, I can't make it work when the URLs I'm trying to scrape are in a separate source file. How/where can I add the User-Agent to the code below?
from bs4 import BeautifulSoup
import urllib.request as urllib2
import time

list_open = open("source-urls.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

i = 0
for url in line_in_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html.parser')
    name = soup.find(attrs={'class': "name"})
    description = soup.find(attrs={'class': "description"})
    for text in description:
        print(name.get_text(), ';', description.get_text())
#        time.sleep(5)
    i += 1
You can achieve the same using
requests
which lets you pass the header dict directly on each request.
Hope it helps!
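To make that concrete, here is a minimal sketch of the requests version, assuming the same source-urls.txt file as in the question; the key point is passing the hdr dict from the question via the headers= keyword of requests.get:

```python
import requests

# header dict suggested in the question
hdr = {"User-Agent": "Mozilla/5.0"}

def fetch(url):
    # headers= attaches the User-Agent to this request,
    # which is what avoids the 403 Forbidden response
    resp = requests.get(url, headers=hdr, timeout=10)
    resp.raise_for_status()
    return resp.text

def main():
    # read the URLs from the same source file as in the question,
    # skipping blank lines
    with open("source-urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    for url in urls:
        html = fetch(url)
        # parse `html` with BeautifulSoup exactly as before, e.g.:
        # soup = BeautifulSoup(html, 'html.parser')

if __name__ == "__main__":
    main()
```

If you want to stay with urllib instead, the equivalent is to wrap each URL in urllib.request.Request(url, headers=hdr) and pass that Request object to urlopen.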