I get the error "urllib.error.HTTPError: HTTP Error 403: Forbidden" when scraping certain pages, and I understand that adding something like hdr = {'User-Agent': 'Mozilla/5.0'}
to the request headers is the solution for this.
However, I can't make it work when the URLs I'm trying to scrape are in a separate source file. How/where can I add the User-Agent to the code below?
from bs4 import BeautifulSoup
import urllib.request as urllib2
import time

list_open = open("source-urls.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

i = 0
for url in line_in_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html.parser')
    name = soup.find(attrs={'class': "name"})
    description = soup.find(attrs={'class': "description"})
    for text in description:
        print(name.get_text(), ';', description.get_text())
#        time.sleep(5)
    i += 1
You can achieve the same using
requests
which lets you pass the header dict directly on each request.
Hope it helps!
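To make that concrete, here is a minimal sketch of the requests version, assuming the same source-urls.txt file as in the question; the key point is passing the hdr dict from the question via the headers= keyword of requests.get:

```python
import requests

# header dict suggested in the question
hdr = {"User-Agent": "Mozilla/5.0"}

def fetch(url):
    # headers= attaches the User-Agent to this request,
    # which is what avoids the 403 Forbidden response
    resp = requests.get(url, headers=hdr, timeout=10)
    resp.raise_for_status()
    return resp.text

def main():
    # read the URLs from the same source file as in the question,
    # skipping blank lines
    with open("source-urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    for url in urls:
        html = fetch(url)
        # parse `html` with BeautifulSoup exactly as before, e.g.:
        # soup = BeautifulSoup(html, 'html.parser')

if __name__ == "__main__":
    main()
```

If you want to stay with urllib instead, the equivalent is to wrap each URL in urllib.request.Request(url, headers=hdr) and pass that Request object to urlopen.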