Warning: Some characters could not be decoded, and

I'm creating a script to download some mp3 podcasts from a site and write them to a certain location. I'm nearly finished, and the files are being downloaded and created. However, I'm running into a problem where the binary data can't be fully decoded and the mp3 files won't play.

Here's my code:

import re
import os
import urllib2
from bs4 import BeautifulSoup
import time

def getHTMLstring(url):
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html)
    soupString = soup.encode('utf-8')
    return soupString

def getList(html_string):
    urlList = re.findall('(http://podcast\.travelsinamathematicalworld\.co\.uk\/mp3/.*\.mp3)', html_string)
    firstUrl = urlList[0]
    finalList = [firstUrl]

    for url in urlList:
        if url != finalList[0]:
            finalList.insert(0,url)

    return finalList

def getBinary(netLocation):
    req = urllib2.urlopen(netLocation)
    reqSoup = BeautifulSoup(req)
    reqString = reqSoup.encode('utf-8')
    return reqString

def getFilename(string):
    splitTerms = string.split('/')
    fileName = splitTerms[-1]
    return fileName

def writeFile(sourceBinary, fileName):
    with open(fileName, 'wb') as fp:
        fp.write(sourceBinary)



def main():
    htmlString = getHTMLstring('http://www.travelsinamathematicalworld.co.uk')
    urlList = getList(htmlString)

    fileFolder = 'D:\\Dropbox\\Mathematics\\Travels in a Mathematical World\\Podcasts'
    os.chdir(fileFolder)

    for url in urlList:
        name = getFilename(url)
        binary = getBinary(url)
        writeFile(binary, name)
        time.sleep(2)



if __name__ == '__main__':
    main()

When I run the code, I get the following warning in my console:

WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

I'm thinking that it has to do with the fact that the data that I'm using is encoded in UTF-8, and maybe the write method expects a different encoding? I'm new to Python (and really to programming in general), and I'm stuck.

Tags: python unicode encoding web-scraping beautifulsoup

1 Answer

祖国的老花朵

#2 · 2019-07-26 02:00

I assume you want to download some mp3 files from a list of URLs.
You can use BeautifulSoup to find those URLs in the page, but you don't need BeautifulSoup to handle the mp3 data itself. Just write the bytes of the response to disk directly.
For example,

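import urllib2

url = 'http://acl.ldc.upenn.edu/P/P96/P96-1004.pdf'
fileName = url.split('/')[-1]  # 'P96-1004.pdf'
res = urllib2.urlopen(url)
# Write the raw bytes exactly as received; no parsing or re-encoding.
with open(fileName, 'wb') as fp:
    fp.write(res.read())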
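
For large podcast files, the same idea can be written so that the bytes are streamed to disk instead of being read into memory all at once. This is only a minimal sketch, and it assumes Python 3, where urllib.request takes the place of urllib2. The helper names save_stream and download are hypothetical and do not appear in the question.

```python
import shutil
import urllib.request


def save_stream(response, file_name):
    """Copy raw bytes from a file-like response to disk without decoding."""
    with open(file_name, "wb") as fp:
        # copyfileobj reads and writes in chunks, so the whole file
        # never has to fit in memory at once.
        shutil.copyfileobj(response, fp)


def download(url, file_name):
    """Fetch url and save the body unchanged (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        save_stream(response, file_name)
```

In the question's main() loop, a call such as download(url, name) would then stand in for the getBinary and writeFile pair.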

If I hand that PDF URL to BeautifulSoup instead:

reqSoup = BeautifulSoup('http://acl.ldc.upenn.edu/P/P96/P96-1004.pdf')

reqSoup is not the PDF file but an HTML document. It actually contains:

<html><body><p>http://acl.ldc.upenn.edu/P/P96/P96-1004.pdf</p></body></html>
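
The same thing happens, roughly, in the question's getBinary: the mp3 bytes are treated as markup text, the bytes that cannot be decoded are swapped for REPLACEMENT CHARACTER (hence the warning), and encoding the result back to UTF-8 does not restore them. The short Python 3 sketch below illustrates that loss. The sample bytes are made up to stand in for the start of an mp3 file, and a plain decode with errors="replace" only approximates what BeautifulSoup does internally.

```python
# Hypothetical sample data: an ID3 tag marker followed by bytes that are
# not valid UTF-8, standing in for the start of a real mp3 file.
original = b"ID3\x03\x00\xff\xfb\x90\x64"

# Treat the binary data as text. Undecodable bytes become U+FFFD.
as_text = original.decode("utf-8", errors="replace")

# Encode back to UTF-8, as soup.encode('utf-8') would. Each U+FFFD is
# written as the three bytes EF BF BD, not as the original byte.
round_tripped = as_text.encode("utf-8")

print(original == round_tripped)            # the data no longer matches
print(len(original), len(round_tripped))    # and the length has changed
```

That is why the saved files will not play: the write method is fine, but the data was already damaged before it reached fp.write.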
