Python 'ascii' codec can't encode char

Posted 2019-07-09 13:24

Question:

I have a Python program which crawls data from a site and returns JSON. The crawled site declares charset=ISO-8859-1 in a meta tag. Here is the source code:

import requests

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.text  # .text is already-decoded text, not raw bytes

After that I extract the information with Beautiful Soup and build a JSON document from it. The problem is that some symbols, e.g. the € symbol, are displayed as \u0080 or \x80 (in Python), so I can't use or decode them in PHP. So I tried plain_text.decode('ISO-8859-1') and plain_text.decode('cp1252') so that I could encode the result as UTF-8 afterwards, but every time I get the error: 'ascii' codec can't encode character u'\xf6' in position 8496: ordinal not in range(128).
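For context on the error message: the u'\xf6' notation points to Python 2, where calling .decode() on text that is already decoded makes the interpreter first encode it implicitly with the ascii codec, and that step fails on any non-ASCII character. Python 3 has no str.decode(), so the following is only a sketch that reproduces the same kind of failure with an explicit ascii encode; the sample word is made up for illustration.

```python
# Hypothetical sample: already-decoded text containing U+00F6 ('ö').
text = "sch\u00f6n"

try:
    # The ascii codec only covers code points 0..127, so this raises.
    text.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(error_message)
```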

EDIT

The new code after @ChrisKoston's suggestion to use .content instead of .text:

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.content
the_sourcecode = plain_text.decode('cp1252').encode('UTF-8')
soup = BeautifulSoup(the_sourcecode, 'html.parser')

Encoding and decoding now work without errors, but the character problem remains.

EDIT2

The solution is to use .content.decode('cp1252'):

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.content.decode('cp1252')
soup = BeautifulSoup(plain_text, 'html.parser')

Special thanks to Tomalak for the solution

Answer 1:

You must actually store the result of decode() somewhere because it does not modify the original variable.
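A minimal sketch of that point, using a made-up byte string rather than a real response: decode() returns a new value and leaves the object it was called on untouched.

```python
# Hypothetical sample bytes; 0xE9 is 'é' in cp1252.
raw = b"caf\xe9"

raw.decode("cp1252")         # result is discarded, raw is unchanged
text = raw.decode("cp1252")  # keep the returned str in a variable

print(type(raw).__name__, type(text).__name__)
```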

Another thing:

  • decode() turns bytes into a string.
  • encode() does the opposite: it turns a string into bytes.
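To illustrate both directions, here is a small sketch built around the single byte 0x80 that the question reports seeing. It assumes the page is really encoded as cp1252, where that byte is the euro sign; under ISO-8859-1 the same byte becomes the control character U+0080, which would explain the \u0080 output.

```python
raw = b"\x80"  # the raw byte as it would appear in the response body

as_latin1 = raw.decode("iso-8859-1")  # bytes -> str: U+0080, a control character
as_cp1252 = raw.decode("cp1252")      # bytes -> str: U+20AC, the euro sign

utf8_bytes = as_cp1252.encode("utf-8")  # str -> bytes, now in UTF-8

print(repr(as_latin1), repr(as_cp1252), utf8_bytes)
```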

BeautifulSoup is happy with strings; you don't need to use encode() at all.

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'
response = requests.get(url)
html = response.content.decode('cp1252')
soup = BeautifulSoup(html, 'html.parser')
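Since the question goes on to build JSON from the parsed data, the following sketch shows what the standard json module does with correctly decoded text. The record dictionary is a hypothetical stand-in for whatever the scraper extracts. By default non-ASCII characters are written as \uXXXX escapes, which a compliant JSON parser should turn back into the original character; ensure_ascii=False keeps them literal instead.

```python
import json

# Hypothetical scraped value containing the euro sign (U+20AC).
record = {"price": "10 \u20ac"}

escaped = json.dumps(record)                      # non-ASCII written as \uXXXX
literal = json.dumps(record, ensure_ascii=False)  # non-ASCII kept as-is

# Encode explicitly when the JSON has to leave the program as UTF-8 bytes.
utf8_payload = literal.encode("utf-8")

print(escaped)
print(literal)
```

Both strings parse back to the same data, so the choice mainly affects how readable the raw JSON is.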

Hint: For working with HTML you might want to look at pyquery instead of BeautifulSoup.