I have a Python program that crawls data from a site and returns JSON. The crawled site declares charset=ISO-8859-1 in its meta tag. Here is the source code:
import requests

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.text
After that I extract the information with Beautiful Soup and build the JSON. The problem is that some symbols, e.g. the € symbol, are displayed as \u0080 or \x80 (in Python), so I can't use or decode them in PHP. So I tried plain_text.decode('ISO-8859-1') and plain_text.decode('cp1252') so that I could encode them afterwards as UTF-8, but every time I get the error: 'ascii' codec can't encode character u'\xf6' in position 8496: ordinal not in range(128).
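The \x80 symptom is consistent with the page actually being sent as Windows-1252 (cp1252) despite the ISO-8859-1 label: byte 0x80 is the euro sign in cp1252, but an invisible control character in ISO-8859-1. The 'ascii' error is likely a Python 2 effect, where .text is already a unicode string, so calling .decode on it first tries to encode it implicitly as ASCII. A minimal sketch of the byte-level difference, using a made-up byte string rather than the real page:

```python
# Hypothetical sample: the bytes a cp1252 page would contain for "10 €".
raw = b'10 \x80'

# Decoded with the declared charset, 0x80 becomes U+0080 (a control character).
as_latin1 = raw.decode('iso-8859-1')

# Decoded as Windows-1252, the same byte becomes U+20AC (the euro sign).
as_cp1252 = raw.decode('cp1252')

print(repr(as_latin1))
print(repr(as_cp1252))
```

Decoding has to start from the raw bytes (.content in requests), not from text that has already been decoded with the wrong charset.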
EDIT
The new code after @ChrisKoston's suggestion of using .content instead of .text:
url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.content
the_sourcecode = plain_text.decode('cp1252').encode('UTF-8')
soup = BeautifulSoup(the_sourcecode, 'html.parser')
Encoding and decoding are now possible, but the character problem remains.
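A possible reason the characters were still wrong at this point (an assumption, not confirmed in this thread): when BeautifulSoup is handed bytes it has to guess their encoding, and the page's own meta tag still says ISO-8859-1 even though the bytes were re-encoded as UTF-8. The sketch below uses only the standard library and a made-up string to show what such a mismatch would look like:

```python
# Hypothetical sample text; the real page content is not shown in the post.
text = u'10 \u20ac'

# The euro sign takes three bytes in UTF-8: E2 82 AC.
utf8_bytes = text.encode('utf-8')

# Reading those bytes as ISO-8859-1 yields three junk characters, not one euro sign.
misread = utf8_bytes.decode('iso-8859-1')

print(repr(misread))
```

Passing an already decoded string to the parser leaves nothing to guess.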
EDIT2
The solution is to decode the raw bytes directly with .content.decode('cp1252'):
url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.content.decode('cp1252')
soup = BeautifulSoup(plain_text, 'html.parser')
Special thanks to Tomalak for the solution.
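For the JSON step mentioned at the top, here is a minimal sketch of serialising the decoded text so that a JSON parser such as PHP's json_decode can read it. The dictionary and its key are made up for illustration; the real structure is not shown in this post.

```python
import json

# Hypothetical record; stands in for whatever was scraped with Beautiful Soup.
record = {'price': u'10 \u20ac'}

# Default behaviour: non-ASCII characters are escaped, so the output is pure ASCII.
escaped = json.dumps(record)

# Alternative: keep the characters as they are and encode the result as UTF-8 bytes.
utf8_json = json.dumps(record, ensure_ascii=False).encode('utf-8')

print(escaped)
```

Both forms are valid JSON. The escaped form is the safer choice when the encoding of the transport between the two programs is not fully under control.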