I am trying to load an HTML page and output the text. Even though I am getting the webpage correctly, BeautifulSoup somehow destroys the encoding.
Source:
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulSoup
url = "http://www.columbia.edu/~fdc/utf8/"
r = requests.get(url)
encodedText = r.text.encode("utf-8")
soup = BeautifulSoup(encodedText)
text = str(soup.findAll(text=True))
print text.decode("utf-8")
Excerpt Output:
...Odenw\xc3\xa4lderisch...
This should be Odenwälderisch.
It's not BeautifulSoup's fault. You can see this by printing out encodedText before you ever use BeautifulSoup: the non-ASCII characters are already gibberish.

The problem here is that you are mixing up bytes and characters. For a good overview of the difference, read one of Joel's articles, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.
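A tiny illustration of that last sentence (my own sketch, not part of the original answer):

# Characters vs. bytes: encoding goes one way, decoding goes the other.
ch = u"\xe4"                 # the character ä
raw = ch.encode("utf-8")     # encode: characters -> bytes (here '\xc3\xa4')
back = raw.decode("utf-8")   # decode: bytes -> characters
assert back == ch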
A look at the requests documentation shows that r.text is made of characters, not bytes. You shouldn't be encoding it. If you try to do so, you will make a byte string, and when you try to treat that as characters, bad things will happen.

There are two ways to get around this (both are sketched in the example below):

1. Take the raw bytes from r.content, as Martijn suggested. Then you can decode them yourself to turn them into characters.
2. Let requests do the decoding, but make sure it uses the right codec. Since you know that's UTF-8 in this case, you can set r.encoding = 'utf-8'. If you do this before you access r.text, then when you do access r.text, it will have been properly decoded, and you get a character string. You don't need to mess with character encodings at all.

Incidentally, Python 3 makes it somewhat easier to maintain the difference between character strings and byte strings, because it requires you to use different types of objects to represent them.
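As a concrete illustration of the two options (my hedged sketch using the URL from the question; it is not code from the original answer):

# Option 1: take the raw bytes and decode them yourself.
# Option 2: tell requests which codec to use, then read characters from r.text.
import requests

r = requests.get("http://www.columbia.edu/~fdc/utf8/")

text_option1 = r.content.decode("utf-8")   # bytes -> characters, done by you

r.encoding = "utf-8"                        # tell requests which codec to use
text_option2 = r.text                       # characters, decoded by requests

assert text_option1 == text_option2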
You are making two mistakes: you are mishandling the encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.
First of all, don't use response.text! It is not BeautifulSoup at fault here; you are re-encoding a Mojibake. The requests library will default to Latin-1 encoding for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that that is the default.

See the Encoding section of the Advanced documentation:

Bold emphasis mine.
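If you want to see this for yourself, a quick check might look like this (my own sketch, not taken from the answer):

# Inspect what the server sent and what encoding requests settled on.
import requests

r = requests.get("http://www.columbia.edu/~fdc/utf8/")
print(r.headers.get("content-type"))  # charset declared by the server, if any
print(r.encoding)                     # the codec requests will use for r.text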
Pass in the response.content raw data instead; a sketch of this follows below.

I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the beautifulsoup4 project, and use from bs4 import BeautifulSoup.
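A minimal sketch of both points together (assuming BeautifulSoup 4 and Python's built-in html.parser; this is my illustration, not the answer's original snippet):

# Pass the raw bytes to BeautifulSoup 4 and let it work out the encoding.
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.columbia.edu/~fdc/utf8/")
soup = BeautifulSoup(r.content, "html.parser")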
BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or statistical analysis of the bytes provided. If the server does provide a character set, you can still pass this into BeautifulSoup from the response, but do test first whether requests used a default:
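For example (a hedged sketch of that check using bs4's from_encoding parameter; not the answer's original code):

# Only pass the encoding on if the server actually declared a charset;
# otherwise leave it to BeautifulSoup's own detection.
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.columbia.edu/~fdc/utf8/")
declared = "charset" in r.headers.get("content-type", "").lower()
encoding = r.encoding if declared else None
soup = BeautifulSoup(r.content, "html.parser", from_encoding=encoding)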
Last but not least, with BeautifulSoup 4 you can extract all text from a page using soup.get_text():
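For instance (my sketch, contrasted with the str(soup.findAll(text=True)) call from the question):

# get_text() returns one readable character string; stringifying the list
# returned by findAll(text=True) goes through repr() instead.
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.columbia.edu/~fdc/utf8/")
soup = BeautifulSoup(r.content, "html.parser")

print(soup.get_text())                     # readable text, correct characters
print(str(soup.findAll(text=True))[:80])   # list repr; on Python 2 this shows \xe4-style escapes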
You are instead converting a result list (the return value of soup.findAll()) to a string. This can never work, because containers in Python use repr() on each element in the list to produce a debugging string, and for strings that means you get escape sequences for anything that is not a printable ASCII character.

There are a couple of errors in your code:
First of all, your attempt at re-encoding the text is not needed. Requests can give you the native encoding of the page and BeautifulSoup can take this info and do the decoding itself:
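Something along these lines (a hedged sketch assuming BeautifulSoup 4; not the answer's original code):

# Hand BeautifulSoup the raw bytes plus the encoding reported by requests.
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.columbia.edu/~fdc/utf8/")
soup = BeautifulSoup(r.content, "html.parser", from_encoding=r.encoding)
text = soup.get_text()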
Second of all, you have an encoding issue. You are probably trying to visualize the results on the terminal. What you will get is the escaped Unicode representation of every character in the text that is not in the ASCII set. You can check the results like this:
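One way to check (my own sketch, not the answer's original snippet) is to look for the word from the question directly and to write the text out as UTF-8, which sidesteps the terminal's representation:

# Verify the characters survived, independent of how the terminal displays them.
import io
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.columbia.edu/~fdc/utf8/")
soup = BeautifulSoup(r.content, "html.parser", from_encoding=r.encoding)
text = soup.get_text()

print(u"Odenw\xe4lderisch" in text)   # True if the umlaut came through intact

with io.open("page_text.txt", "w", encoding="utf-8") as out:
    out.write(text)                   # open this file in a UTF-8-aware editor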