Python: correct encoding of website (Beautiful Soup)

Posted 2019-01-23 19:42

Question:

I am trying to load an HTML page and output its text. Even though I receive the webpage correctly, BeautifulSoup somehow destroys the encoding.

Source:

# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulSoup

url = "http://www.columbia.edu/~fdc/utf8/"
r = requests.get(url)

encodedText = r.text.encode("utf-8")
soup = BeautifulSoup(encodedText)
text =  str(soup.findAll(text=True))
print text.decode("utf-8")

Excerpt Output:

...Odenw\xc3\xa4lderisch...

This should be Odenwälderisch.

Answer 1:

You are making two mistakes: you are mishandling encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.

First of all, don't use response.text! BeautifulSoup is not at fault here; you are re-encoding a Mojibake. The requests library defaults to Latin-1 encoding for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that that is the default.

See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.
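You can see which codec requests picked before ever touching r.text. A minimal sketch (the exact headers depend on the server; the output comments are illustrative):

import requests

r = requests.get("http://www.columbia.edu/~fdc/utf8/")
print r.headers.get('content-type')  # e.g. 'text/html' with no charset parameter
print r.encoding                     # 'ISO-8859-1', the HTTP default described above
print r.apparent_encoding            # what requests' character detection guesses from the raw bytes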

Pass in the response.content raw data instead:

soup = BeautifulSoup(r.content)

I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the beautifulsoup4 project, and use from bs4 import BeautifulSoup.
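For example:

pip install beautifulsoup4

and then in your code:

from bs4 import BeautifulSoup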

BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or by statistical analysis of the bytes provided. If the server does provide a character set, you can still pass this into BeautifulSoup from the response, but do test first whether requests used a default:

encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
soup = BeautifulSoup(r.content, from_encoding=encoding)
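If requests fell back to its ISO-8859-1 default (no charset in the headers), encoding is None here and BeautifulSoup is free to run its own detection on the raw bytes.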

Last but not least, with BeautifulSoup 4, you can extract all text from a page using soup.get_text():

text = soup.get_text()
print text

You are instead converting a result list (the return value of soup.findAll()) to a string. This can never work, because containers in Python use repr() on each element in the list to produce a debugging string, and for strings that means you get escape sequences for anything that is not a printable ASCII character.
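A quick Python 2 session illustrating that effect (hypothetical, with a UTF-8 byte string inside the list):

>>> items = [u'Odenwälderisch'.encode('utf-8')]
>>> print str(items)  # str() of a list applies repr() to each element
['Odenw\xc3\xa4lderisch']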



Answer 2:

It's not BeautifulSoup's fault. You can see this by printing out encodedText before you ever use BeautifulSoup: the non-ASCII characters are already gibberish.

The problem here is that you are mixing up bytes and characters. For a good overview of the difference, read one of Joel's articles, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.
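In code, the round trip looks like this (a Python 2 sketch; u'...' is a character string, plain '...' a byte string):

>>> s = u'ä'                # characters
>>> b = s.encode('utf-8')   # encoding: characters -> bytes
>>> b
'\xc3\xa4'
>>> b.decode('utf-8') == s  # decoding: bytes -> characters
True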

A look at the requests documentation shows that r.text is made of characters, not bytes. You shouldn't be encoding it. If you try to do so, you will make a byte string, and when you try to treat that as characters, bad things will happen.

There are two ways to get around this:

  1. Use the raw undecoded bytes, which are stored in r.content, as Martijn suggested. Then you can decode them yourself to turn them into characters.
  2. Let requests do the decoding, but just make sure it uses the right codec. Since you know that's UTF-8 in this case, you can set r.encoding = 'utf-8'. If you do this before you access r.text, then when you do access r.text, it will have been properly decoded and you get a character string. You don't need to mess with character encodings at all; see the sketch after this list.
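A minimal sketch of the second option, assuming you already know the page is UTF-8:

import requests

r = requests.get("http://www.columbia.edu/~fdc/utf8/")
r.encoding = 'utf-8'  # set the codec before the first access to r.text
text = r.text         # now a properly decoded character string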

Incidentally, Python 3 makes it somewhat easier to maintain the difference between character strings and byte strings, because it requires you to use different types of objects to represent them.
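For instance, in Python 3 text and bytes are distinct types, and mixing them raises an error instead of silently producing mojibake:

>>> type('ä'), type('ä'.encode('utf-8'))
(<class 'str'>, <class 'bytes'>)
>>> 'ä' + b'\xc3\xa4'
Traceback (most recent call last):
  ...
TypeError: can only concatenate str (not "bytes") to str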



Answer 3:

There are a couple of errors in your code:

  1. First of all, your attempt at re-encoding the text is not needed. Requests can give you the native encoding of the page and BeautifulSoup can take this info and do the decoding itself:

    # -*- coding: utf-8 -*-
    import requests
    from bs4 import BeautifulSoup  # BeautifulSoup 4; version 3 does not accept a parser argument

    url = "http://www.columbia.edu/~fdc/utf8/"
    r = requests.get(url)

    # "html5lib" requires the html5lib package to be installed
    soup = BeautifulSoup(r.text, "html5lib")
    
  2. Second, you have a display issue. You are probably trying to visualize the results in the terminal, so for every character outside the ASCII set you will see its escaped Unicode representation. You can check the results like this:

    res = [item.encode("ascii","ignore") for item in soup.find_all(text=True)]
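Alternatively, print each item on its own instead of the repr() of the whole list; that keeps the non-ASCII characters intact (a sketch, assuming a UTF-8-capable terminal):

    for item in soup.find_all(text=True):
        print item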