Send request to page with windows-1251 encoding fr

Posted 2019-08-11 16:10

Question:

I need to get a page's source (HTML) and convert it to UTF-8, because I want to find some text in the page (for example, if 'my_same_text' in page_source: then ...). The page contains Russian text (Cyrillic symbols) and this tag:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">

I use Flask and the requests Python library. I send the request with source = requests.get('url/')

if 'сyrillic symbols' in source.text: ...

and I can't find my text because of the encoding. How can I convert the text to UTF-8? I tried .encode() and .decode(), but it did not help.

Answer 1:

Let's create a page with a windows-1251 charset declared in the meta tag and some Russian nonsense text. I saved it in Sublime Text as a windows-1251 file, to be sure.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
 </head>
 <body>
  <p>Привет, мир!</p>
 </body>
</html>

You can use a little trick in the requests library:

If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text.

So it goes like this:

In [1]: import requests

In [2]: result = requests.get('http://127.0.0.1:1234/1251.html')

In [3]: result.encoding = 'windows-1251'

In [4]: u'Привет' in result.text
Out[4]: True

Voila!
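To see why overriding the encoding helps, here is a minimal standard-library sketch that works on raw bytes, which is what requests exposes as result.content. The helper names sniff_meta_charset and decode_page are hypothetical, and the regular expression is a simplification rather than a real HTML parser, so treat it as an illustration only.

```python
import re

# Hypothetical helper: find "charset=..." near the top of an HTML page.
_META_CHARSET = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def sniff_meta_charset(raw, default="utf-8"):
    """Return the charset named in the page's meta tag, or `default`."""
    match = _META_CHARSET.search(raw[:2048])
    return match.group(1).decode("ascii") if match else default

def decode_page(raw):
    """Decode raw page bytes with the charset the page declares for itself."""
    return raw.decode(sniff_meta_charset(raw), errors="replace")

# Simulated response body (assumption: stands in for result.content).
raw_page = (
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">'
    '<p>Привет, мир!</p>'
).encode("windows-1251")

page = decode_page(raw_page)
found = "Привет" in page                           # decoded with the right codec
wrong = "Привет" in raw_page.decode("iso-8859-1")  # decoded with the wrong codec
```

With the right codec the search succeeds; with ISO-8859-1 the same bytes turn into Latin letters with accents, so the Cyrillic search string is never found.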

If it doesn't work for you, there's a slightly uglier approach.

You should take a look at what encoding the web server is sending you.

It may be that the encoding of the response is reported as ISO-8859-1 (close to cp1252, but not identical to it), or something else entirely, and neither UTF-8 nor cp1251. requests falls back to ISO-8859-1 when the Content-Type header is text/* without a charset, so the result depends on the web server!

In [1]: import requests

In [2]: result = requests.get('http://127.0.0.1:1234/1251.html')

In [3]: result.encoding
Out[3]: 'ISO-8859-1'

So we should recode it accordingly.

In [4]: u'Привет'.encode('cp1251').decode('iso-8859-1') in result.text
Out[4]: True

But that looks ugly to me (also, I'm not great with encodings, and it's really not the best solution). I'd go with resetting the encoding through requests itself.
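The recoding trick garbles the search string so that it matches the garbled page. The same idea can be run in the other direction to repair the page text instead. This is only a sketch, assuming the text was decoded as ISO-8859-1 while the real encoding is windows-1251; fix_mojibake is a hypothetical helper name.

```python
def fix_mojibake(text, wrong="iso-8859-1", actual="windows-1251"):
    """Undo a decode done with `wrong` and redo it with `actual`."""
    # ISO-8859-1 maps every byte to one character, so encoding back is lossless.
    return text.encode(wrong).decode(actual)

original = "Привет, мир!"
# Simulate what result.text looks like after the wrong decode.
garbled = original.encode("windows-1251").decode("iso-8859-1")
repaired = fix_mojibake(garbled)
```

This only round-trips cleanly when the wrong codec covers all 256 byte values, which is one more reason to prefer setting result.encoding before reading result.text.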



Answer 2:

As documented, requests automatically decodes response.text to unicode, so you must either look for a unicode string:

if u'cyrillic symbols' in source.text:
    # ...

or encode response.text in the appropriate encoding:

# -*- coding: utf-8 -*-
# (....)
if 'cyrillic symbols' in source.text.encode("utf-8"):
    # ...

The first solution is much simpler and lighter.
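The snippets in this answer use Python 2 string semantics. As a rough sketch of how the same comparison behaves on Python 3, where every plain string literal is already unicode, the example below uses locally built values (body and text are stand-ins for response.content and response.text, not real responses).

```python
# Stand-ins for response.content (bytes) and response.text (str).
body = "<p>Привет, мир!</p>".encode("windows-1251")
text = body.decode("windows-1251")

in_text = "Привет" in text                           # str searched in str
in_bytes = "Привет".encode("windows-1251") in body   # bytes searched in bytes

try:
    "Привет" in body   # str searched in bytes is a type error on Python 3
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Both sides of the `in` test have to be the same kind of string, which is why searching the decoded text is the simpler option.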