I need to get a page's source (HTML) and convert it to UTF-8, because I want to find some text in this page (like, if 'my_same_text' in page_source: then...). This page contains Russian text (Cyrillic symbols) and this tag:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
I use Flask and the requests Python library. I send the request with source = requests.get('url/') and then check
if 'сyrillic symbols' in source.text: ...
but I can't find my text. This is due to the encoding. How can I convert the text to UTF-8? I tried .encode() and .decode(), but it did not help.
As documented, requests automatically decodes response.text to unicode, so you must either look for a unicode string or encode response.text in the appropriate encoding. The first solution is much simpler and lighter.
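A minimal sketch of both options, assuming Python 3 (where every str is already unicode). To stay runnable without a server, it uses a made-up byte string in place of a real response: with requests, text would correspond to response.text (once it is decoded with the right charset) and page_bytes to response.content.

```python
# Stand-in for the raw body of a windows-1251 page (sample HTML is made up).
page_bytes = "<p>Привет, мир</p>".encode("windows-1251")

# What response.text holds when the body is decoded with the right charset.
text = page_bytes.decode("windows-1251")

needle = "Привет"

# Option 1: compare unicode with unicode (simpler and lighter).
found_in_text = needle in text

# Option 2: encode both sides to the page's charset and compare bytes.
found_in_bytes = needle.encode("windows-1251") in text.encode("windows-1251")

print(found_in_text, found_in_bytes)
```

Both checks only work when the two sides are of the same kind: str against str, or bytes against bytes in the same encoding.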
Let's create a page with a windows-1251 charset given in the meta tag and some Russian nonsense text. I saved it in Sublime Text as a windows-1251 file, to be sure.
You can use a little trick in the requests library: set the encoding on the response yourself, and requests will use it when decoding the text.
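The original snippet is not preserved here, so what follows is only a sketch of how that trick might look. The helper names text_with_encoding and page_contains are hypothetical, and windows-1251 is assumed because that is what the page's meta tag declares.

```python
import requests


def text_with_encoding(response: requests.Response, encoding: str) -> str:
    """Decode the response body with an encoding we pick ourselves."""
    # requests consults response.encoding each time .text is read,
    # so overriding it changes how the raw bytes are decoded.
    response.encoding = encoding
    return response.text


def page_contains(url: str, needle: str, encoding: str = "windows-1251") -> bool:
    """Fetch url and report whether needle occurs in the decoded page."""
    response = requests.get(url, timeout=10)
    return needle in text_with_encoding(response, encoding)
```

Calling page_contains with the page's URL and a Russian string would then compare unicode with unicode, as in the question's if statement.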
Voila!
If it doesn't work for you, there's a slightly uglier approach.
You should take a look at what encoding the web server is sending you.
It may be that the encoding of the response is actually cp1252 (a close relative of ISO-8859-1, though not identical to it), or whatever else, but neither utf8 nor cp1251. It may differ and depends on the web server! So we should recode it accordingly.
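A rough sketch of that recoding step, assuming Python 3. The function name recode is hypothetical, and a made-up byte string stands in for the response body. With requests, response.encoding reports the encoding it used for response.text, and that is the value to pass as wrong_encoding. ISO-8859-1 is used in the example because it maps every byte to a character, so the round trip is lossless; with cp1252 a few byte values are undefined and the encode step may fail.

```python
def recode(mis_decoded: str, wrong_encoding: str, right_encoding: str) -> str:
    """Undo a wrong decode: restore the original bytes, then decode them properly."""
    return mis_decoded.encode(wrong_encoding).decode(right_encoding)


# Stand-in for the raw body of a windows-1251 page.
raw = "Привет, мир".encode("windows-1251")

# What the text looks like when the body was decoded as ISO-8859-1 by mistake.
garbled = raw.decode("ISO-8859-1")

fixed = recode(garbled, "ISO-8859-1", "windows-1251")
print(fixed)
```

The search for Russian text fails on garbled and succeeds on fixed, which is the whole point of recoding.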
But that just looks ugly to me (also, I suck at encodings, and it's not really the best solution at all). I'd go with re-setting the encoding using requests itself.