According to this answer: urllib2 read to Unicode
I have to get the content type in order to convert to Unicode. However, some websites don't send a "charset".
For example, the ['content-type'] for this page is just "text/html", so I can't convert it to Unicode.
encoding = urlResponse.headers['content-type'].split('charset=')[-1]
htmlSource = unicode(htmlSource, encoding)

TypeError: 'int' object is not callable
Is there a default "encoding" (English, of course)...so that if nothing is found, I can just use that?
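For reference, splitting the header on 'charset=' is fragile when the parameter is absent. A sketch of parsing the Content-Type value with the standard library instead, with a fallback default (the function name and the latin-1 default are my assumptions, not from the question):

```python
from email.message import Message

def charset_from_content_type(content_type, default="latin-1"):
    """Parse a Content-Type header value; fall back when no charset param."""
    # Message implements RFC 2045 header parsing, so parameters like
    # `charset` are extracted correctly even with quoting or extra params.
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default

assert charset_from_content_type("text/html; charset=utf-8") == "utf-8"
assert charset_from_content_type("text/html") == "latin-1"
```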
If there's no explicit content type, it should be ISO-8859-1, as stated earlier in the answers. Unfortunately that's not always the case, which is why browser developers spent some time on algorithms that try to guess the character encoding based on the content of the page.
Luckily for you, Mark Pilgrim did all the hard work of porting the Firefox implementation to Python, in the form of the chardet module. His explanation of how it works, in one of the chapters of Dive Into Python 3, is also well worth reading.
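A minimal sketch of what using chardet might look like (the sample bytes and the latin-1 fallback are my assumptions; chardet must be installed separately):

```python
# Sketch: guessing the charset of fetched bytes with the chardet module.
# Requires: pip install chardet
import chardet

# Hypothetical sample: Latin-1 encoded bytes with no declared charset.
raw = "Quelques caractères accentués, sans charset déclaré.".encode("latin-1")

guess = chardet.detect(raw)  # dict with 'encoding' and 'confidence' keys
# chardet can give up (encoding is None), so fall back to latin-1.
text = raw.decode(guess["encoding"] or "latin-1")
```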
In theory, the default charset is ISO-8859-1. But often, this cannot be relied on. Websites which don’t send an explicit charset deserve to be reprimanded. Care to send off an angry email to the webmaster of Engadget?
Well, I just browsed the given URL, which redirects to
then hit Ctrl-U (view source) in Firefox, and it shows
@Konrad: what do you mean "seems as though ... uses ISO-8859-1"??
@alex: what makes you think it doesn't have a "charset"??
Look at the code you have (which we GUESS is the line that causes the error (please always show the FULL traceback and error message!)):
and the error message:
That means that unicode doesn't refer to the built-in function; it refers to an int. I recall that in your other question you had an assignment to that name. I suggest that you use some other name for that variable -- say use_unicode.
More suggestions: (1) always show enough code to reproduce the error (2) always read the error message.
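The failure mode is easy to reproduce with any built-in name; a minimal sketch (using str rather than unicode so it also runs on Python 3):

```python
# Rebinding a built-in name to an int reproduces the question's error.
str = 0  # shadows the built-in str, just as `unicode = <some int>` would

raised = False
try:
    str("hello")  # no longer the built-in: 'int' object is not callable
except TypeError:
    raised = True

del str  # drop the shadowing binding; the built-in becomes visible again
assert raised
assert str("hello") == "hello"  # built-in works again
```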
No, there isn't. You must guess.
Trivial approach: try to decode as UTF-8. If it works, great, it's probably UTF-8. If it doesn't, choose the most likely encoding for the kinds of pages you're browsing. For English pages that's cp1252, the Windows Western European encoding. (Which is like ISO-8859-1; in fact most browsers will use cp1252 instead of iso-8859-1 even if you specify that charset, so it's worth duplicating that behaviour.)

If you need to guess other languages, it gets very hairy. There are existing modules to help you guess in these situations. See e.g. chardet.
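The try-UTF-8-then-fall-back approach above can be sketched as a small helper (the function name is mine, not from the answer):

```python
def decode_page(raw: bytes) -> str:
    """Try UTF-8 first; on failure assume Windows Western European."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 rather than iso-8859-1, to match browser behaviour.
        return raw.decode("cp1252")

# UTF-8 input decodes on the first attempt ...
assert decode_page("café".encode("utf-8")) == "café"
# ... while cp1252 bytes fail UTF-8 validation and hit the fallback.
assert decode_page("café".encode("cp1252")) == "café"
```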
htmlSource = htmlSource.decode("utf8")

should work for most cases, unless you are crawling sites with non-English encodings. Or you could write a force-decode function like this
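The function itself didn't survive into this excerpt; a sketch of what such a force-decode helper might look like (the codec list is an assumption):

```python
def force_decode(raw, codecs=("utf-8", "cp1252", "latin-1")):
    """Try each codec in turn; latin-1 last, since it accepts any byte."""
    for codec in codecs:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    # Hypothetical last resort: never fail, substitute undecodable bytes.
    return raw.decode("latin-1", errors="replace")

assert force_decode("café".encode("utf-8")) == "café"
assert force_decode(b"caf\xe9") == "café"  # invalid UTF-8, cp1252 succeeds
```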