Python string encodings and ==

I am having some trouble with strings in python not being == when I think they should be, and I believe it has something to do with the way they are encoded. Basically, I parsing some comma-separated values that are stored in zip archives (GTFS feeds specifically, for those who are curious).

I'm using the ZipFile module in python to open certain files the zip archives and then comparing the text there to some known values. Here's an example file:

agency_id,agency_name,agency_url,agency_phone,agency_timezone,agency_lang
ARLC,Arlington Transit,http://www.arlingtontransit.com,703-228-7433,America/New_York,en

The code I'm using is trying to identify the position of the string "agency_id" in the first line of the text so that I can use the corresponding value in any subsequent lines. Here's a snippet of the code:

zipped_feed = ZipFile(feed_name, "r")
agency_file = zipped_feed.open("agency.txt", "r")

line_num = 0
agencyline = agency_file.readline()
while agencyline:
    if line_num == 0:
        # this is the header, all we care about is the agency_id
        lineparts = agencyline.split(",")
        position = -1
        counter = 0
        for part in lineparts:
            part = part.strip()
            if part == "agency_id":
                position = counter              
        counter += 1
        line_num += 1
        agencyline = agency_file.readline()
    else:
        .....

This code works for some zip archives, but not for others. I did some research and tried printing repr(part) and i got '\xef\xbb\xbfagency_id' instead of 'agency_id'. Does anyone know what's going on here and how I can fix it? Thanks for all the help!

标签： python string utf-8

4条回答

叼着烟拽天下

2楼-- · 2019-08-05 18:29

Simple: some of your zip archives are printing the Unicode BOM (Byte Order Mark) at the beginning of the string. This is used to indicate the byte order for use with multi-byte encodings. This means you're reading in a Unicode string (probably UTF-16 encoded) as a bytestring. Easiest thing to do would be check for it at the start of the string and remove it.

0人赞添加讨论(0) 举报

beautiful°

3楼-- · 2019-08-05 18:33

That is a Byte Order Mark, which tells the encoding of the file and in the case of UTF-16 and UTF-32 it also tells the endianess of the file. You can either interpret it or check for it and remove it from your string. To remove it you could do this:

import codecs

unicode(part, "utf8").lstrip(codecs.BOM_UTF8.decode("utf8", "strict"))

0人赞添加讨论(0) 举报

Deceive 欺骗

4楼-- · 2019-08-05 18:37

What you've got is a file that may occasionally have a Unicode Byte Order mark at the front of the file. Sometimes this is introduced by editors to indicate encoding.

Here's some details - http://en.wikipedia.org/wiki/Byte_order_mark

Bottom line is that you could look for the \xef\xbb\xbf string which is the marker for UTF-8 encoded data and just strip it. Or the other choice is to open it with the codecs package

with codecs.open('input', 'r', 'utf-8') as file:

or in your case

zipped_feed = ZipFile(feed_name, "r")
# adding a StreamReader around the zipped_feed.open(...)
agency_file = codecs.StreamReader(zipped_feed.open("agency.txt", "r"))

0人赞添加讨论(0) 举报

放我归山

5楼-- · 2019-08-05 18:46

Your input file seems to be utf-8 and starting with a 'ZERO WIDTH NO-BREAK SPACE'-character,

import unicodedata
unicodedata.name('\xef\xbb\xbf'.decode('utf8'))
# gives: 'ZERO WIDTH NO-BREAK SPACE'

which is used as a BOM (or more accurately to identify the file as being utf8, as byte order isn't really accurate with utf8, but it's commonly called BOM anyway)

0人赞添加讨论(0) 举报

Python string encodings and ==

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间