Python string encodings and ==

Posted 2019-08-05 17:55

I am having some trouble with strings in Python not being == when I think they should be, and I believe it has something to do with the way they are encoded. Basically, I am parsing some comma-separated values that are stored in zip archives (GTFS feeds specifically, for those who are curious).

I'm using the ZipFile class from Python's zipfile module to open certain files in the zip archives and then comparing the text there to some known values. Here's an example file:

agency_id,agency_name,agency_url,agency_phone,agency_timezone,agency_lang
ARLC,Arlington Transit,http://www.arlingtontransit.com,703-228-7433,America/New_York,en

The code I'm using is trying to identify the position of the string "agency_id" in the first line of the text so that I can use the corresponding value in any subsequent lines. Here's a snippet of the code:

from zipfile import ZipFile

zipped_feed = ZipFile(feed_name, "r")
agency_file = zipped_feed.open("agency.txt", "r")

line_num = 0
agencyline = agency_file.readline()
while agencyline:
    if line_num == 0:
        # this is the header, all we care about is the agency_id
        lineparts = agencyline.split(",")
        position = -1
        counter = 0
        for part in lineparts:
            part = part.strip()
            if part == "agency_id":
                position = counter
            counter += 1  # advance once per column, inside the for loop
        line_num += 1
        agencyline = agency_file.readline()
    else:
        .....

This code works for some zip archives, but not for others. I did some research and tried printing repr(part), and I got '\xef\xbb\xbfagency_id' instead of 'agency_id'. Does anyone know what's going on here and how I can fix it? Thanks for all the help!

4 answers
叼着烟拽天下
#2 · 2019-08-05 18:29

Simple: some of the files in your zip archives start with the Unicode BOM (Byte Order Mark). In UTF-16 and UTF-32 it indicates the byte order; the three bytes you are seeing, \xef\xbb\xbf, are its UTF-8 form, where it only marks the data as UTF-8. This means you're reading UTF-8 encoded text as a bytestring, so the BOM bytes end up glued to your first field. The easiest thing to do would be to check for it at the start of the string and remove it.
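
A minimal sketch of that check, written for Python 3 (an assumption; the snippets in the question are Python 2). The helper name strip_utf8_bom is made up for illustration, while codecs.BOM_UTF8 is the standard-library constant for the bytes \xef\xbb\xbf:

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 BOM, if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


header = strip_utf8_bom(b"\xef\xbb\xbfagency_id,agency_name")
first_field = header.split(b",")[0]
print(first_field == b"agency_id")
```

Only the first line of a file needs this treatment, since the mark appears at most once, at the very start of the data.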

beautiful°
#3 · 2019-08-05 18:33

That is a Byte Order Mark, which signals the encoding of the file and, in the case of UTF-16 and UTF-32, also its endianness. You can either interpret it or check for it and remove it from your string. To remove it you could do this:

import codecs

# Python 2: decode the bytes to unicode, then strip the leading U+FEFF (the decoded BOM)
part = unicode(part, "utf8").lstrip(codecs.BOM_UTF8.decode("utf8", "strict"))
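
Interpreting the mark, as opposed to just removing it, could look roughly like the sketch below. It is Python 3 and uses only the BOM constants from the codecs module; the function name sniff_bom and the table layout are invented for this example. The UTF-32 marks are checked first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one.

```python
import codecs

# Longest marks first: BOM_UTF32_LE starts with the bytes of BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is found."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return encoding, len(bom)
    return None, 0


raw = b"\xef\xbb\xbfagency_id"
encoding, skip = sniff_bom(raw)
# Fall back to UTF-8 when there is no mark (an assumption, not a rule).
text = raw[skip:].decode(encoding or "utf-8")
```

This is a heuristic: UTF-16 little-endian data whose first real character is U+0000 would be mistaken for UTF-32, which is unlikely for a CSV header but worth knowing.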
Deceive 欺骗
#4 · 2019-08-05 18:37

What you've got is a file that may occasionally have a Unicode Byte Order Mark at the front of the file. Some editors add it to indicate the encoding.

Here are some details: http://en.wikipedia.org/wiki/Byte_order_mark

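Bottom line is that you could look for the \xef\xbb\xbf byte string, which is the marker for UTF-8 encoded data, and just strip it. The other choice is to open the file with the codecs module, using the utf-8-sig codec, which drops a leading BOM while decoding (the plain utf-8 codec keeps it as u'\ufeff'):

import codecs

with codecs.open('input', 'r', 'utf-8-sig') as f:
    header = f.readline()

or in your case

zipped_feed = ZipFile(feed_name, "r")
# wrap the stream from zipped_feed.open(...) in a decoding reader; the
# codecs.StreamReader base class cannot decode by itself, so ask for the codec's reader
agency_file = codecs.getreader("utf-8-sig")(zipped_feed.open("agency.txt", "r"))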
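
For readers on Python 3, the same idea might be put together as follows. This is a sketch under assumptions rather than a drop-in fix: it assumes the feed is UTF-8 (with or without a BOM), the function name read_agency_ids is hypothetical, and it swaps the manual split(",") for the standard csv module.

```python
import csv
import io
import zipfile


def read_agency_ids(feed):
    """Return the agency_id column of agency.txt inside a GTFS zip.

    feed may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(feed) as zipped_feed:
        with zipped_feed.open("agency.txt") as raw:
            # utf-8-sig decodes UTF-8 and silently drops a leading BOM.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            reader = csv.DictReader(text)
            return [row["agency_id"] for row in reader]
```

Because the BOM is gone before the header is parsed, the "agency_id" key matches whether or not the archive was produced by a tool that writes the mark.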
放我归山
#5 · 2019-08-05 18:46

Your input file seems to be UTF-8 and to start with a 'ZERO WIDTH NO-BREAK SPACE' character,

import unicodedata
unicodedata.name('\xef\xbb\xbf'.decode('utf8'))
# gives: 'ZERO WIDTH NO-BREAK SPACE'

which is used as a BOM (or, more accurately, as a signature identifying the file as UTF-8: byte order has no real meaning for UTF-8, but it's commonly called a BOM anyway).
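
To see why the == comparison in the question fails, here is a small Python 3 illustration (the answer's own snippet is Python 2). The variable names are arbitrary; the bytes are the ones reported in the question.

```python
import unicodedata

raw = b"\xef\xbb\xbfagency_id"

# The plain utf-8 codec keeps the mark as an invisible first character.
plain = raw.decode("utf-8")
# The utf-8-sig codec treats it as a signature and drops it.
signed = raw.decode("utf-8-sig")

print(unicodedata.name(plain[0]))  # name of the hidden first character
print(plain == "agency_id")        # the comparison that fails
print(signed == "agency_id")       # the comparison that succeeds
```

The two strings print identically in most terminals, which is why repr() was needed to spot the difference.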
