I'm quite new to Python. I am trying to remove files that appear on one list from another list. The lists were produced by redirecting the output of ll -R on Mac and on Windows (but they have since gone through some processing - merging, sorting, etc. - using other Python scripts). Some file names have accents and special symbols. Some of these strings, even though they are the same (they print the same and look the same in the files that contain the lists), compare as not equal.
I found the thread about how to compare strings with special characters in Unicode: Python String Comparison--Problems With Special/Unicode Characters. This is quite similar to my problem. I did some more reading on encodings and on how to change the encoding of strings. However, I tried every codec I could find in the codecs documentation: https://docs.python.org/2/library/codecs.html For every possible pair of codecs the two strings are still not equal (see the program below - I tried both the decode and encode options).
When I go over the characters in the two strings one by one, the accented e appears as a single character (an accented e) in one file, and as two characters (an e followed by a character that prints as a space) in the other.
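To show what I mean, here is a quick check along these lines (just a sketch, with the accented word typed by hand in both forms) that prints the individual code points:

import unicodedata

one_char  = 'Adh\u00e9sion'    # accented e as a single character
two_chars = 'Adhe\u0301sion'   # e followed by a combining accent

for ch in one_char:
    print(hex(ord(ch)), unicodedata.name(ch))
print('---')
for ch in two_chars:
    print(hex(ord(ch)), unicodedata.name(ch))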
Any ideas would be appreciated.
I narrowed the two text files down to one line with one word each (with an accent, obviously). I uploaded the text files to Dropbox: testfilesindata and testmissingfiles (but I haven't tried downloading a fresh copy from Dropbox).
Many thanks!
PS. Sorry about messing with the links. I don't have reputation 10 ...
#!/usr/bin/python3
import sys
codecs = [ 'ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500', 'cp720 ', 'cp737 ', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856 ', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874 ', 'cp875 ', 'cp932', 'cp949', 'cp950', 'cp1006 ', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r ', 'koi8_u ', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig' ]
file1 = open('testmissingfiles','r')
file2 = open('testfilesindata','r')
list1 = file1.readlines()
list2 = file2.readlines()
word1 = list1[0].rstrip('\n')
word2 = list2[0].rstrip('\n')
for i in range(0,len(codecs)-1):
    for j in range(0,len(codecs)-1):
        try:
            encoded1 = word1.decode(codecs[i])
            encoded2 = word2.decode(codecs[j])
            if encoded1 == encoded2:
                sys.stdout.write('Succeeded with ' + codecs[i] + ' & ' + codecs[j] + '\n')
        except:
            pass
Use unicodedata.normalize to normalize the two strings to the same normal form:
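For example, a minimal sketch (the two spellings below are the decomposed and precomposed forms of the same word):

import unicodedata

# Visually identical, but built from different code points:
decomposed  = 'Adhe\u0301sion'   # 'e' followed by COMBINING ACUTE ACCENT
precomposed = 'Adh\u00e9sion'    # LATIN SMALL LETTER E WITH ACUTE

print(decomposed == precomposed)   # False
print(unicodedata.normalize('NFC', decomposed) == unicodedata.normalize('NFC', precomposed))   # True

Either NFC or NFD works, as long as both strings are normalized to the same form before comparing.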
You have a few problems with your program:

- Your program will generate an AttributeError exception, and consequently hit the pass branch, on every iteration of the loop. Neither word1 nor word2 has a method called .decode(). In Python 3 you can encode a string into a sequence of bytes, or decode a sequence of bytes into a string.

- The use of codecs is a red herring. Both of your input files are UTF-8 encoded, and the bytes from the files are already successfully decoded when you read them.

- Your strings are similar in appearance, but are composed of different Unicode code points. Specifically, the first "Adhésion" contains the two code points U+0065 and U+0301, LATIN SMALL LETTER E and COMBINING ACUTE ACCENT, while the second "Adhésion" contains the single code point U+00E9, LATIN SMALL LETTER E WITH ACUTE. As Daniel points out in his answer, you can check for the semantic equivalence of these distinct strings by normalizing them first.
Here is how I would solve your problems:
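One possible version (a sketch; it assumes both files are UTF-8, as noted above, and picks NFC as the normal form):

#!/usr/bin/python3
import unicodedata

def normalized_lines(path):
    # Read the file as UTF-8 and normalize every line to NFC,
    # so that precomposed and decomposed accents compare equal.
    with open(path, encoding='utf-8') as f:
        return [unicodedata.normalize('NFC', line.rstrip('\n')) for line in f]

missing = normalized_lines('testmissingfiles')
indata = normalized_lines('testfilesindata')

# File names from the "missing" list that do not appear in the "in data" list.
indata_set = set(indata)
not_found = [name for name in missing if name not in indata_set]

for name in not_found:
    print(name)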