Filtering text encoded with UTF-8 to only contain Latin characters

Posted 2019-07-29 06:13

Question:

I'm trying to filter text data so that it contains only Latin characters, for further text analysis. The original text source most likely contained the Korean alphabet, which shows up in the text file like this:

\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION

What would be the fastest/easiest/most complete way to remove these? I tried writing a script that removes all \xXX combinations, but it turns out there are too many exceptions for this to be reliable.

Is there a way to remove all non-Latin characters from UTF-8 encoded text?

Thanks in advance.

SOLUTION:

import string

# Decode the raw bytes to a str, then keep only printable ASCII characters
textin = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'.decode('UTF-8')
outtext = ''

for char in textin:
    if char in string.printable:
        outtext += char

print(outtext)

My data was stored as bytes for some reason, don't ask me why. :D
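For what it's worth, the same filter can be written more compactly (a sketch equivalent to the loop above; the encode/decode variant is my addition and keeps all ASCII characters, not just printable ones):

import string

textin = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'.decode('UTF-8')

# Same filter as a one-liner
outtext = ''.join(char for char in textin if char in string.printable)

# Alternative: drop everything non-ASCII via an encode/decode round trip
outtext_ascii = textin.encode('ascii', errors='ignore').decode('ascii')

print(outtext)
print(outtext_ascii)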

Answer 1:

What about this:

import string

intext = b'<your funny characters>'
outtext = ''

# Decode the bytes first, then keep only ASCII letters (A-Z, a-z)
for char in intext.decode('utf-8'):
    if char in string.ascii_letters:
        outtext += char

However, I'm not sure this is what you want. For the given intext, outtext is empty. If you append string.digits to string.ascii_letters, outtext is '11'. Note that string.ascii_letters contains only A-Z and a-z, so digits, spaces, and punctuation are all dropped.

(edited to fix a mistake in the code, pointed out by OP)
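As a sketch of the digits suggestion above, widening the allowed set is plain string concatenation (the space character in the allowed set is my addition, to preserve word boundaries):

import string

intext = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'

# Allow letters, digits, and spaces
allowed = string.ascii_letters + string.digits + ' '
outtext = ''.join(char for char in intext.decode('utf-8') if char in allowed)

print(outtext)  # 8 ONE PIECE FILM GOLD Bluray GOLDEN LIMITED EDITION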



Answer 2:

When reading the CSV file, try specifying the encoding:

import pandas as pd

# utf-8-sig also strips a leading UTF-8 byte order mark (BOM), if present
df = pd.read_csv('D:/sample.csv', encoding="utf-8-sig")
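If the loaded columns then need the same cleanup, the filter from the accepted solution can be applied column-wise (a sketch; the file path and the 'title' column name are assumptions, not from the original answer):

import string

import pandas as pd

df = pd.read_csv('D:/sample.csv', encoding='utf-8-sig')

# Keep only printable ASCII characters in the hypothetical 'title' column
df['title'] = df['title'].map(
    lambda value: ''.join(char for char in str(value) if char in string.printable)
)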