Filtering text encoded with UTF-8 to only contain Latin characters

Posted 2019-07-29 05:44

I'm trying to filter text data so that it contains only Latin characters, for further text analysis. The original text source most likely contained Japanese text (the escaped bytes below decode to 第8位, "8th place"), which shows up like this in the text file:

\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION

What would be the fastest/easiest/most complete way to remove these? I tried making a script that would remove all \xXX combinations, but it turns out that there are too many exceptions for this to be reliable.

Is there a way to remove all non-Latin characters from UTF-8 encoded text?

Thanks in advance.

SOLUTION:

import string

# The raw data arrives as bytes, so decode it to a str first.
textin = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'.decode('UTF-8')
outtext = ''

# Keep only printable ASCII characters (letters, digits, punctuation,
# whitespace); everything else is dropped.
for char in textin:
    if char in string.printable:
        outtext += char

print(outtext)

My data came in as bytes for some reason, don't ask me why. :D
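
An alternative, more concise way to get the same result is to round-trip the string through ASCII with errors='ignore' (a minimal sketch, assuming it's fine to drop every non-ASCII character):

textin = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'.decode('UTF-8')

# Re-encode to ASCII, silently dropping anything that cannot be
# represented, then decode back to a str.
outtext = textin.encode('ascii', errors='ignore').decode('ascii')

print(outtext)  # '8 ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'

Unlike the string.printable loop, this also keeps ASCII control characters, which rarely matters for ordinary text files.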

2 answers
甜甜的少女心
Reply #2 · 2019-07-29 06:21

When reading the CSV file, try specifying the encoding explicitly:

import pandas as pd

df = pd.read_csv('D:/sample.csv', encoding="utf-8-sig")
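
For context, utf-8-sig behaves like utf-8 but also strips a leading byte order mark (BOM), which Excel often writes at the start of CSV files. A minimal sketch with made-up bytes:

raw = b'\xef\xbb\xbftitle\nONE PIECE FILM GOLD\n'  # UTF-8 BOM + CSV content

print(repr(raw.decode('utf-8')))      # '\ufefftitle\n...' - BOM leaks in as U+FEFF
print(repr(raw.decode('utf-8-sig')))  # 'title\n...'       - BOM removed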
乱世女痞
Reply #3 · 2019-07-29 06:26

What about this:

import string

intext = b'<your funny characters>'
outtext = ''

# Decode the bytes to a str, then keep only ASCII letters (a-z, A-Z).
for char in intext.decode('utf-8'):
    if char in string.ascii_letters:
        outtext += char

I'm not sure this is what you want, however. For the given intext, outtext is empty. If you append string.digits to string.ascii_letters, outtext is '11'.
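
Continuing the snippet above, the combined set could be built like this (a sketch; keeping the space character as well is my own assumption, not part of the original answer):

# Letters and digits, plus a space so the words stay separated.
allowed = string.ascii_letters + string.digits + ' '
outtext = ''.join(char for char in intext.decode('utf-8') if char in allowed)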

(edited to fix a mistake in the code, pointed out by OP)
