Filtering text encoded with utf-8 to only contain

2019-07-29 05:44发布

I'm trying to filter textdata to only contain latin characters, for further text analyzing. The original textsource most likely contained Korean Alphabet. This shows up like this in the text file:

\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION

What would be the fastest/easiest/most complete way to get remove these? I tried making a script that would remove all \xXX combinations, but it turns out that there are to many exceptions for this to be reliable.

Is there a way to remove all none latin characters from utf-8 encoded text?

Thanks in advance.

SOLUTION:

import string

textin = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'.decode('UTF-8')
outtext = ''

for char in textin:
    if char in string.printable:
        outtext += char

print(outtext)

my data was decoded to bits for some reason, don't ask me why. :D

标签： python encoding utf-8

2条回答

甜甜的少女心

2楼-- · 2019-07-29 06:21

While reading the csv file, try to do the encoding as:

df=pd.read_csv('D:/sample.csv',encoding="utf-8-sig")

0人赞添加讨论(0) 举报

乱世女痞

3楼-- · 2019-07-29 06:26

what about this:

import string

intext = b'<your funny characters>'
outtext = ''

for char in intext.decode('utf-8'):
    if char in string.ascii_letters:
        outtext += char

I'm not sure this is what you want however. For the given intext, outtext is empty. If you append string.digits to string.ascii_letters, outtext is '11'.

(edited to fix a mistake in the code, pointed out by OP)

0人赞添加讨论(0) 举报

Filtering text encoded with utf-8 to only contain

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间