I'm trying to filter text data so that it contains only Latin characters, for further text analysis. The original text source most likely contained the Korean alphabet. It shows up like this in the text file:
\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION
What would be the fastest/easiest/most complete way to remove these? I tried writing a script that removes all \xXX combinations, but it turns out there are too many exceptions for this to be reliable.
Is there a way to remove all non-Latin characters from UTF-8 encoded text?
Thanks in advance.
SOLUTION:
import string

# Decode the raw bytes to a str, then keep only the characters that
# Python treats as printable ASCII (letters, digits, punctuation, whitespace).
textin = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'.decode('UTF-8')
outtext = ''
for char in textin:
    if char in string.printable:
        outtext += char
print(outtext)  # -> '8 ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'
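Note that string.printable also keeps digits, punctuation, and whitespace, which is why the '8' survives the filter. The same filter can be written more compactly as a join over a generator expression; this is just an equivalent rewrite of the loop above:

import string

textin = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'.decode('UTF-8')
# Same filter as the explicit loop, as a one-liner
outtext = ''.join(char for char in textin if char in string.printable)
print(outtext)  # -> '8 ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'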
My data ended up as raw bytes for some reason, don't ask me why. :D
While reading the CSV file, try to do the encoding like this:
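The snippet that went with this suggestion isn't shown; here is a minimal sketch of the idea, assuming the built-in open() and a hypothetical file name data.csv. Decoding as ASCII with errors='ignore' silently drops every non-ASCII (and hence every non-Latin) byte at read time:

# Sketch only: 'data.csv' is a hypothetical file name.
# Decoding as ASCII and ignoring errors drops all non-ASCII bytes while reading.
with open('data.csv', encoding='ascii', errors='ignore') as f:
    text = f.read()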
what about this:
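The code this refers to is missing; a plausible reconstruction, assuming intext was just the non-Latin prefix from the question and the filter used string.ascii_letters:

import string

# Sketch of the missing snippet; intext is assumed to be only the
# non-Latin prefix from the question (it decodes to CJK characters plus '8').
intext = b'\xe7\xac\xac8\xe4\xbd\x8d'.decode('UTF-8')
outtext = ''
for char in intext:
    if char in string.ascii_letters:  # keep A-Z and a-z only
        outtext += char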
I'm not sure this is what you want, however. For the given intext, outtext is empty; if you append string.digits to string.ascii_letters, outtext is '8'.
(edited to fix a mistake in the code, pointed out by OP)