I'm trying to filter text data so that it contains only Latin characters, for further text analysis. The original text source most likely contained Korean characters. They show up like this in the text file:
\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION
What would be the fastest/easiest/most complete way to remove these? I tried writing a script that removes all \xXX combinations, but it turns out there are too many exceptions for this to be reliable.
Is there a way to remove all non-Latin characters from UTF-8 encoded text?
Thanks in advance.
SOLUTION:
import string

# The data arrived as bytes, so decode it to a str first
textin = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'.decode('UTF-8')
outtext = ''
for char in textin:
    # string.printable holds only ASCII characters
    if char in string.printable:
        outtext += char
print(outtext)
My data was stored as bytes for some reason, which is why it has to be decoded first. Don't ask me why. :D
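
One caveat with the solution above: string.printable only covers ASCII, so accented Latin letters such as é get removed as well. If those should survive, something along these lines might work. This is only a rough sketch; the helper name keep_latin is made up for illustration, and it assumes that "Latin" means any character whose Unicode name starts with LATIN, plus printable ASCII.

```python
import unicodedata

def keep_latin(text: str) -> str:
    """Keep printable ASCII plus non-ASCII characters named LATIN ... in Unicode.

    keep_latin is a hypothetical helper, not part of any library.
    """
    kept = []
    for ch in text:
        if ch.isascii():
            # Keep printable ASCII and the common whitespace controls
            if ch.isprintable() or ch in "\t\n\r":
                kept.append(ch)
        elif unicodedata.name(ch, "").startswith("LATIN"):
            # e.g. 'LATIN SMALL LETTER E WITH ACUTE' for é
            kept.append(ch)
    return "".join(kept)

raw = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'
print(keep_latin(raw.decode("utf-8")))
print(keep_latin("Pokémon 第1位"))
```

On the sample line this gives the same result as the loop above, since that line has no accented letters. The difference only shows on input like the second call, where the é is kept.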