I fetch some data from a web page and read it like this in Python:
origional_doc = urllib2.urlopen(url).read()
Sometimes the data contains characters such as é and ä. How can I remove these characters from the string? This is what I am trying right now:
import unicodedata
origional_doc = ''.join((c for c in unicodedata.normalize('NFD', origional_doc) if unicodedata.category(c) != 'Mn'))
But I get an error:
TypeError: must be unicode, not str
Using re, you can substitute all characters that fall in a certain hexadecimal ASCII range. You can also do the inverse and substitute anything that's NOT among the basic 128 ASCII characters:
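A minimal sketch of that inverse substitution. The helper name strip_non_ascii and the sample string are only illustrations, not part of the question's code:

```python
import re

def strip_non_ascii(text):
    # Hypothetical helper: remove every character outside the
    # ASCII range 0x00-0x7F and keep the rest untouched.
    return re.sub(r'[^\x00-\x7F]', '', text)

# u'caf\xe9 na\xefve' is "café naïve" written with escapes.
print(strip_non_ascii(u'caf\xe9 na\xefve'))
```

Note that the accented letters are dropped entirely here, not replaced by their base letters.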
This should work. It will eliminate all characters that are not ASCII.
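As for the TypeError itself: unicodedata.normalize expects a unicode string, while read() returns a byte str. If you would rather keep the base letters (é becomes e) than drop them, something along these lines may work. It is a sketch that assumes the page is UTF-8 encoded, which may not hold for your URL, and strip_accents is a hypothetical name:

```python
import unicodedata

def strip_accents(raw_bytes, encoding='utf-8'):
    # normalize() needs text, not raw bytes, so decode first.
    # The 'utf-8' default is an assumption; use the page's real encoding.
    text = raw_bytes.decode(encoding)
    # NFD splits each accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize('NFD', text)
    # Drop the combining marks (category 'Mn'), keep the base letters.
    return u''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

# The encoded sample stands in for the bytes returned by read().
print(strip_accents(u'\xe9 and \xe4'.encode('utf-8')))
```

Characters that have no decomposition into a base letter plus a mark pass through unchanged, so you may still want the re substitution afterwards if the result has to be pure ASCII.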