While scraping some documents with Python 2.7, I've run into some annoying page separators, which I've decided to remove. The separators use some funky characters. I already asked one question here on how to make these characters reveal their UTF-8 codes. Two unusual characters are used: '\xc2\xad' and '\x0c'. Now I just need to remove these characters, as well as some spaces and the page numbers.
Elsewhere on SO, I've seen Unicode characters used in tandem with regexps, but in a format that I don't have these characters in, e.g. '\u00ab'. In addition, none of those examples mix ASCII and non-ASCII characters. Finally, the Python docs are very light on the subject of Unicode in regexes; they mention something about flags, but I couldn't make sense of it. Can anyone help?
Here is my current usage, which does not do what I want:
re.sub('\\xc2\\xad\s\d+\s\\xc2\\xad\s\\x0c', '', my_str)
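For reference, here is a minimal sketch of the kind of substitution I'm after. It assumes the text is a UTF-8 byte string (a plain str in Python 2.7) and that a separator is laid out as '\xc2\xad', page number, '\xc2\xad', '\x0c', with optional whitespace in between. The names PAGE_SEP and strip_page_separators, and the sample string, are made up for illustration; the real separators may be laid out differently.

```python
import re

# Assumed separator layout: \xc2\xad, page number, \xc2\xad, \x0c,
# with optional whitespace between the pieces.
# The \xc2\xad and \x0c escapes are resolved by the bytes literal itself,
# while \\s and \\d are passed through for the regex engine to interpret.
PAGE_SEP = re.compile(b'\xc2\xad\\s*\\d+\\s*\xc2\xad\\s*\x0c')

def strip_page_separators(data):
    """Remove every page separator from a UTF-8 encoded byte string."""
    return PAGE_SEP.sub(b'', data)

# Hypothetical sample input, not taken from the real documents.
sample = b'end of page\xc2\xad 12 \xc2\xad \x0cstart of next'
cleaned = strip_page_separators(sample)
```

If the text were decoded to unicode first, the two bytes '\xc2\xad' would become the single character u'\u00ad', and the pattern would need to be a unicode string written with that escape instead.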