I'm extracting information from a webpage in Swedish. This page is using characters like: öäå.
My problem is that when I print the information the öäå are gone.
I'm extracting the information using Beautiful Soup. I think that the problem is that I do a bunch of regular expressions on the strings that I extract, e.g. location = re.sub(r'([^\w])+', '', location)
to remove everything except for the letters. Before this I guess that Beautiful Soup encoded the strings so that the öäå became something like /x02/, a hex value.
So if I'm correct, then the regexes are removing the öäå, right, I mean the only thing that should be left of the hex char is x after the regex, but there are no x instead of öäå on my page, so this little theory is maybe not correct? Anyway, if it's right or wrong, how do you solve this? When I later print the extracted information to my webpage i use self.response.out.write() in google app engine (don't know if that help in solving the problem)
EDIT: The encoding on the Swedish site is utf-8 and the encoding on my site is also utf-8. EDIT2: You can use ISO-8859-10 for Swedish, but according to google chrome the encoding is Unicode(utf-8) on this specific site
It would help if you could dump the strings before and after each step.
Check your value of
re.UNICODE
first, see thisAlways work in unicode and only convert to an encoded representation when necessary.
For this particular situation, you also need to use the
re.U
flag so\w
matches unicode letters: