How do I get a regular expression to recognize non

Posted 2019-04-10 13:04

Question:

I'm extracting information from a webpage in Swedish. This page is using characters like: öäå.

My problem is that when I print the information the öäå are gone.

I'm extracting the information using Beautiful Soup. I think the problem is that I run a bunch of regular expressions on the strings I extract, e.g. location = re.sub(r'([^\w])+', '', location), to remove everything except the letters. Before this, I guess Beautiful Soup encoded the strings so that the öäå became something like /x02/, a hex value.

If I'm correct, the regexes are removing the öäå, right? In that case the only thing left of the hex escape after the regex should be the x, but there is no x in place of öäå on my page, so maybe this little theory is wrong. Either way, how do I solve this? When I later print the extracted information to my webpage I use self.response.out.write() in Google App Engine (I don't know if that helps in solving the problem).

EDIT: The encoding on the Swedish site is UTF-8 and the encoding on my site is also UTF-8.

EDIT2: You can use ISO-8859-10 for Swedish, but according to Google Chrome the encoding is Unicode (UTF-8) on this specific site.

Answer 1:

Always work in Unicode, and convert to an encoded representation only when necessary.
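A minimal sketch of that rule, written for Python 3, where bytes is the encoded form and str is Unicode text. The sample bytes and the helper name clean are made up for illustration and do not come from the question.

```python
import re

# Hypothetical raw bytes, as they might arrive from a UTF-8 encoded page.
raw = "Malmö, Västerås & Luleå".encode("utf-8")


def clean(text: str) -> str:
    """Drop every run of characters that are not Unicode word characters."""
    return re.sub(r"\W+", "", text)


# 1. Decode once, at the input boundary.
text = raw.decode("utf-8")

# 2. Do all processing on Unicode text.
cleaned = clean(text)

# 3. Encode once, at the output boundary (for example, a response body).
body = cleaned.encode("utf-8")
```

Keeping the decode and encode steps at the edges means the regex only ever sees real characters, never the individual bytes of a multi-byte sequence.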

For this particular situation, you also need to pass the re.U flag so that \w matches Unicode letters:

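# coding: utf-8
# Python 2 example (the flags argument of re.sub needs Python 2.7+)

import re

# Decode the UTF-8 byte string into a unicode object
location = "öäå".decode('utf-8')
# re.U makes \w match Unicode letters, so öäå survive the substitution
location = re.sub(r'([^\w])+', '', location, flags=re.U)

print location  # prints öäå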
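The code above is Python 2. As a rough Python 3 comparison: str patterns are Unicode-aware by default there, and the re.ASCII flag restricts \w to [a-zA-Z0-9_], which approximates what happens in Python 2 when re.U is missing. The sample string is invented for illustration.

```python
import re

# Hypothetical sample value; not taken from the question.
location = "Göteborg, Skåne-län!"

# With re.ASCII, \w only matches [a-zA-Z0-9_], so ö, å and ä are
# treated as non-word characters and stripped along with punctuation.
ascii_only = re.sub(r"[^\w]+", "", location, flags=re.ASCII)

# Default behaviour for str patterns in Python 3 (same as re.UNICODE):
# \w matches Unicode letters, so only punctuation and spaces are removed.
unicode_aware = re.sub(r"[^\w]+", "", location)

print(ascii(ascii_only))
print(ascii(unicode_aware))
```

If the accented letters vanish but the rest of the word survives, as in the first result, the regex is most likely running without Unicode matching.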


Answer 2:

It would help if you could dump the strings before and after each step.
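One possible way to do that, sketched in Python 3. The helper name dump and the sample data are assumptions for illustration. It uses ascii() so that non-ASCII characters show up as explicit escapes no matter how the console is configured.

```python
import re


def dump(label, value):
    """Print and return the type and ASCII-escaped repr of value."""
    line = "{}: {} {!a}".format(label, type(value).__name__, value)
    print(line)
    return line


# Hypothetical input, standing in for a string pulled out of the page.
raw = "Umeå, Sverige".encode("utf-8")
dump("raw bytes", raw)

text = raw.decode("utf-8")
dump("decoded", text)

cleaned = re.sub(r"[^\w]+", "", text)
dump("after re.sub", cleaned)
```

Comparing the dumps step by step shows exactly where the characters change: a bytes value still carries the UTF-8 byte pairs, while a correctly decoded str shows a single escape per letter.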

First, check whether the re.UNICODE flag is set when you run the regex.