How do I get a regular expression to recognize non

I'm extracting information from a webpage in Swedish. This page is using characters like: öäå.

My problem is that when I print the information the öäå are gone.

I'm extracting the information using Beautiful Soup. I think that the problem is that I do a bunch of regular expressions on the strings that I extract, e.g. location = re.sub(r'([^\w])+', '', location) to remove everything except for the letters. Before this I guess that Beautiful Soup encoded the strings so that the öäå became something like /x02/, a hex value.

So if I'm correct, the regexes are removing the öäå, right? By that logic the only thing left of the hex char after the regex should be the x, but there is no x in place of öäå on my page, so maybe this little theory is not correct. Either way, how do you solve this? When I later print the extracted information to my webpage I use self.response.out.write() in Google App Engine (I don't know if that helps in solving the problem).

EDIT: The encoding on the Swedish site is UTF-8 and the encoding on my site is also UTF-8.

EDIT2: You can use ISO-8859-10 for Swedish, but according to Google Chrome the encoding is Unicode (UTF-8) on this specific site.

Tags: python regex utf-8 character-encoding ascii

2 answers

祖国的老花朵

#2 · 2019-04-10 13:05

It would help if you could dump the strings before and after each step.
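A minimal sketch of what such a dump could look like, assuming Python 3 (the thread itself uses Python 2). The helper name `dump` and the sample text are hypothetical; `ascii()` is used so that non-ASCII characters show up as escapes and cannot be hidden by the terminal or browser encoding.

```python
import re


def dump(label, value):
    """Print and return an unambiguous, ASCII-only view of a value."""
    line = "%s: %s (%s)" % (label, ascii(value), type(value).__name__)
    print(line)
    return line


raw = "Malmö, Västerås"  # hypothetical text extracted from the page

before = dump("before", raw)
cleaned = re.sub(r"[^\w]+", "", raw)  # same idea as the regex in the question
after = dump("after", cleaned)
```

Comparing the two dumps shows at which step the characters disappear, and the type name shows whether a step is handling text or raw bytes.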

First check whether you are passing the re.UNICODE flag to your regex calls.


劫难

#3 · 2019-04-10 13:10

Always work in unicode and only convert to an encoded representation when necessary.
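As a rough illustration of that rule, here is a sketch assuming Python 3, where `bytes` and `str` are separate types. The function name `clean_location` and the sample input are made up for this example: decode once on the way in, process text, encode once on the way out.

```python
import re


def clean_location(raw_bytes, encoding="utf-8"):
    """Hypothetical helper: strip non-word characters from encoded input."""
    # 1. Decode once, at the input boundary.
    text = raw_bytes.decode(encoding)
    # 2. Do all processing on Unicode text.
    text = re.sub(r"[^\w]+", "", text)
    # 3. Encode once, at the output boundary.
    return text.encode(encoding)


page_bytes = "Göteborg (centrum)".encode("utf-8")  # stand-in for fetched bytes
result = clean_location(page_bytes)
```

Here `result` holds UTF-8 bytes ready to be written to a response, and the regex never sees encoded data.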

For this particular situation, you also need to use the re.U flag so \w matches unicode letters:

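#coding: utf-8
# Python 2 example; the flags argument of re.sub requires Python 2.7.

import re

location = "öäå".decode('utf-8')  # byte string -> unicode
location = re.sub(r'([^\w])+', '', location, flags=re.U)  # re.U makes \w match non-ASCII letters

print location # prints öäå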
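For comparison, a sketch of the same idea assuming Python 3, where `str` patterns are Unicode-aware by default and the old narrow behaviour is available through `re.ASCII`. The sample string is invented. The bytes case shows one way the letters can vanish without leaving an "x" behind: an escape such as `\xc3` denotes a single byte, not the characters `x`, `c` and `3`, so the whole byte is removed.

```python
import re

text = "öäå, 123!"
data = text.encode("utf-8")  # the same text as raw UTF-8 bytes

# Bytes pattern: \w only covers ASCII, so every byte of ö, ä, å is removed.
stripped_bytes = re.sub(rb"[^\w]+", b"", data)

# Str pattern: \w matches Unicode letters by default in Python 3.
stripped_text = re.sub(r"[^\w]+", "", text)

# re.ASCII restores the ASCII-only meaning of \w on str patterns.
stripped_ascii = re.sub(r"[^\w]+", "", text, flags=re.ASCII)
```

Only `stripped_text` keeps the Swedish letters; the other two results contain just the digits.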
