My problem is that Python's re.search() doesn't recognize accented characters in my regex match, even though I use UTF-8. Here is my code:
#! /usr/bin/python
# -*- coding: utf-8 -*-
import re
htmlString = '</dd><dt> Fine, thank you. </dt><dd> Molt bé, gràcies.'
SearchStr = '(\<\/dd\>\<dt\>)+ ([\w+\,\.\s]+)([\&\#\d\;]+)(\<\/dt\>\<dd\>)+ (\w+) (\w+)'
Result = re.search(SearchStr, htmlString)
if Result:
    print Result.groups()
passavol23:jO$ catalanword.py
('</dd><dt>', 'Fine, thank you.', ' ', '</dt><dd>', 'Molt', 'b')
So the problem is that it doesn't recognize the é and thus stops there. Any help would be appreciated. I'm a Python beginner.
By default, \w only matches ASCII characters; it is equivalent to [a-zA-Z0-9_]. Matching UTF-8 bytes with regular expressions is hard enough, let alone matching only word characters: you would have to match byte ranges instead.

You'll need to decode from UTF-8 to unicode and use the re.UNICODE flag instead.

However, you should really be using an HTML parser to deal with HTML. Use BeautifulSoup, for example; it will handle encoding and Unicode correctly for you.
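For the regex route, the decode-then-match step might look roughly like the sketch below. It is only an illustration, not the original pattern: the shortened sample string and the simplified pattern are assumptions made for brevity. It is written so that it should run on Python 2.7 and Python 3 alike; on Python 3, str is already Unicode and re.UNICODE is the default for str patterns, so the bytes pattern stands in for the Python 2 byte-string behaviour.

```python
# -*- coding: utf-8 -*-
import re

# Shortened sample text as UTF-8 encoded bytes (an assumption for brevity;
# this is what a plain Python 2 string literal would hold).
raw = u'</dt><dd> Molt b\u00e9, gr\u00e0cies.'.encode('utf-8')

# Matching on the raw bytes: \w is ASCII-only here, so the second word
# is cut off just before the accented character.
byte_match = re.search(br'<dd> (\w+) (\w+)', raw)

# Decode to Unicode text first, then match with re.UNICODE so that \w
# also covers accented letters.
text = raw.decode('utf-8')
text_match = re.search(r'<dd> (\w+) (\w+)', text, re.UNICODE)

print(byte_match.groups())
print(text_match.groups())
```

The first match stops at "b", as in the question, while the second captures the whole word "bé" because the pattern is applied to decoded text rather than to bytes.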