I`m trying to work with Chinese text and big data in Python. Part of work is clean text from some unneeded data. For this goal I am using regexes. However I met some problems as in Python regex as in PyCharm application:
1) The data is stored in postgresql and viewed well in the columns, however, after select and pull it to the var it is displayed as a square:
When the value printed to the console is looks like:
Mentholatum 曼秀雷敦 男士 深层活炭洁面乳100g(新包装)
So I presume there is no problem with application encoding but with debug part of encoding, however, I did not find any solutions for such behaviour.
2) The example of regex that I need to care is to remove the values between Chinese brackets include them. The code I used is:
#!/usr/bin/env python
# -*- coding: utf-8 -*
import re
from pprint import pprint
import sys, locale, os
columnString = row[columnName]
startFrom = valuestoremove["startsTo"]
endWith = valuestoremove["endsAt"]
isInclude = valuestoremove["include"]
escapeCharsRegex = re.compile('([\.\^\$\*\+\?\(\)\[\{\|])')
nonASCIIregex = re.compile('([^\x00-\x7F])')
if escapeCharsRegex.match(startFrom):
startFrom = re.escape(startFrom)
if escapeCharsRegex.match(endWith):
endWith = re.escape(endWith)
if isInclude:
regex = startFrom + '(.*)' + endWith
else:
regex = '(?<=' + startFrom + ').*?(?=' + endWith + ')'
if nonASCIIregex.match(regex):
p = re.compile(ur'' + regex)
else:
p = re.compile(regex)
row[columnName] = p.sub("", columnString).strip()
But the regex does not influence on the given string. I`ve made a test with next code:
#!/usr/bin/env python
# -*- coding: utf-8 -*
import re
reg = re.compile(ur'((.*))')
string = u"巴黎欧莱雅 男士 劲能冰爽洁面啫哩(原男士劲能净爽洁面啫哩)100ml"
print string
string = reg.sub("", string)
print string
And it is work fine for me. The only difference between those two code examples is that n the first the regex values are come from the txt file with json, encoded as utf-8:
{
"between": {
"startsTo": "(",
"endsAt": ")",
"include": true,
"sequenceID": "1"
}
}, {
"between": {
"startsTo": "(",
"endsAt": ")",
"include": true,
"sequenceID": "2"
}
},{
"between": {
"startsTo": "(",
"endsAt": ")",
"include": true,
"sequenceID": "2"
}
},{
"between": {
"startsTo": "(",
"endsAt": ")",
"include": true,
"sequenceID": "2"
}
}
The Chinese brackets from the file are also viewed like the squares:
I cant find explanation or any solution for such behavior, thus the community help need
Thanks for help.