Work with Chinese in Python

Published 2019-07-24 12:22

Question:

I'm trying to work with Chinese text and big data in Python. Part of the work is cleaning the text of some unneeded data, and for that I'm using regexes. However, I've run into some problems, both with Python regexes and with the PyCharm application:

1) The data is stored in PostgreSQL and displays fine in the table columns; however, after selecting it and pulling it into a variable, it is displayed as squares:

When the value is printed to the console, it looks like:

Mentholatum 曼秀雷敦 男士 深层活炭洁面乳100g(新包装)

So I presume the problem is not with the application's encoding but with the debugger's handling of it; however, I have not found any solution for this behaviour.

2) An example of a regex I need to handle: removing the values between Chinese brackets, including the brackets themselves. The code I used is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
from pprint import pprint
import sys, locale, os

def betweencase(valuestoremove, row, columnName):
    columnString = row[columnName]
    startFrom = valuestoremove["startsTo"]
    endWith = valuestoremove["endsAt"]
    isInclude = valuestoremove["include"]
    escapeCharsRegex = re.compile('([\.\^\$\*\+\?\(\)\[\{\|])')
    nonASCIIregex = re.compile('([^\x00-\x7F])')
    if escapeCharsRegex.match(startFrom):
        startFrom = re.escape(startFrom)
    if escapeCharsRegex.match(endWith):
        endWith = re.escape(endWith)

    if isInclude:
        regex = startFrom + '(.*)' + endWith
    else:
        regex = '(?<=' + startFrom + ').*?(?=' + endWith + ')'
    if nonASCIIregex.match(regex):
        p = re.compile(ur'' + regex)  # Python 2 raw-unicode literal
    else:
        p = re.compile(regex)
    row[columnName] = p.sub("", columnString).strip()

But the regex has no effect on the given string. I made a test with the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

reg = re.compile(ur'((.*))')
string = u"巴黎欧莱雅 男士 劲能冰爽洁面啫哩(原男士劲能净爽洁面啫哩)100ml"
print string
string = reg.sub("", string)
print string

And it works fine for me. The only difference between the two examples is that in the first one the regex values come from a txt file containing JSON, encoded as UTF-8:

{
    "between": {
        "startsTo": "(",
        "endsAt": ")",
        "include": true,
        "sequenceID": "1"
    }
}, {
    "between": {
        "startsTo": "(",
        "endsAt": ")",
        "include": true,
        "sequenceID": "2"
    }
}, {
    "between": {
        "startsTo": "(",
        "endsAt": ")",
        "include": true,
        "sequenceID": "2"
    }
}, {
    "between": {
        "startsTo": "(",
        "endsAt": ")",
        "include": true,
        "sequenceID": "2"
    }
}

The Chinese brackets from the file are also displayed as squares:

I can't find an explanation or any solution for this behaviour, so I'd appreciate the community's help.

Thanks for the help.

Answer 1:

The problem is that the text you're reading in isn't being decoded to Unicode correctly (this is one of the big gotchas that prompted sweeping changes in Python 3). Instead of:

data_file = myfile.read()

You need to tell it to decode the file:

data_file = myfile.read().decode("utf8")

Then continue with json.loads, etc., and it should work fine. Alternatively (in Python 2, json.load accepts the encoding as its second argument):

data = json.load(myfile, "utf8")


Answer 2:

After many searches and consultations, here is a solution that works for Chinese text (and for mixed- and single-language text as well):

import re
import codecs

def betweencase(valuestoremove, row, columnName):
    columnString = row[columnName]
    startFrom = valuestoremove["startsTo"]
    endWith = valuestoremove["endsAt"]
    isInclude = valuestoremove["include"]
    escapeCharsRegex = re.compile('([\.\^\$\*\+\?\(\)\[\{\|])')
    if escapeCharsRegex.match(startFrom):
        startFrom = re.escape(startFrom)
    if escapeCharsRegex.match(endWith):
        endWith = re.escape(endWith)
    if isInclude:
        regex = ur'' + startFrom + '(.*)' + endWith
    else:
        regex = ur'(?<=' + startFrom + ').*?(?=' + endWith + ')'

    # the key line: build the pattern as UTF-8 so it matches the DB value
    p = re.compile(codecs.encode(unicode(regex), "utf-8"))
    delimiter = ' '
    if localization == 'CN':
        delimiter = ''

    row[columnName] = p.sub(delimiter, columnString).strip()

As you can see, we encode the regex to UTF-8, so the pattern matches the PostgreSQL value, which is UTF-8 as well.
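The same technique ports to Python 3, where str is Unicode throughout, so the codecs round-trip disappears. A simplified sketch (the delimiter/localization handling is reduced to the CN case, and a non-greedy `.*?` is used so each bracketed group is removed independently; the original answer used a greedy `(.*)`):

```python
import re

def between_case(rule, text):
    # re.escape works for ASCII and fullwidth (Chinese) brackets alike,
    # because Python 3 strings are already Unicode.
    start = re.escape(rule["startsTo"])
    end = re.escape(rule["endsAt"])
    if rule["include"]:
        # remove the delimiters together with the text between them
        pattern = start + '.*?' + end
    else:
        # keep the delimiters, remove only the text between them
        pattern = '(?<=' + start + ').*?(?=' + end + ')'
    return re.sub(pattern, '', text).strip()

rule = {"startsTo": "(", "endsAt": ")", "include": True}
print(between_case(rule, "Mentholatum 曼秀雷敦 男士 深层活炭洁面乳100g(新包装)"))
# Mentholatum 曼秀雷敦 男士 深层活炭洁面乳100g
```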