Python - How to remove spaces between Chinese characters

Posted 2019-05-11 11:07

The real issue may be more complicated, but for now I'm trying to accomplish something a bit easier. I'm trying to remove the space between two Chinese/Japanese characters while keeping the space between a number and a character. An example:

text = "今天特别 热,但是我买了 3 个西瓜。"

The output I want to get is

text = "今天特别热,但是我买了 3 个西瓜。"

I tried to use Python script and regular expression:

import re
text = re.sub(r'\s(?=[^A-z0-9])', '', text)

However, the result is

text = '今天特别热,但是我买了 3个西瓜。'

So how can I always keep the space between a character and a number? I don't want to solve this by adding a space back between "3" and "个" afterwards.

I'll keep thinking about it, but let me know if you have ideas. Thank you so much in advance!

1 answer
来,给爷笑一个
#2 · 2019-05-11 12:09

As I understand it, the spaces you need to remove are the ones between letters.

Use

re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', text)

Details:

  • (?<=[^\W\d_]) - a positive lookbehind requiring a Unicode letter immediately to the left of the current location
  • \s+ - one or more whitespace characters (remove + if only one is expected)
  • (?=[^\W\d_]) - a positive lookahead that requires a Unicode letter immediately to the right of the current location.

You do not need the re.U flag in Python 3, since Unicode matching is on by default for str patterns. You do need it in Python 2.
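To see what the [^\W\d_] class actually accepts, here is a small sketch that tests a few single characters. The helper name is_letter is made up for illustration, and re.ASCII is used only to show what the class loses when Unicode matching is switched off; it is not meant as an exact reproduction of Python 2 behaviour.

```python
import re

# Unicode-aware by default for str patterns in Python 3.
LETTER = re.compile(r'[^\W\d_]')
# Same class with Unicode matching disabled, for comparison only.
ASCII_LETTER = re.compile(r'[^\W\d_]', re.ASCII)

def is_letter(ch, pattern=LETTER):
    """Hypothetical helper: True if the single character ch matches the class."""
    return pattern.fullmatch(ch) is not None

for ch in ['热', 'a', '3', '_', ' ', '。']:
    # Columns: character, Unicode-mode result, ASCII-mode result
    print(repr(ch), is_letter(ch), is_letter(ch, ASCII_LETTER))
```

Only '热' and 'a' count as letters in the default mode; with re.ASCII, '热' falls into \W and is no longer treated as a letter, which is why the lookarounds would stop working on Chinese text without Unicode matching.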

You may also use capturing groups:

re.sub(r'([^\W\d_])\s+([^\W\d_])', r'\1\2', text)

where the non-consuming lookarounds are turned into consuming capturing groups ((...)). The \1 and \2 in the replacement pattern are backreferences to the capturing group values.
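One practical difference between the two variants is worth checking on your own data: because the capturing groups consume the letters, a letter that closes one match cannot also open the next one. The sample string below is made up for illustration.

```python
import re

text = "我 是 谁"  # single letters separated by spaces

# Lookarounds do not consume the letters, so every gap is matched.
lookaround = re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', text)

# Capturing groups consume "我 是"; the search then resumes at " 谁",
# where no letter is left to the left of the space inside the match.
groups = re.sub(r'([^\W\d_])\s+([^\W\d_])', r'\1\2', text)

print(lookaround)  # 我是谁
print(groups)      # 我是 谁
```

So if isolated single characters can be separated by spaces, the lookaround version is probably the safer choice.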

A Python 3 demo:

import re
text = "今天特别 热,但是我买了 3 个西瓜。"
print(re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', text))
# => 今天特别热,但是我买了 3 个西瓜。
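Note that [^\W\d_] matches any Unicode letter, so the pattern above also joins Latin words ("hello world" becomes "helloworld"). If the text can mix in English, one possible variation, not part of the answer above, is to restrict the lookarounds to CJK ranges. The ranges chosen here (CJK Unified Ideographs, Hiragana, Katakana) and the name squeeze_cjk_spaces are assumptions for this sketch; rarer characters outside these blocks are not covered.

```python
import re

# Assumed ranges: U+4E00-U+9FFF (CJK Unified Ideographs),
# U+3040-U+30FF (Hiragana and Katakana). Extend as needed.
CJK = r'\u4e00-\u9fff\u3040-\u30ff'
CJK_GAP = re.compile(rf'(?<=[{CJK}])\s+(?=[{CJK}])')

def squeeze_cjk_spaces(text):
    """Hypothetical helper: drop whitespace only when both neighbours are CJK."""
    return CJK_GAP.sub('', text)

print(squeeze_cjk_spaces("今天特别 热,但是我买了 3 个西瓜。"))
# => 今天特别热,但是我买了 3 个西瓜。
print(squeeze_cjk_spaces("hello world 你 好"))
# => hello world 你好
```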