Python - How to remove spaces between Chinese characters

The real issue may be more complicated, but for now I'm trying to accomplish something a bit easier. I want to remove the space between two Chinese/Japanese characters while keeping the space between a number and a character. An example:

text = "今天特别 热，但是我买了 3 个西瓜。"

The output I want to get is

text = "今天特别热，但是我买了 3 个西瓜。"

I tried to use Python script and regular expression:

import re
text = re.sub(r'\s(?=[^A-z0-9])', '', text)

However, the result is

text = '今天特别热，但是我买了 3个西瓜。'

So how can I keep the space between a character and a number in every case? I don't want to solve this by re-adding a space between "3" and "个" afterwards.

I'll keep thinking about it, but let me know if you have ideas. Thank you so much in advance!

Tags: python regex space

1 Answer

来，给爷笑一个

Post #2 · 2019-05-11 12:09

As I understand it, the spaces you need to remove are the ones that sit between two letters.

Use

re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', text)

Details:

(?<=[^\W\d_]) - a positive lookbehind requiring a Unicode letter immediately to the left of the current location
\s+ - 1+ whitespaces (remove + if only one is expected)
(?=[^\W\d_]) - a positive lookahead that requires a Unicode letter immediately to the right of the current location.
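
The three pieces above can be wrapped into a small helper. This is only a minimal sketch: the names LETTER_GAP and remove_letter_gaps are made up for illustration. It also shows a side effect worth knowing about: [^\W\d_] matches any Unicode letter, so spaces between Latin words are removed as well.

```python
import re

# Whitespace run with a Unicode letter on both sides.
# [^\W\d_] = a word character that is neither a digit nor an underscore.
LETTER_GAP = re.compile(r'(?<=[^\W\d_])\s+(?=[^\W\d_])')

def remove_letter_gaps(text):
    """Delete whitespace that sits directly between two letters."""
    return LETTER_GAP.sub('', text)

print(remove_letter_gaps("今天特别 热，但是我买了 3 个西瓜。"))  # 今天特别热，但是我买了 3 个西瓜。
print(remove_letter_gaps("hello world"))  # helloworld
```

Compiling the pattern once is optional here; it mainly keeps the call site short if the helper is applied to many strings.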

You do not need the re.U flag in Python 3, where Unicode matching is on by default for str patterns. In Python 2 you do need it, and the pattern and the text should be unicode strings.
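
To see why Unicode-aware matching matters, it may help to switch it off on purpose. The sketch below (Python 3 only, variable names are illustrative) runs the same pattern with and without re.ASCII. Under re.ASCII, \W matches every non-ASCII character, so no Chinese character qualifies as a letter and nothing is removed.

```python
import re

text = "今天特别 热，但是我买了 3 个西瓜。"
pattern = r'(?<=[^\W\d_])\s+(?=[^\W\d_])'

# Default for str patterns in Python 3: Unicode-aware classes,
# so 别 and 热 both count as letters and the space between them goes away.
unicode_result = re.sub(pattern, '', text)

# With re.ASCII, [^\W\d_] only matches ASCII letters,
# so the lookarounds fail next to Chinese characters.
ascii_result = re.sub(pattern, '', text, flags=re.ASCII)

print(unicode_result)        # 今天特别热，但是我买了 3 个西瓜。
print(ascii_result == text)  # True
```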

You may also use capturing groups:

re.sub(r'([^\W\d_])\s+([^\W\d_])', r'\1\2', text)

where the non-consuming lookarounds are turned into consuming capturing groups ((...)). The \1 and \2 in the replacement pattern are backreferences to the capturing group values.
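
One practical difference between the two variants: because the capturing groups consume the letters, a letter cannot serve as the right side of one match and the left side of the next. A small sketch with single characters separated by spaces (the sample string is made up for illustration) shows the effect.

```python
import re

text = "我 是 人"

# Lookarounds do not consume the letters, so every gap is matched.
with_lookarounds = re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', text)

# The groups consume 我 and 是 in the first match, so the search resumes
# after 是 and the second gap has no letter left to its left.
with_groups = re.sub(r'([^\W\d_])\s+([^\W\d_])', r'\1\2', text)

print(with_lookarounds)  # 我是人
print(with_groups)       # 我是 人
```

If the input can contain isolated single characters like this, the lookaround version is the safer choice.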

A Python 3 demo:

import re
text = "今天特别 热，但是我买了 3 个西瓜。"
print(re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', text))
# => 今天特别热，但是我买了 3 个西瓜。
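
If the text can also contain Latin words whose spaces should stay, one possible variation is to restrict both lookarounds to CJK characters instead of any letter. This is a sketch under assumptions, not part of the answer above: the character class covers only CJK Unified Ideographs (U+4E00 to U+9FFF), Hiragana and Katakana (U+3040 to U+30FF), and the names CJK, CJK_GAP and remove_cjk_gaps are hypothetical. Rarer ideographs in the extension blocks are not covered.

```python
import re

# Assumed character class: CJK Unified Ideographs plus Hiragana and Katakana.
CJK = r'[\u4e00-\u9fff\u3040-\u30ff]'

# Whitespace run with a CJK character on both sides.
CJK_GAP = re.compile(r'(?<={0})\s+(?={0})'.format(CJK))

def remove_cjk_gaps(text):
    """Delete whitespace that sits directly between two CJK characters."""
    return CJK_GAP.sub('', text)

print(remove_cjk_gaps("今天特别 热，但是我买了 3 个西瓜。"))  # 今天特别热，但是我买了 3 个西瓜。
print(remove_cjk_gaps("hello world 你 好"))  # hello world 你好
```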
