The real issue may be more complicated, but for now I'm trying to accomplish something a bit easier. I'm trying to remove the space between two Chinese/Japanese characters while keeping the space between a number and a character. An example is below:
text = "今天特别 热,但是我买了 3 个西瓜。"
The output I want to get is
text = "今天特别热,但是我买了 3 个西瓜。"
I tried a Python script with a regular expression:
import re
text = re.sub(r'\s(?=[^A-z0-9])', '', text)
However, the result is
text = '今天特别热,但是我买了 3个西瓜。'
So I'm struggling with how to keep the space between a character and a number at all times. I also don't want to solve this by adding a space back between "3" and "个".
I'll continue to think about it, but let me know if you have ideas. Thank you so much in advance!
As I understand it, the spaces you need to remove are the ones between letters.
Use
re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', text)
Details:
(?<=[^\W\d_]) - a positive lookbehind requiring a Unicode letter immediately to the left of the current location
\s+ - 1+ whitespaces (remove + if only one is expected)
(?=[^\W\d_]) - a positive lookahead that requires a Unicode letter immediately to the right of the current location.
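To check which characters the [^\W\d_] class accepts, a small experiment like the one below may help. The sample string is made up for illustration and is not part of the question.

```python
import re

# Hypothetical sample: ASCII letter, digit, underscore, CJK character, space, CJK full stop.
sample = "a1_个 。"

# [^\W\d_] reads as "a word character that is neither a digit nor an underscore",
# which is, roughly, a letter.
letters = re.findall(r'[^\W\d_]', sample)
print(letters)  # ['a', '个']
```

Only the two letters survive: the digit and underscore are excluded explicitly, and the space and the full stop are not word characters at all.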
You do not need the re.U flag since it is on by default in Python 3. You need it in Python 2 though.
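To illustrate why Unicode matching matters here, the following sketch runs the same pattern twice in Python 3: once with the default behaviour and once with re.ASCII, which restricts \w, \W, \d and \s to ASCII. The variable names are only for illustration.

```python
import re

text = "今天特别 热,但是我买了 3 个西瓜。"
pattern = r'(?<=[^\W\d_])\s+(?=[^\W\d_])'

# Default for str patterns in Python 3: Unicode-aware \w, so CJK characters count as letters.
unicode_result = re.sub(pattern, '', text)

# With re.ASCII, CJK characters are treated as non-word characters, so nothing matches.
ascii_result = re.sub(pattern, '', text, flags=re.ASCII)

print(unicode_result)
print(ascii_result)
```

With re.ASCII the text should come back unchanged, which is essentially what happens in Python 2 when re.U is omitted.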
You may also use capturing groups:
re.sub(r'([^\W\d_])\s+([^\W\d_])', r'\1\2', text)
where the non-consuming lookarounds are turned into consuming capturing groups ((...)). The \1 and \2 in the replacement pattern are backreferences to the capturing group values.
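One difference between the two variants is worth keeping in mind. Because capturing groups consume the letters they match, a letter cannot be the right-hand side of one match and the left-hand side of the next. The sketch below uses a made-up input of single letters separated by spaces to show the effect.

```python
import re

sample = "我 是 人"  # hypothetical input: single letters separated by spaces

# Lookarounds do not consume the letters, so every space between letters is found.
with_lookarounds = re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', sample)

# Capturing groups consume "我 是", so the following " 人" has no letter left before it.
with_groups = re.sub(r'([^\W\d_])\s+([^\W\d_])', r'\1\2', sample)

print(with_lookarounds)  # 我是人
print(with_groups)       # 我是 人
```

If the input can contain runs like this, the lookaround version is probably the safer choice.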
See a Python 3 online demo:
import re
text = "今天特别 热,但是我买了 3 个西瓜。"
print(re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', text))
# => 今天特别热,但是我买了 3 个西瓜。