The real issue may be more complicated, but for now I'm trying to accomplish something a bit easier. I'm trying to remove the space between two Chinese/Japanese characters while keeping the space between a number and a character. An example:
text = "今天特别 热,但是我买了 3 个西瓜。"
The output I want to get is
text = "今天特别热,但是我买了 3 个西瓜。"
I tried a Python script with a regular expression:
import re
text = re.sub(r'\s(?=[^A-z0-9])', '', text)
However, the result is
text = '今天特别热,但是我买了 3个西瓜。'
So I'm struggling with how to keep the space between a character and a number at all times. I also don't want to work around it by adding a space back between "3" and "个".
I'll continue to think about it, but let me know if you have ideas. Thank you so much in advance!
I understand the spaces you need to remove are the ones between letters.
Use
re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', text)
Details:
(?<=[^\W\d_]) - a positive lookbehind requiring a Unicode letter immediately to the left of the current location
\s+ - 1+ whitespaces (remove + if only one is expected)
(?=[^\W\d_]) - a positive lookahead that requires a Unicode letter immediately to the right of the current location.
You do not need the re.U flag since it is on by default in Python 3. You need it in Python 2 though.
You may also use capturing groups:
re.sub(r'([^\W\d_])\s+([^\W\d_])', r'\1\2', text)
where the non-consuming lookarounds are turned into consuming capturing groups ((...)). The \1 and \2 in the replacement pattern are backreferences to the capturing group values.
See a Python 3 online demo:
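A minimal sketch of both approaches side by side, assuming Python 3. The names `LETTER`, `strip_spaces_lookaround` and `strip_spaces_groups` are illustrative and not part of the answer above:

```python
import re

# [^\W\d_] = any word character that is not a digit or underscore,
# i.e. a Unicode letter (Python 3 str patterns are Unicode-aware by default).
LETTER = r"[^\W\d_]"


def strip_spaces_lookaround(text):
    """Remove whitespace that has a letter on both sides (lookarounds)."""
    return re.sub(rf"(?<={LETTER})\s+(?={LETTER})", "", text)


def strip_spaces_groups(text):
    """Same idea, but the neighbouring letters are captured and re-inserted."""
    return re.sub(rf"({LETTER})\s+({LETTER})", r"\1\2", text)


text = "今天特别 热,但是我买了 3 个西瓜。"
print(strip_spaces_lookaround(text))  # 今天特别热,但是我买了 3 个西瓜。
print(strip_spaces_groups(text))      # 今天特别热,但是我买了 3 个西瓜。

# The two variants differ when single letters alternate with spaces:
print(strip_spaces_lookaround("a b c"))  # abc
print(strip_spaces_groups("a b c"))      # ab c
```

The last two lines show one reason to prefer the lookaround version: capturing groups consume the letters they match, so a letter that ends one match cannot start the next. Note also that neither pattern is specific to Chinese or Japanese; spaces between Latin letters are removed as well.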