A string may be this:
ipath= "./data/NCDC/上海/虹桥/9705626661750dat.txt"
or this:
ipath = './data/NCDC/ciampino/6240476818161dat.txt'
How do I know that the first string contains Chinese?
I found this answer, which may be helpful: Find all Chinese text in a string using Python and Regex
but it didn't work out:
import re
ipath= "./data/NCDC/上海/虹桥/9705626661750dat.txt"
re.findall(ur'[\u4e00-\u9fff]+', ipath) # => []
Output:
./data/NCDC/上海/虹桥/9705626661750dat.txt [u'\u4e0a\u6d77', u'\u8679\u6865']
You need to decode the input to make it Unicode, or pass a Unicode literal in the first place.
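A minimal sketch of the decoding step, written for Python 3 (where bytes and text are separate types). The variable names `raw`, `text`, and `chinese_runs` are illustrative, and the byte string is simulated with `.encode()` since the original input source is not shown:

```python
import re

# Simulate a UTF-8 encoded byte string, e.g. one read from a file or a pipe.
raw = "./data/NCDC/上海/虹桥/9705626661750dat.txt".encode("utf-8")

# Decode to text first: the \u4e00-\u9fff range is defined on code points,
# so it cannot match against undecoded UTF-8 bytes.
text = raw.decode("utf-8")

chinese_runs = re.findall("[\u4e00-\u9fff]+", text)
print(chinese_runs)
```

On Python 2 the same idea would be `ipath.decode('utf-8')` on the byte string before matching.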
According to this question, the range should be
[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]
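To see what the wider range buys, here is a small sketch (Python 3) comparing it with the narrower `\u4e00-\u9fff` class. The names `NARROW`, `WIDE`, and `sample` are made up for illustration, and U+3400 is picked only because it lies in CJK Extension A, outside the narrow range:

```python
import re

# The range used in the question.
NARROW = re.compile("[\u4e00-\u9fff]")
# The wider set of ranges quoted above.
WIDE = re.compile("[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]")

sample = "\u3400"  # first code point of CJK Unified Ideographs Extension A

narrow_hit = bool(NARROW.search(sample))  # outside U+4E00..U+9FFF
wide_hit = bool(WIDE.search(sample))      # inside U+3400..U+4DBF
print(narrow_hit, wide_hit)
```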
If you just want to know whether there is a Chinese character in your string, you don't need re.findall; use re.search and the fact that match objects are truthy. For those of us who don't care for re, the same test can be done without it.
Edit: for the full list of Chinese characters, this SO link is worth looking at, as the range U+4E00..U+9FFF is not complete: What's the complete range for Chinese characters in Unicode?
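Both variants mentioned in this answer might look roughly like the following in Python 3. The function names `contains_chinese` and `contains_chinese_no_re` are hypothetical, and the sketch uses only the basic U+4E00..U+9FFF range, which, as noted, is not complete:

```python
import re

CHINESE = re.compile("[\u4e00-\u9fff]")

def contains_chinese(text):
    # re.search returns a match object (truthy) or None (falsy).
    return bool(CHINESE.search(text))

def contains_chinese_no_re(text):
    # Same check without re: compare each character to the range bounds.
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

for path in ("./data/NCDC/上海/虹桥/9705626661750dat.txt",
             "./data/NCDC/ciampino/6240476818161dat.txt"):
    print(path, contains_chinese(path), contains_chinese_no_re(path))
```

Both stop at the first hit, so neither builds a list of all matches the way re.findall does.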
'' is a bytestring on Python 2. Either add from __future__ import unicode_literals at the top of the module or use unicode literals: u''
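A sketch of the unicode-literal variant. The `u''` prefix is accepted by Python 2 and by Python 3.3+, so this should run on either; the variable name `found` is illustrative, and the coding line is only needed on Python 2:

```python
# -*- coding: utf-8 -*-
# Alternative on Python 2: put `from __future__ import unicode_literals`
# as the first statement of the module and drop the u prefixes.
import re

ipath = u"./data/NCDC/上海/虹桥/9705626661750dat.txt"

# Both the pattern and the subject are text (unicode), so the range matches.
found = re.findall(u"[\u4e00-\u9fff]+", ipath)
print(found)
```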
Using these codepoint ranges, we can write an is_cjk function. We can then use it to process text character by character with functions like filter, any, all, and map, or to compose more complex functions. Note that the CJK ranges include not only Chinese characters but may also include Korean and Japanese characters. For more complex usage, try a dedicated library like cjklib.
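As a rough illustration of the is_cjk idea in Python 3: the `CJK_RANGES` table below is an assumption, an illustrative subset of Unicode blocks rather than the answer's own list or an exhaustive one, and the names `has_cjk`, `all_cjk`, and `cjk_only` are made up for the example.

```python
# Illustrative subset of CJK-related Unicode blocks (not exhaustive).
CJK_RANGES = [
    (0x2E80, 0x2EFF),    # CJK Radicals Supplement
    (0x3040, 0x309F),    # Hiragana (Japanese)
    (0x30A0, 0x30FF),    # Katakana (Japanese)
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xAC00, 0xD7AF),    # Hangul Syllables (Korean)
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]

def is_cjk(char):
    """Return True if the single character falls in one of CJK_RANGES."""
    code = ord(char)
    return any(lo <= code <= hi for lo, hi in CJK_RANGES)

path = "./data/NCDC/上海/虹桥/9705626661750dat.txt"

has_cjk = any(map(is_cjk, path))         # at least one CJK character?
all_cjk = all(map(is_cjk, path))         # made up only of CJK characters?
cjk_only = "".join(filter(is_cjk, path)) # keep just the CJK characters
print(has_cjk, all_cjk, cjk_only)
```

Because the table includes Hiragana, Katakana, and Hangul, a True result means "CJK", not specifically "Chinese"; trimming the table to the ideograph blocks narrows it.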