I decided to post a question after spending quite some time on this and still not figuring out the problem. I have also read a bunch of seemingly related posts, but none really fits my simple (?) problem.
So I have a possibly large text file (>1000 lines) that contains Mandarin Chinese chars, with a sample line like:
"ref#2-5-1.jpg#2#一些 <variable> 内容#pic##" (the Chinese just means "some content").
All that needs to be modified is that a space should be inserted between adjacent Chinese characters, if there is not one already:
"ref#2-5-1.jpg#2#一 些 <variable> 内 容#pic##".
I started naively with straightforward stuff like the following, but there is no match at all:
sed -e 's/\([\u4E00-\u9fff]\)/\1 /g' <test_utf_sed.txt > test_out.txt
where 4E00-9fff is supposed to be the code point range for Mandarin Chinese. Unsurprisingly, this did not work, so I also wanted to try
sed -e 's/\([一-龻]\)/hello/g' <test_utf_sed.txt > test_out.txt
This failed because my bash cannot display (?) the "一" character.
Then I did some basic tests, which failed as well:
sed -e 's/\(\u4E00\)/hello/g' <test_utf_sed.txt > test_out.txt //一
sed -e 's/\(\u4E9B\)/hello/g' <test_utf_sed.txt > test_out.txt //些
The same happened with another notation for UTF encoding (found here on Stack Overflow):
sed -e 's/\(\u'U+4E00\)/hello/g' <test_utf_sed.txt > test_out.txt
1) As a tool for dealing with multi-byte characters, is sed the right choice at all?
2) Is sed able to handle unicode at all, or do I need a special switch?
3) I am not looking for a workaround solution like this:
step 1: insert a space after each character
// like 's/\(.\)/\1 /g'
step 2: remove the space after each character that is not a Chinese character
// like 's/\([a-zA-Z0-9]\) /\1/g'
I know how to do this (see the sketch below), but it is inelegant and error-prone. It must be possible using UTF-8 in a regex in sed.
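Just to be concrete, those two steps would look roughly like this (untested; it assumes sed counts characters rather than bytes in a UTF-8 locale, existing spaces get doubled, and characters like '#' or '<' would also need to be added to the second class):

sed -e 's/\(.\)/\1 /g' -e 's/\([a-zA-Z0-9]\) /\1/g' <test_utf_sed.txt > test_out.txt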
4) My environment is bash-3.2 on Mac OS X 10.6.8 (an oldish OS).
5) If you know of any pointers to open regex one-liners or libraries for dealing with Chinese text or language processing, it would be great if you could share them.
Thanks a lot in advance, your help is much appreciated!
sed doesn't understand \u escape sequences (apparently). I don't know if bash-3.2 does either, but I think it does; if so, you could write individual characters that way, but you still wouldn't be able to do the range specification.
However, by translating to UTF-8 by hand, you could arrive at the following extended regular expression which will, I believe, match any UTF-8 sequence for a character in the range U+4E00...U+9FFF:
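Spelled out with \xNN hex escapes, something along these lines should cover it (one alternative for lead byte E4, one for lead bytes E5 through E9):

(\xe4[\xb8-\xbf][\x80-\xbf]|[\xe5-\xe9][\x80-\xbf][\x80-\xbf])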
(But the character ranges will only work if you invoke sed in a single-byte locale, preferably the C locale.)

With GNU sed, you get extended regular expressions if you provide the -r flag. With MacOSX I believe you need the -E flag. So you could try:
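A sketch of how that might look, using the pattern above and forcing the C locale for the byte ranges, as noted:

LC_ALL=C sed -E $'s/(\xe4[\xb8-\xbf][\x80-\xbf]|[\xe5-\xe9][\x80-\xbf][\x80-\xbf])/\\1 /g' <test_utf_sed.txt > test_out.txt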
(The above lets bash handle the \x escapes. If you leave out the $, then sed will handle the \x escapes, but you'll have to change the substitution from \\1 to \1. I don't have a Mac, nor do I have the old version of bash, so I really don't know whether your sed does hex escapes or not; I'm pretty sure that your bash will, but I can't guarantee it.)

By the way, it's not that difficult to get the UTF-8 encodings for those characters; I did it with a little copy-and-paste from the original post. E.g.:
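Something like this (the od output is shown as a comment):

echo 一些 | od -An -tx1
# e4 b8 80 e4 ba 9b 0a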
It helps to know that the entire range of plane 0 ideographs (U+4E00...U+9FFF) has three-byte codes, so that 一 is E4 B8 80 and 些 is E4 BA 9B. (The 0A is, of course, a line-end.)

Perl has pretty good support for dealing with Unicode. That might be a better bet for your task than sed. This one-liner works like your first sed example:
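Something along these lines, using a named Unicode property to pick out the Han characters (a sketch; the file names are the ones from your question):

perl -CIOED -pe 's/(\p{Han})/$1 /g' test_utf_sed.txt > test_out.txt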
The -CIOED tells perl to do its I/O in utf8. -p runs the given code once for each line of the input file, then prints the result. -e specifies a line of Perl code to run. See the documentation on command-line arguments for more.

The regular expression uses named ranges to identify the characters to match.
You might also want to read the Perl Unicode documentation.