How can I handle utf8 using Perl (or Python) on the command line?
I am trying to split the characters in each word, for example. This is very easy for non-utf8 text, for example:
$ echo "abc def" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
a b c d e f
But with utf8 it doesn't work, of course:
$ echo "одобрение за" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
<D0> <BE> <D0> <B4> <D0> <BE> <D0> <B1> <D1> <80> <D0> <B5> <D0> <BD> <D0> <B8> <D0> <B5> <D0> <B7> <D0> <B0>
because it doesn't know about the 2-byte characters.
It would also be good to know how this (i.e., command-line processing of utf8) is done in Python.
"Hey", I thought, "how difficult could this be in Perl?"
Turns out it's pretty easy. Unfortunately, finding out how took me longer than I thought.
A quick glance at use utf8 showed me that it isn't what I wanted: it only declares that the script's own source code is UTF-8. Perl's binmode looked promising, but not quite.
I found there's a perluniintro, which led me to perlunicode, which said I should look at perlrun. Then I found what I was looking for.
Perl has a command line switch, -C, which switches Perl to Unicode. However, the -C command line switch also takes a few options: you need to specify what is in Unicode. There's a convenient chart in perlrun that shows you the various options. It would appear that perl -C by itself would be fine. That is equivalent to -CSDL or -C95. However, the L means that if your locale isn't set to Unicode, Perl won't work in Unicode.
Instead, you should use perl -CSD or perl -C31.
Yup, that works.
You can learn quite a bit just answering a question.
Or, if you want Unicode code points:
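In Python 3, a minimal sketch of listing code points could look like this. The helper name code_points is made up for illustration, and the U+XXXX formatting is just one common convention:

```python
def code_points(text):
    """Return the code points of text as space-separated U+XXXX labels."""
    # ord() gives the integer code point of a single character.
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print(code_points("за"))  # U+0437 U+0430
```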
I don't know Perl, so I'm answering for Python.
Python doesn't know that the input text is in Unicode. You need to explicitly decode from UTF-8 or whatever it actually is, into Unicode. Then you can use normal Python text processing stuff to process it.
http://docs.python.org/howto/unicode.html
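To make the decoding step concrete, here is a minimal Python 3 sketch (in Python 2.x the same bytes would be a plain str and the decoded result a unicode object). The byte string is the first two letters of the example word, written out by hand, and the variable names are illustrative:

```python
# The first two letters of the example word, as raw UTF-8 bytes.
raw = b"\xd0\xbe\xd0\xb4"

# Decoding turns the bytes into a text (Unicode) string.
text = raw.decode("utf-8")

print(len(raw))   # 4 (bytes)
print(len(text))  # 2 (characters)

# Show the code point of each decoded character.
print([hex(ord(ch)) for ch in text])
```

Once the text is decoded, iterating over it yields whole characters rather than single bytes.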
Here's a simple Python 2.x program for you to try:
This copies lines from the standard input, and converts each line to Unicode. The encoding is specified as UTF-8. Then for ch in u_line sets ch to each character. Then print ch, (with the trailing comma) is the easy way in Python 2.x to print a character, followed by a space, with no newline. Finally a bare print adds a newline.
I still use Python 2.x for most of my work, but for Unicode I would recommend you use Python 3.x. The Unicode stuff is really improved.
Here is the Python 3 version of the above program, tested on my Linux computer.
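A minimal sketch of a program along these lines, assuming Python 3. The function name print_spaced and the in-memory sample are made up for illustration; on the command line you would pass sys.stdin instead of the sample:

```python
import io


def print_spaced(lines, out=None):
    """Print every character of every line followed by a space."""
    for line in lines:
        for ch in line:
            # end=" " replaces the default newline after each character.
            # The line's own trailing "\n" is printed like any other character.
            print(ch, end=" ", file=out)


# Stand-in for sys.stdin so the sketch runs without piped input.
sample = io.StringIO("abc def\n")
print_spaced(sample)
```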
By default, Python 3 picks the input encoding from the environment (typically the locale); on most modern Linux systems that is UTF-8. Python then decodes the input into Unicode. Python 3 strings are always Unicode; there is a special type, bytes, used for a string-like object that contains non-Unicode values ("bytes"). This is the opposite of Python 2.x; in Python 2.x, the basic string type was a string of bytes, and a Unicode string was a special new thing.
Of course it isn't necessary to assert that the encoding is UTF-8, but it's a nice simple way to document our intentions and make sure that the default didn't get changed somehow.
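If you would rather force the encoding than assert it, one option is to wrap the underlying byte stream yourself. This is only a sketch: io.BytesIO stands in for sys.stdin.buffer so that the example runs without piped input, and the variable names are illustrative:

```python
import io

# Stand-in for sys.stdin.buffer: the raw UTF-8 bytes of a short two-letter line.
raw_stream = io.BytesIO(b"\xd0\xb7\xd0\xb0\n")

# Wrap the byte stream and state the encoding explicitly,
# instead of relying on whatever default the environment provides.
text_stream = io.TextIOWrapper(raw_stream, encoding="utf-8")

for line in text_stream:
    chars = list(line.rstrip("\n"))
    print(len(chars), "characters")
```

With real input, the same wrapper would be built as io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8").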
In Python 3, print() is now a function. And instead of that somewhat strange syntax of appending a comma after a print statement to make it print a space instead of a newline, there is now a named keyword argument that lets you change the end character.
NOTE: Originally I had a bare print statement after handling the input line in the Python 2.x program, and print() in the Python 3.x program. As J.F. Sebastian pointed out, the code is printing characters from the input line, and the last character will be a newline, so there really isn't a need for the additional print statement.
The "-C" flag controls some of the Perl Unicode features (see perldoc perlrun):
To specify the encoding used for stdin/stdout, you could use the PYTHONIOENCODING environment variable:
If you'd like to split the text on character (grapheme) boundaries (not on code points, as the code above does), then you could use the /\X/ regular expression:
See Grapheme Cluster Boundaries
In Python, \X is supported by the third-party regex module.
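Since regex is not part of the standard library, here is a rough standard-library-only sketch of the idea. It only attaches combining marks to the preceding character, which is a simplification of the real grapheme cluster rules (\X handles more cases); the function name split_clusters is made up for illustration:

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# U+0438 plus U+0306 COMBINING BREVE (one visible letter), followed by "a".
word = "\u0438\u0306a"
print(len(word))                  # 3 code points
print(len(split_clusters(word)))  # 2 user-perceived characters
```

Splitting on code points would separate the breve from its base letter; grouping by cluster keeps them together.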