How to handle utf8 on the command line (using Perl or Python)

Published 2019-04-04 03:27

Question:

How can I handle utf8 using Perl (or Python) on the command line?

I am trying to split the characters in each word, for example. This is very easy for non-utf8 text, for example:

$ echo "abc def" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
a b c   d e f

But with utf8 it doesn't work, of course:

$ echo "одобрение за" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
<D0> <BE> <D0> <B4> <D0> <BE> <D0> <B1> <D1> <80> <D0> <B5> <D0> <BD> <D0> <B8> <D0> <B5>   <D0> <B7> <D0> <B0>

because it doesn't know about the 2-byte characters.

It would also be good to know how this (i.e., command-line processing of utf8) is done in Python.

Answer 1:

The -C flag controls some of Perl's Unicode features (see perldoc perlrun):

$ echo "одобрение за" | perl -C -pe 's/.\K/ /g'
о д о б р е н и е   з а 

To specify the encoding used for stdin/stdout in Python 2, you can use the PYTHONIOENCODING environment variable:

$ echo "одобрение за" | PYTHONIOENCODING=utf-8 python -c'import sys
for line in sys.stdin:
    print " ".join(line.decode(sys.stdin.encoding)),
'
о д о б р е н и е   з а 

If you'd like to split the text on grapheme cluster boundaries (user-perceived characters) rather than on code points, as the code above does, then you can use the /\X/ regular expression:

$ echo "одобрение за" | perl -C -pe 's/\X\K/ /g'
о д о б р е н и е   з а 

See Grapheme Cluster Boundaries in UAX #29: Unicode Text Segmentation.

In Python, \X is supported by the third-party regex module.
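If installing regex isn't an option, the simple combining-mark case can be approximated with the standard library's unicodedata. This is a rough sketch only, not full UAX #29 segmentation, and the graphemes helper is a name of my own:

```python
import unicodedata

def graphemes(text):
    """Rough grapheme split: attach combining marks to the preceding
    base character. Handles the common accent case only -- it is NOT
    full UAX #29 grapheme cluster segmentation (no ZWJ sequences,
    Hangul jamo, etc.)."""
    clusters = []
    for ch in text:
        # combining() is nonzero for combining marks such as U+0301
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "e" + combining acute accent stays together as one cluster:
print(" ".join(graphemes("e\u0301tude")))
```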



Answer 2:

"Hey", I thought, "how difficult could this be in Perl?"

Turns out it's pretty easy. Unfortunately, finding out how took me longer than I thought.

A quick glance at use utf8 showed me that it only declares that the script's own source is written in UTF-8; it doesn't affect the input data. Perl's binmode looked promising, but not quite right either.

I found perluniintro, which led me to perlunicode, which said I should look at perlrun. There, I found what I was looking for.

Perl has a command line switch, -C, that turns on its Unicode features. The switch takes options telling Perl exactly what should be treated as Unicode, and there's a convenient chart in perldoc perlrun showing them. It would appear that perl -C by itself would be fine: it's equivalent to -CSDL. However, the L option makes the Unicode behavior conditional on your locale, which means that if your locale isn't set to Unicode, Perl won't work in Unicode.

Instead, you should use perl -CSD (numerically, perl -C31).

$ echo "одобрение за" | perl -CSD -ne 'my @letters = m/(.)/g; print "@letters\n"'
о д о б р е н и е   з а

Yup, that works.

You can learn quite a bit just answering a question.



Answer 3:

$ echo "одобрение за" | python -c 'import sys, codecs; x = codecs.getreader("utf-8")(sys.stdin); print u", ".join(x.read().strip())'
о, д, о, б, р, е, н, и, е,  , з, а

or, if you want the Unicode code points:

$ echo "одобрение за" | python -c 'import sys, codecs; x = codecs.getreader("utf-8")(sys.stdin); print u", ".join("<%04x>" % ord(ch) for ch in x.read().strip())'
<043e>, <0434>, <043e>, <0431>, <0440>, <0435>, <043d>, <0438>, <0435>, <0020>, <0437>, <0430>


Answer 4:

I don't know Perl, so I'm answering for Python.

Python 2 doesn't know that the input text is Unicode. You need to explicitly decode it from UTF-8, or whatever the encoding actually is, into a unicode string. Then you can use Python's normal text processing on it.

http://docs.python.org/howto/unicode.html

Here's a simple Python 2.x program for you to try:

import sys

for line in sys.stdin:
    u_line = unicode(line, encoding="utf-8")
    for ch in u_line:
        print ch, # print each character with a space after

This copies lines from the standard input and converts each line to Unicode, with the encoding specified as UTF-8. Then for ch in u_line sets ch to each character, and print ch, is the Python 2.x way to print a character followed by a space, with no newline. The newline at the end of each input line supplies the line break.

I still use Python 2.x for most of my work, but for Unicode I would recommend you use Python 3.x. The Unicode stuff is really improved.

Here is the Python 3 version of the above program, tested on my Linux computer.

import sys

assert(sys.stdin.encoding == 'UTF-8')
for line in sys.stdin:
    for ch in line:
        print(ch, end=' ') # print each character with a space after

By default, Python 3 decodes standard input using the locale's encoding, which on most modern systems is UTF-8. Python 3 strings are always Unicode; there is a separate type, bytes(), for a string-like object that contains raw, non-Unicode values ("bytes"). This is the opposite of Python 2.x: there, the basic string type was a string of bytes, and a Unicode string was a special new thing.
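The split between the two types is easy to see with the sample text; a small sketch:

```python
# The same two letters, as raw UTF-8 bytes and as decoded text (Python 3).
raw = "за".encode("utf-8")   # bytes: each Cyrillic letter is 2 bytes in UTF-8
text = raw.decode("utf-8")   # str: two Unicode code points

print(len(raw), len(text))   # byte count and character count differ: 4 vs 2
```

This is exactly the mismatch that made the Perl one-liner in the question print <D0> <BE> pairs: it was iterating bytes, not characters.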

Of course it isn't necessary to assert that the encoding is UTF-8, but it's a nice simple way to document our intentions and make sure that the default didn't get changed somehow.

In Python 3, print() is a function. And instead of the somewhat strange Python 2 syntax of appending a comma to a print statement to suppress the newline, there is now an end keyword argument that lets you choose what is printed after the value.

NOTE: Originally I had a bare print statement after handling the input line in the Python 2.x program, and print() in the Python 3.x program. As J.F. Sebastian pointed out, the code is printing characters from the input line, and the last character will be a newline, so there really isn't a need for the additional print statement.