The perldoc page for length() tells me that I should use bytes::length(EXPR) to find the length of a Unicode string in bytes, and the bytes page echoes this.
use bytes;
$ascii = 'Lorem ipsum dolor sit amet';
$unicode = 'Lørëm ípsüm dölör sît åmét';
print "ASCII: " . length($ascii) . "\n";
print "ASCII bytes: " . bytes::length($ascii) . "\n";
print "Unicode: " . length($unicode) . "\n";
print "Unicode bytes: " . bytes::length($unicode) . "\n";
The output of this script, however, disagrees with the manpage:
ASCII: 26
ASCII bytes: 26
Unicode: 35
Unicode bytes: 35
It seems to me length() and bytes::length() return the same for both ASCII & Unicode strings. I have my editor set to write files as UTF-8 by default, so I figure Perl is interpreting the whole script as Unicode—does that mean length() automatically handles Unicode strings properly?
Edit: See my comment; my question doesn't make a whole lot of sense, because length() is not working "properly" in the above example - it is showing the length of the Unicode string in bytes, not characters. The reason I originally stumbled across this is a program in which I need to set the Content-Length header (in bytes) in an HTTP message. I had read up on Unicode in Perl and was expecting to have to do some fanciness to make things work, but when length() returned exactly what I needed right off the bat, I was confused! See the accepted answer for an overview of use utf8, use bytes, and no bytes in Perl.
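For the Content-Length case specifically, here is a minimal sketch of one way to get the byte count without relying on the bytes pragma. It assumes the body is held as a character string and will be sent as UTF-8; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                        # this source file is saved as UTF-8
use Encode qw(encode);

# Hypothetical message body, held as a character string.
my $body = 'Lørëm ípsüm dölör sît åmét';

# Encode to the octets that would actually go over the wire.
my $octets = encode('UTF-8', $body);

my $chars          = length($body);      # counts characters
my $content_length = length($octets);    # counts bytes

print "Characters: $chars\n";                  # Characters: 26
print "Content-Length: $content_length\n";     # Content-Length: 35
```

The idea is that the header value comes from the encoded octets, so it stays correct regardless of how Perl happens to store the character string internally.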
I found that it is possible to use the Encode module to influence how length() works.
If $string is a UTF-8 encoded string:
Encode::_utf8_on($string); # the length function will show number of code points after this.
Encode::_utf8_off($string); # the length function will show number of bytes in the string after this.
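The Encode documentation marks _utf8_on and _utf8_off as internal, so a sketch using the documented decode and encode functions may be a safer way to get the same two counts. The sample string below is just an assumption for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "\xC3\xB8" is the two-byte UTF-8 encoding of U+00F8 (o with stroke).
my $bytes = "L\xC3\xB8rem";

my $byte_count = length($bytes);             # 6, counts octets

# decode() checks the input and returns a character string.
my $chars      = decode('UTF-8', $bytes);
my $char_count = length($chars);             # 5, counts code points

# encode() turns the characters back into octets.
my $round_trip = length(encode('UTF-8', $chars));   # 6 again

print "bytes: $byte_count, characters: $char_count, re-encoded: $round_trip\n";
```

Unlike flipping the flag by hand, decode() does not blindly trust that the bytes are well-formed UTF-8.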
If your scripts are encoded in UTF-8, then please use the utf8 pragma. The bytes pragma on the other hand will force byte semantics on length, even if the string is UTF-8. Both work in the current lexical scope.
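A rough sketch of how the two pragmas interact with lexical scope, assuming the file is saved as UTF-8. The variable names are hypothetical and this is not necessarily the listing the answer originally used.

```perl
use strict;
use warnings;

my ($with_utf8, $without_utf8);
{
    use utf8;    # literals inside this block are read as UTF-8 characters
    $with_utf8 = 'Lørëm ípsüm dölör sît åmét';
}
# Outside the block the same literal stays a string of raw bytes.
$without_utf8 = 'Lørëm ípsüm dölör sît åmét';

my $len_with_utf8    = length($with_utf8);       # 26 characters
my $len_without_utf8 = length($without_utf8);    # 35 bytes

my $len_under_bytes;
{
    use bytes;   # byte semantics, only inside this block
    $len_under_bytes = length($with_utf8);       # 35 bytes
}

print "use utf8 literal:           $len_with_utf8\n";
print "plain literal:              $len_without_utf8\n";
print "use utf8 literal, as bytes: $len_under_bytes\n";
```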
This outputs:
The purpose of the bytes pragma is to replace the length function (and several other string related functions) in the current scope. So every call to length in your program is a call to the length that bytes provides. This is more in line with what you were trying to do:

Another subtle flaw in your reasoning is that there is such a thing as Unicode bytes. Unicode is an enumeration of characters. It says, for instance, that U+24d5 is ⓕ (CIRCLED LATIN SMALL LETTER F). What Unicode does not specify is how many bytes a character takes up. That is left to the encodings. UTF-8 says it takes up 3 bytes, UTF-16 says it takes up 2 bytes, UTF-32 says it takes 4 bytes, etc. Here is a comparison of Unicode encodings. Perl uses UTF-8 for its strings by default. UTF-8 has the benefit of being identical in every way to ASCII for the first 128 characters.