How could I catch the "Unicode non-character 0xffff is illegal for interchange"-warning?
#!/usr/bin/env perl
use warnings;
use 5.012;
use Try::Tiny;
use warnings FATAL => qw(all);
my $character;
try {
    $character = "\x{ffff}";
} catch {
    die "---------- caught error ----------\n";
};
say "something";
Output:
# Unicode non-character 0xffff is illegal for interchange at ./perl1.pl line 11.
It's a compile-time error, similar to forgetting to close a regex. If you delay the compilation of that piece to runtime, you can catch it:
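A minimal sketch of that idea, assuming warnings are made fatal inside the evaluated string. The layout and messages are illustrative, and whether the assignment warns at all depends on your perl version:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $character;

# The q{...} makes this a *string* eval: the code inside is compiled at
# run time, so a fatal compile-time warning ends up in $@ instead of
# aborting the whole script during compilation.
eval q{
    use warnings FATAL => qw(all);
    $character = "\x{ffff}";
    1;
};
if ($@) {
    print "---------- caught error ----------\n$@";
}
else {
    # Perls that do not warn on the assignment end up here.
    printf "no error; got U+%04X\n", ord $character;
}
```

The string eval can still see the lexical $character declared outside it, which is why the assignment is visible after the eval returns.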
Output:
If you remove the q after eval, you'll get the same behavior as your script does now, since eval {...}; if($@) {...} is the same as try {...} catch {...}; but with the q it's an eval of a string, which is totally different.
UPDATE:
As Tom points out, you should probably just disable that warning with no warnings qw(utf8) in a narrow scope around the spot you're setting or getting those kinds of values. You may still want to catch utf8 warnings as errors on output (or anything else that sends the data outside your program):
Output:
A Perl 5.10.0 ⋯ 5.13.8 Bug
I’m going to assume that you don’t actually want to “catch” this warning, but rather to survive or ignore it. If you really want to catch it, well, there may be easier ways to do that.
But the first thing to know is that there is no such thing as an illegal code point, only code points not valid for interchange.
You just have to use a no warnings "utf8" for the scope of where you need to use the full Unicode range (or more). There is no need to use an eval for this. All it takes is a scoped warning suppression. Even that is unnecessary on newer perls.
So instead of this:
write (on older perls):
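A small sketch of the scoped suppression, with the bare form shown only as a comment. The variable name is borrowed from the question; the rest is illustrative:

```perl
use strict;
use warnings;

my $character;

# Instead of a bare assignment such as
#     $character = "\x{ffff}";
# wrap it in a block that switches off only the utf8 warning category:
{
    no warnings "utf8";
    $character = "\x{ffff}";
}

# Outside the block, utf8 warnings are back in force.
printf "U+%04X\n", ord $character;
```

Because the pragma is lexically scoped, nothing outside the braces loses its warnings.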
This is also the situation with pattern matches involving such a character:
will cause a warning or a fatal, depending on how old your perl, or nothing at all, depending on how new your perl is.
You can disable utf8-related warnings only on releases where it matters this way:
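One way to sketch this is with the `if` pragma. The 5.013009 cutoff is an assumption read off the 5.10.0 to 5.13.8 range in the heading above, so adjust it if your affected releases differ:

```perl
use strict;
use warnings;

# Switch off utf8 warnings only on perls older than 5.13.9, where merely
# handling these code points can warn. $] is the numeric perl version,
# for example 5.010001 for 5.10.1. The cutoff is an assumption.
no if $] < 5.013009, warnings => "utf8";

my $character = "\x{ffff}";

# A pattern match involving the same code point.
my $matched = $character =~ /\x{ffff}/ ? "yes" : "no";
print "matched: $matched\n";
```

On newer perls the condition is false, so the full set of warnings stays enabled there.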
‘Fixed in the Next Release’
The really interesting thing is that they (read: Perl5 Porters, and in particular, Karl Williamson) have fixed the bug that requires a no warnings "utf8" guard just to work with any code point at all. It is only the output where you may have to be careful. Watch:
The safest thing to do is put no warnings "utf8" in just the places you need it. But there is no need of an eval!
As of 5.13.10, and hence in 5.14, there are three subcategories of utf8 warnings: surrogate for UTF‑16, nonchar as described below, and non_unicode for supers, also defined below.
An All‐Perl Interchange is Safe
You probably don’t want to suppress the “illegal for interchange” warnings on output, though, because this is true. Well, unless you’re using Perl’s "utf8" encoding, which isn’t the same as its "UTF‑8" encoding, oddly enough. The "utf8" encoding is laxer than the formal standard, because it allows us to do more interesting things than we otherwise could.
However, if and only if you have a 100% pure-perl datapath, you can still use any code point you want, including non-unicode code points up to ᴍᴀxɪɴᴛ. That’s 0x7FFF_FFFF on 32‑bit machines, and something unspeakably huge on 64‑bit machines: 0xFFFF_FFFF_FFFF_FFFF! That’s not just a super; it’s a hypermega!
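The gap between the lax "utf8" and the strict "UTF-8" encodings mentioned above can be sketched with the Encode module. The helper name try_encode is hypothetical, and the code point just past the Unicode range is picked only for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# A code point just past the Unicode range (a "super").
my $super = do { no warnings "utf8"; chr(0x110000) };

# Return the encoded bytes as a hex string, or undef if the encoder refuses.
# try_encode is a name made up for this sketch.
sub try_encode {
    my ($encoding, $string) = @_;
    no warnings "utf8";
    # FB_CROAK makes encode() die instead of substituting a replacement.
    my $bytes = eval { encode($encoding, $string, Encode::FB_CROAK()) };
    return defined $bytes ? unpack("H*", $bytes) : undef;
}

for my $encoding ("utf8", "UTF-8") {
    my $hex = try_encode($encoding, $super);
    printf "%-5s => %s\n", $encoding, defined $hex ? $hex : "refused";
}
```

The lax encoding is expected to pass the code point through, while the strict one should refuse it, which is why only a pure-perl datapath can rely on such values.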
Note that on a 32‑bit machine, that last one produces this:
Varieties of Noncharacters Illegal for Interchange
There are several — quite a few, actually — different classes of code points that are not legal for interchange.
Any code point such that (ord(ᴄᴏᴅᴇᴘᴏɪɴᴛ) & 0xFFFE) == 0xFFFE is true. This covers the last two code points in all possible planes. As it spans 17 planes, Unicode therefore defines 34 such code points. Those are not characters, although they are Unicode code points. Let’s call these the Penults. They fall under the nonchar warning class on 5.13.10 or better.
The 32 code points starting at U+FDD0. These are guaranteed to be Noncharacters, although of course they are still Unicode code points. Like the previous penult set, these too fall under the nonchar warning class on 5.13.10 or better.
The 1024 high surrogates and the 1024 low surrogates, which were carved out as slop to make UTF‑16 possible for all those dumb systems that tried UCS‑2 instead of UTF‑8 or UTF‑32. This cripples the range of valid Unicode code points, restricting them to only the first 21 bits worth. SURROGATES ARE STILL CODE POINTS. They just are not valid for interchange, because they cannot always be correctly represented by brain-dead-clever UTF‑16. Under 5.13.10 or better, these are controlled by the surrogate warning subclass.
Beyond that, we’re now above the Unicode range. I’ll call these Supers. On a 32‑bit machine, you still have (10 or) 11 bits of them beyond the standard 21 bits that Unicode gives you. Perl can use these just fine. That gives 2**32 total code points you can use in your Perl program (well, or 2**31 at least, due to signed overflow). You get a million Unicode code points, but then you get a couple of billion Super code points beyond those that you can use in Perl. If you are running 5.13.10 or better, you can control access to these via the non_unicode warnings subclass.
Perl still follows the rules about Penults even up in the Super range. There are 480 such Superpenults on a 32‑bit machine, and rather more of them on a 64‑bit one.
If you really want to play it nonportably, then if you have native 64‑bit ints, you have another 32 or 33 bits above what the supers give you. You now have 18 quintillion 446 quadrillion 744 trillion 73 billion 709 million 551 thousand and 616 characters. You have a whole exabyte of distinct code points! That’s so far beyond super that I’m going to call them Hypermegas. Ok, so these aren’t very portable, since they require a truly 64‑bit platform. They’re a bit foreign, so maybe we should write that Ὑπέρμεγας to scare people away. :) Note that the rules against penults still apply to hypermegas.
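The classes above can be summed up in a small classifier. This is only a sketch: the sub name and the labels are this example's own and have nothing to do with Perl's warning categories, and the ranges are taken straight from the descriptions above.

```perl
use strict;
use warnings;

# Classify a code point (given as an integer) into the classes described
# above. classify() and its labels are made up for this sketch.
sub classify {
    my ($cp) = @_;
    my @labels;
    push @labels, "super"     if $cp > 0x10FFFF;                  # beyond Unicode
    push @labels, "surrogate" if $cp >= 0xD800 && $cp <= 0xDFFF;  # UTF-16 slop
    push @labels, "nonchar"   if $cp >= 0xFDD0 && $cp <= 0xFDEF;  # the 32 at U+FDD0
    push @labels, "penult"    if ($cp & 0xFFFE) == 0xFFFE;        # last two of a plane
    return @labels ? join("+", @labels) : "ordinary";
}

for my $cp (0x41, 0xD800, 0xFDD0, 0xFFFF, 0x10FFFE, 0x11FFFF) {
    printf "U+%06X  %s\n", $cp, classify($cp);
}
```

A code point can land in more than one class, which is how a superpenult such as 0x11FFFF shows up with both labels.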
The Test Program
I wrote a little program that proves that these code points are cool.
NOTE: That last line above shows Yet Another Stupid Bug in SO’s infernal highlighting code. Notice the last WɪᴋɪWᴏʀᴅ up there, the \p{Greek} one, got left out of the colorization scheme? That means they are only looking for capitalized ASCII identifiers. Très passé! Why bother accepting ᴜɴɪᴄᴏᴅᴇ if you aren’t going to use things like \p{Uppercase} correctly? As you’ll see in my program where I have a @ὑπέρμεγας array, us ᴍᴏᴅᴇʀɴ ᴘʀᴏɢʀᴀᴍᴍɪɴɢ ʟᴀɴɢᴜᴀɢᴇs handle this perfectly fine. ☺
I obviously didn’t run all the supers or the hypers. And on a 32‑bit machine, you’ll only get 4 of the tested hypers. I also didn’t test any of the hyperpenults.
Here’s the testing program, which runs cleanly on all versions from 5.10 and up.