UTF-8 problems in php: var_export() returns \\0 nu

Posted 2019-03-19 00:16

Question:

We are dealing with a strange bug on a Joyent Solaris server that has never happened before (it does not happen on localhost or on two other Solaris servers with an identical PHP configuration). Actually, I'm not sure whether we should be looking at PHP or at Solaris, or whether it is a software or a hardware problem...

I just want to post this in case somebody can point us in the right direction.

So, the problem seems to be in var_export() when dealing with non-ASCII characters. Executing this in the CLI, we get the expected result on our localhost machines and on two of the servers, but not on the third one. All of them are configured to work with UTF-8.

$ php -r "echo var_export('ñu', true);"

This gives the following on the older servers and on localhost (expected):

'ñu'

But on the server we are having problems with (PHP Version => 5.3.6), it outputs \0 null characters whenever it encounters a non-ASCII character: è, á, ç, ... you name it.

'' . "\0" . '' . "\0" . 'u'

Any idea where we should be looking? Thanks in advance.


More info:

  • PHP version 5.3.6.
  • setlocale() is not solving anything.
  • default_charset is UTF-8 in php.ini.
  • mbstring.internal_encoding is set to UTF-8 in php.ini.
  • mbstring.func_overload = 0.
  • This happens in both the CLI (example above) and the web application (php-fpm + nginx).
  • iconv encoding is also UTF-8.
  • All files are UTF-8 encoded.

system('locale') returns:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

Some of the tests done so far (CLI):

Normal behaviour:

$ php -r "echo bin2hex('ñu');" => 'c3b175'
$ php -r "echo mb_strtoupper('ñu');" => 'ÑU'
$ php -r "echo serialize(\"\\xC3\\xB1\");" => 's:2:"ñ";'
$ php -r "echo bin2hex(addcslashes(b\"\\xC3\\xB1\", \"'\\\\\"));" => 'c3b1'
$ php -r "echo ucfirst('iñu');" => 'Iñu'

Not normal:

$ php -r "echo strtoupper('ñu');" => 'U' 
$ php -r "echo ucfirst('ñu');" => '?u' 
$ php -r "echo ucfirst(b\"\\xC3\\xB1u\");" => '?u' 
$ php -r "echo bin2hex(ucfirst('ñu'));" => '00b175'
$ php -r "echo bin2hex(var_export('ñ', 1));" => '2727202e20225c3022202e202727202e20225c3022202e202727'
$ php -r "echo bin2hex(var_export(b\"\\xC3\\xB1\", 1));" => '2727202e20225c3022202e202727202e20225c3022202e202727'

So the problem seems to be in var_export() and in the "string functions that use the current locale but operate byte-by-byte" described in the PHP documentation (see @hakre's answer).

Answer 1:

I suggest you verify the PHP binary you are having problems with. Check the compiler flags it was built with and the libraries it makes use of.
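
One way to act on this is to dump the same set of facts on each server and diff the results. The sketch below is a minimal example of that idea; php_env_fingerprint() is a hypothetical helper name, and the list of keys is only a guess at what is worth comparing.

```php
<?php
// php_env_fingerprint() is a hypothetical helper, not a PHP built-in.
// It collects settings that can differ between "identical" servers.
function php_env_fingerprint()
{
    $extensions = get_loaded_extensions();
    sort($extensions); // stable order makes the output easier to diff

    return array(
        'php_version'     => PHP_VERSION,
        'sapi'            => PHP_SAPI,
        'os'              => PHP_OS,
        'ini_file'        => php_ini_loaded_file(), // false if none was loaded
        'default_charset' => ini_get('default_charset'),
        // Passing "0" queries the current locale without changing it.
        'lc_ctype'        => setlocale(LC_CTYPE, '0'),
        'lc_all'          => setlocale(LC_ALL, '0'),
        'mb_internal'     => function_exists('mb_internal_encoding')
            ? mb_internal_encoding()
            : '(mbstring not loaded)',
        'extensions'      => implode(',', $extensions),
    );
}

foreach (php_env_fingerprint() as $key => $value) {
    echo $key, ' = ', var_export($value, true), "\n";
}
```

Running it on a working server and on the problematic one, then diffing the two outputs, should narrow down where they differ. The build's configure options are not covered here; they appear in the output of php -i.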

Normally PHP uses binary strings internally, which means that functions like ucfirst work byte by byte and only support what your locale supports (if, and as, configured). See "Details of the String Type" in the PHP documentation.

$ php -r "echo ucfirst('ñu');" 

returns

?u

This makes sense: ñ is

LATIN SMALL LETTER N WITH TILDE (U+00F1)    UTF8: \xC3\xB1

You have some locale configured that makes PHP change \xC3 into something else, breaking the UTF-8 byte sequence and making your shell display the � replacement character.

If you really want to analyze the issue, I suggest you start with hexdumps alongside how things get displayed in the shell and elsewhere. Note that you can explicitly define binary strings with b"string" (that is there for forward compatibility; maybe you have some compile flag enabled and you are on the experimental Unicode build?), and you can also write strings as explicit byte escapes, here in hex for UTF-8:

 $ php -r "echo ucfirst(b\"\\xC3\\xB1u\");"
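
As a rough illustration of the hexdump approach, the sketch below runs a few byte-oriented functions on the same input and prints the bytes before and after. compare_bytes() is a hypothetical helper, and the preg_match('//u', ...) check is just one way to test for well-formed UTF-8 (it assumes PCRE was built with UTF-8 support).

```php
<?php
// compare_bytes() is a hypothetical helper, not a PHP built-in.
// It runs a string function and reports input and output as hex,
// so that terminal rendering cannot hide what happened to the bytes.
function compare_bytes($function, $input)
{
    $output = call_user_func($function, $input);

    return array(
        'function'   => $function,
        'in_hex'     => bin2hex($input),
        'out_hex'    => bin2hex($output),
        // True when the output is still a well-formed UTF-8 sequence.
        'valid_utf8' => (bool) preg_match('//u', $output),
    );
}

$sample = "\xC3\xB1u"; // "ñu" written as explicit UTF-8 bytes

foreach (array('ucfirst', 'strtoupper', 'strtolower') as $function) {
    $report = compare_bytes($function, $sample);
    printf(
        "%-10s in=%s out=%s valid_utf8=%s\n",
        $report['function'],
        $report['in_hex'],
        $report['out_hex'],
        $report['valid_utf8'] ? 'yes' : 'no'
    );
}
```

In the question, the affected server reported 00b175 for ucfirst('ñu'), which a report like this would flag as invalid UTF-8.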

And there are a lot more settings that can play a role; I started to list some points in an answer to "Preparing PHP application to use with UTF-8".


Example of a multibyte ucfirst variant:

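/**
 * Multibyte ucfirst.
 *
 * Note: PHP 8.4+ ships a native mb_ucfirst(); rename this helper there.
 *
 * @param string $str
 * @param string|null $encoding (optional) defaults to mb_internal_encoding()
 * @return string
 */
function mb_ucfirst($str, $encoding = NULL)
{
    // Resolve the default explicitly instead of passing NULL through to mbstring.
    if ($encoding === NULL) {
        $encoding = mb_internal_encoding();
    }
    $first = mb_substr($str, 0, 1, $encoding);
    // Measure the remainder in characters (mb_strlen), not bytes (strlen).
    $rest = mb_substr($str, 1, mb_strlen($str, $encoding), $encoding);
    return mb_strtoupper($first, $encoding) . $rest;
}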

See the documentation for mb_strtoupper and also for mb_convert_case.



Answer 2:

Try forcing UTF-8 in PHP:

<?php ini_set('default_charset', 'UTF-8'); ?>

at the very top (first line of code) of every page/template. It mostly helps me with my special characters. I'm not sure it will help you too, but try it.



Answer 3:

Probably all your servers are in a good state. In one of the comments you said that you only have issues with ucfirst() and var_export(). Given those responses, you might be looking at the problem described in this Stack Overflow question, which might be helpful. Most PHP string functions will not work properly with multibyte strings. That is why PHP has a separate set of functions to deal with them.
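
To make the difference between the byte-oriented and the multibyte functions concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and that mbstring.func_overload is 0, as stated in the question; the variable names are only illustrative.

```php
<?php
$text = "\xC3\xB1u"; // "ñu" as UTF-8: two bytes for ñ, one for u

// Byte-oriented functions count and slice bytes.
$byteLength = strlen($text);                 // 3
$byteFirst  = bin2hex(substr($text, 0, 1));  // "c3", only half of ñ

// mbstring functions count and slice characters in the given encoding.
$charLength = mb_strlen($text, 'UTF-8');                 // 2
$charFirst  = bin2hex(mb_substr($text, 0, 1, 'UTF-8'));  // "c3b1", the whole ñ
$upper      = bin2hex(mb_strtoupper($text, 'UTF-8'));    // "c39155", i.e. "ÑU"

echo "bytes=$byteLength chars=$charLength ",
     "byteFirst=$byteFirst charFirst=$charFirst upper=$upper\n";
```

Passing the encoding explicitly keeps the result independent of the mbstring.internal_encoding setting on each server.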



Answer 4:

I normally use utf8_encode('ñu') for all the French characters.
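
One caveat, sketched below: utf8_encode() converts from ISO-8859-1 to UTF-8, so it only helps when the input really is ISO-8859-1. The example uses mb_convert_encoding() as a stand-in for the same conversion (an assumption made here because utf8_encode() is deprecated as of PHP 8.2) and shows what happens when the input is already UTF-8.

```php
<?php
// Assumes the mbstring extension is loaded.
$latin1 = "\xF1";     // "ñ" in ISO-8859-1 (one byte)
$utf8   = "\xC3\xB1"; // "ñ" in UTF-8 (two bytes)

// Correct use: the input really is ISO-8859-1.
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// Incorrect use: the input is already UTF-8, so each byte is treated
// as a separate ISO-8859-1 character and encoded again.
$doubled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($converted), "\n"; // c3b1
echo bin2hex($doubled), "\n";   // c383c2b1
```

Since the question states that all files are already UTF-8 encoded, this kind of conversion would likely produce the doubled form rather than fix the output.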



Answer 5:

PHPUnit tests for this are being added to https://gist.github.com/68f5781a83a8986b9d30. Can we build up a better unit test suite so that we can figure out what the expected output should be?