UCS2/HexEncoded characters to UTF8 in php

Posted 2019-05-12 07:21

I previously asked how to get a UCS-2/hex-encoded string from UTF-8, and I got some help at the following link.

UCS2/HexEncoded characters

But now I need to get the correct UTF-8 from a UCS-2/HexEncoded string in PHP.

For the following strings:

00480065006C006C006F will return 'Hello'

06450631062d0628064b06270020063906270644064500200021 will return (!مرحبا عالم) in Arabic

Tags: php utf-8 ucs2
2 answers
Luminary・发光体
Post #2 · 2019-05-12 07:41

A more accurate conversion of UCS-2 to UTF-8

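function ucs2_to_utf8($h)
{
    if (!is_string($h)) return null;
    $r='';
    // Each UCS-2 code unit is 4 hex digits. mb_chr() (PHP 7.2+, mbstring) encodes it as UTF-8;
    // chr() keeps only the low 8 bits, so it is only correct for ASCII.
    for ($a=0; $a<strlen($h); $a+=4) { $r.=mb_chr(hexdec(substr($h, $a, 4)), 'UTF-8'); }
    return $r;
}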

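The problem with the selected answer is that its hex decoder steps through the string 2 digits at a time instead of 4, so every 00 high byte becomes a NUL character. If those bytes reach the output without going through the encoding conversion, they show up as � when the string is used in HTML attribute values such as title="" or alt="".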
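To see where those NUL bytes come from, here is a minimal sketch. It assumes PHP 5.4 or later (for the built-in hex2bin()) and the mbstring extension; the variable names are only for illustration.

```php
<?php
$hex = '00480065006C006C006F';

// Decoding 2 hex digits at a time yields raw UCS-2 bytes, not text:
// every ASCII letter is preceded by a 00 byte.
$raw = hex2bin($hex);
$nulCount = substr_count($raw, "\0");

// Converting the raw bytes from UCS-2 (big-endian assumed) removes the NULs.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'UCS-2BE');

echo strlen($raw), " raw bytes, ", $nulCount, " of them NUL\n"; // 10 raw bytes, 5 of them NUL
echo $utf8, "\n";                                                // Hello
```

The raw bytes are only a problem if they are printed directly; once they have been converted from UCS-2, the result is ordinary UTF-8.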

小情绪 Triste *
Post #3 · 2019-05-12 07:56

You can recompose a Hex-representation by converting the hexadecimal chars with hexdec(), repacking the component chars, and then using mb_convert_encoding() to convert from UCS-2 into UTF-8. As I mentioned in my answer to your other question, you'll still need to be careful with the output encoding, although here you've specifically requested UTF-8, so we'll use that for the upcoming sample.

Here's a sample that converts hex-encoded UCS-2 to a native UTF-8 string. When this was written PHP didn't ship with a hex2bin() function (one has been built in since PHP 5.4), so we'll use the one posted at the reference link at the end. I've renamed it to local_hex2bin() so that it doesn't conflict with the built-in or with a definition in some other third-party code that you include in your project.

<?php
function local_hex2bin($h)
{
    if (!is_string($h)) return null;
    $r='';
    // Two hex digits make one byte; substr() replaces the $h{$a} offset syntax removed in PHP 8.
    for ($a=0; $a<strlen($h); $a+=2) { $r.=chr(hexdec(substr($h, $a, 2))); }
    return $r;
}

header('Content-Type: text/html; charset=UTF-8');
mb_http_output('UTF-8');
echo '<html><head>';
echo '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />';
echo '</head><body>';
echo 'output encoding: '.mb_http_output().'<br />';
$querystring = $_SERVER['QUERY_STRING'];
// NOTE: we could substitute one of the following:
// $querystring = '06450631062d0628064b06270020063906270644064500200021';
// $querystring = '00480065006C006C006F';
$ucs2string = local_hex2bin($querystring);
// NOTE: The source encoding could also be UTF-16 here.
// TODO: Should check byte-order-mark, if available, in case
//       16-bit-aligned bytes are reversed.
$utf8string = mb_convert_encoding($ucs2string, 'UTF-8', 'UCS-2');
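// Escape anything that came from the request before echoing it into HTML.
echo 'query string: '.htmlspecialchars($querystring).'<br />';
echo 'converted string: '.htmlspecialchars($utf8string).'<br />';
echo '</body></html>';
?>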

Locally, I called this sample page UCS2HexToUTF8.php, and then used a querystring to set the output.

UCS2HexToUTF8.php?06450631062d0628064b06270020063906270644064500200021
--
output encoding: UTF-8
query string: 06450631062d0628064b06270020063906270644064500200021
converted string: مرحبًا عالم !

UCS2HexToUTF8.php?00480065006C006C006F
--
output encoding: UTF-8
query string: 00480065006C006C006F
converted string: Hello

Here's the link to the original source of the hex2bin() function.
PHP: bin2hex(), comment #86123 @ php.net
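On PHP 5.4 or later, the built-in hex2bin() can replace local_hex2bin(), which shortens the conversion to a few lines. The following is a minimal sketch rather than a drop-in replacement: the function name ucs2hex_to_utf8() is hypothetical, the nullable return type needs PHP 7.1+, and it assumes big-endian input with no byte-order mark.

```php
<?php
// Hypothetical helper: converts hex-encoded UCS-2 (big-endian assumed) to UTF-8.
// Returns null when the input cannot be a sequence of whole 16-bit code units.
function ucs2hex_to_utf8(string $hex): ?string
{
    // One UCS-2 code unit is 2 bytes, i.e. 4 hex digits.
    if ($hex === '' || strlen($hex) % 4 !== 0 || !ctype_xdigit($hex)) {
        return null;
    }
    // hex2bin() is built in since PHP 5.4; the input was validated above.
    $bytes = hex2bin($hex);
    return mb_convert_encoding($bytes, 'UTF-8', 'UCS-2BE');
}

echo ucs2hex_to_utf8('00480065006C006C006F'), "\n"; // Hello
```

Validating the length and the digits up front keeps malformed query strings from producing partial or garbled output.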

Also, as noted in my comments before the call to mb_convert_encoding(), you'll probably want to detect which byte order the source uses, especially if the data can come from machines whose CPUs differ in endianness.

Here's a link that can help you identify the byte-order marks (BOM).
Byte order mark @ Wikipedia
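A byte-order check could be sketched as follows. The function name ucs2_bytes_to_utf8() and its $default parameter are hypothetical; 'UCS-2BE' and 'UCS-2LE' are encoding names that mbstring accepts. The sketch assumes the input has already been decoded from hex to raw bytes, and it falls back to big-endian when no BOM is present.

```php
<?php
// Hypothetical helper: picks the byte order from a leading BOM, if any,
// strips the BOM, and converts the remaining UCS-2 bytes to UTF-8.
function ucs2_bytes_to_utf8(string $bytes, string $default = 'UCS-2BE'): string
{
    $bom = substr($bytes, 0, 2);
    if ($bom === "\xFE\xFF") {          // U+FEFF stored big-endian
        $encoding = 'UCS-2BE';
        $bytes = (string) substr($bytes, 2);
    } elseif ($bom === "\xFF\xFE") {    // U+FEFF stored little-endian
        $encoding = 'UCS-2LE';
        $bytes = (string) substr($bytes, 2);
    } else {
        $encoding = $default;           // no BOM: fall back to the assumed order
    }
    return mb_convert_encoding($bytes, 'UTF-8', $encoding);
}

echo ucs2_bytes_to_utf8(hex2bin('FFFE48006900')), "\n"; // Hi
```

Without a BOM there is no reliable way to tell the two byte orders apart, so the default should match whatever the producing system is known to emit.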
