UTF-8 to Unicode Code Points

Posted 2019-01-06 21:47

Is there a function that converts a UTF-8 string to Unicode code points, leaving non-special characters as normal letters and numbers?

i.e., the German word "tchüß" would be rendered as something like "tch\20AC\21AC" (please note that I am making the Unicode codes up).

EDIT: I am experimenting with the following function. It works well for ASCII 32-127, but it seems to fail for multibyte characters:

function strToHex ($string)
{
    $hex = '';
    for ($i = 0; $i < mb_strlen ($string, "utf-8"); $i++)
    {
        $id = ord (mb_substr ($string, $i, 1, "utf-8"));
        $hex .= ($id <= 128) ? mb_substr ($string, $i, 1, "utf-8") : "&#" . $id . ";";
    }

    return ($hex);
}

Any ideas?

EDIT 2: Found a solution: the PHP ord() function only looks at the first byte of its argument, so it does not work for multibyte characters. Use this instead: http://nl.php.net/manual/en/function.ord.php#78032
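
For later readers, here is a minimal sketch of one way to get the output the question asks for (ASCII left alone, everything else as a numeric entity) without calling ord() at all. It assumes the mbstring extension is loaded; the helper name encodeNonAsciiAsEntities is made up for illustration and is not the code from the linked manual comment.

```php
<?php
// Hypothetical helper: leaves ASCII untouched and turns every other
// character into a decimal numeric entity. Requires mbstring.
function encodeNonAsciiAsEntities(string $string): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    // This range covers everything above ASCII.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($string, $map, 'UTF-8');
}

echo encodeNonAsciiAsEntities('tchüß'), "\n"; // tch&#252;&#223;
```

The conversion map decides which code points get encoded, so changing the first value moves the threshold (for example, to also encode some ASCII punctuation).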

8 answers
贼婆χ
#2 · 2019-01-06 22:15

I guess you're going to print out your strings on a website?

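I store all my databases in UTF-8 and call htmlentities($string) before output.

Maybe you have to try htmlentities(utf8_encode($string)); note that utf8_encode() expects ISO-8859-1 input, so this only helps if the string is not already UTF-8.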

别忘想泡老子
#3 · 2019-01-06 22:16

With PHP 7, there is a new IntlChar::ord() (part of the intl extension) that returns the Unicode code point of a given UTF-8 character:

var_dump(sprintf('U+%04X', IntlChar::ord('ß')));

# Outputs: string(6) "U+00DF"
唯我独甜
#4 · 2019-01-06 22:16

I once created a function called _convert() which safely encodes everything to UTF-8.

来,给爷笑一个
#5 · 2019-01-06 22:16

Tested on PHP 5.6

/**
 * @param string $utf8char
 * @return string
 */
function toUnicodeCodePoint($utf8char)
{
    return 'U+' . dechex(mb_ord($utf8char));
}

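// mbstring ships its own mb_ord() since PHP 7.2, so only define the
// fallback when it is missing (redeclaring it would be a fatal error).
if (!function_exists('mb_ord')) {
    /**
     * @see https://github.com/symfony/polyfill-mbstring
     * @param string $s
     * @return int
     */
    function mb_ord($s)
    {
        // Read up to four bytes; the lead byte determines the sequence length.
        $code = ($s = unpack('C*', substr($s, 0, 4))) ? $s[1] : 0;
        if (0xF0 <= $code) {
            return (($code - 0xF0) << 18) + (($s[2] - 0x80) << 12) + (($s[3] - 0x80) << 6) + $s[4] - 0x80;
        }
        if (0xE0 <= $code) {
            return (($code - 0xE0) << 12) + (($s[2] - 0x80) << 6) + $s[3] - 0x80;
        }
        if (0xC0 <= $code) {
            return (($code - 0xC0) << 6) + $s[2] - 0x80;
        }

        return $code;
    }
}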

echo toUnicodeCodePoint('                                                                    
不美不萌又怎样
#6 · 2019-01-06 22:23

Converting one character set to another can be done with iconv:

http://php.net/manual/en/function.iconv.php

Note that UTF-8 is already a Unicode encoding.

Another way is simply using htmlentities with the right character set:

http://php.net/manual/en/function.htmlentities.php
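
A quick sketch of both suggestions, assuming the input really is UTF-8 and that the iconv extension is available. The variable names are only for illustration.

```php
<?php
$word = 'tchüß';

// htmlentities() with an explicit charset yields named entities,
// not code point numbers.
$entities = htmlentities($word, ENT_QUOTES, 'UTF-8');
echo $entities, "\n"; // tch&uuml;&szlig;

// iconv() re-encodes the bytes. For characters in the Basic Multilingual
// Plane, the UTF-16BE code unit equals the code point.
$utf16Hex = bin2hex(iconv('UTF-8', 'UTF-16BE', 'ß'));
echo $utf16Hex, "\n"; // 00df
```

So htmlentities() is enough if named entities are acceptable, while iconv() only changes the encoding and still needs a step like bin2hex() to expose the numeric value.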

Viruses.
#7 · 2019-01-06 22:25

For people looking to find the Unicode code point of any character, this might be useful. You can then encode the string however you want, replacing certain characters with escape codes and leaving others in their binary form (e.g. printable ASCII characters), depending on the context in which you want to use it.

From: Mapping codepoints to Unicode encoding forms

The mapping for UTF-32 is, essentially, the identity mapping: the 32-bit code unit used to encode a codepoint has the same integer value as the codepoint itself.

/**
 * Convert a string into an array of decimal Unicode code points.
 *
 * @param string $string   The string to convert to code points
 * @param string $encoding The encoding of $string
 *
 * @return array Array of decimal code points for every character of $string
 */
function toCodePoint($string, $encoding)
{
    $utf32  = mb_convert_encoding($string, 'UTF-32', $encoding);
    $length = mb_strlen($utf32, 'UTF-32');
    $result = [];

    for ($i = 0; $i < $length; ++$i) {
        $result[] = hexdec(bin2hex(mb_substr($utf32, $i, 1, 'UTF-32')));
    }

    return $result;
}
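
A shorter variant of the same UTF-32 idea, as a sketch: since every code point occupies exactly four bytes, unpack() can read them all in one call. The function name codePointsViaUnpack is hypothetical, and the byte order is pinned to UTF-32BE here so that it matches the 'N' (big-endian) unpack format.

```php
<?php
// Hypothetical helper: code points of $string as an array of integers.
// Requires mbstring.
function codePointsViaUnpack(string $string, string $encoding = 'UTF-8'): array
{
    $utf32 = mb_convert_encoding($string, 'UTF-32BE', $encoding);
    if ($utf32 === '') {
        return [];
    }

    // 'N*' reads every 4-byte big-endian unsigned integer; unpack() returns
    // a 1-based array, so reindex it from 0.
    return array_values(unpack('N*', $utf32));
}

$labels = array_map(
    function (int $cp): string {
        return sprintf('U+%04X', $cp);
    },
    codePointsViaUnpack('tchüß')
);

echo implode(' ', $labels), "\n"; // U+0074 U+0063 U+0068 U+00FC U+00DF
```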