UTF-8 to equivalent number in PHP

Posted 2019-07-28 11:19

Question:

I've been searching everywhere for a PHP function to convert a UTF-8 character to its equivalent number. I'm not entirely sure what the number is called (I heard it's called an ordinate?), but here's an example: http://jrgraphix.net/r/Unicode/3040-309F

Basically I'm trying to read a UTF-8 .txt file in PHP and then save every line in an array, so I can mess around with it.

If anyone can assist me with this, it would be highly appreciated, as I am not that familiar with UTF-8 yet.

Edit: This is what I've got so far:

echo "var TextCharacters = new Array();\n";

$LineArray = array();
$file_handle = fopen("lesson1.txt", "r");


// Loop on fgets() itself: testing feof() first can push a trailing false into the array.
while (($line_of_text = fgets($file_handle)) !== false)
{
  array_push($LineArray, $line_of_text);
}

fclose($file_handle);

foreach($LineArray as $s)
{
    for($i = 0; $i < mb_strlen($s,"utf-8"); $i++)
    {
        $char = mb_substr($s, $i, 1, "utf-8");
        echo "alert(go(" . bin2hex(iconv('UTF-8', 'UCS-2', $char)) . "));";         
    }
}

Answer 1:

What you're looking for is the Unicode code point, i.e. the numeric identifier by which the character is known in the Unicode character table. The "cheapest" way to get it is through the UCS-2 character encoding, whose two-byte values map 1:1 onto Unicode code points:

// UCS-2BE pins the byte order; plain 'UCS-2' may come out byte-swapped on some systems.
echo bin2hex(iconv('UTF-8', 'UCS-2BE', 'あ'));
// 3042

Caveats: the returned code is always 4 hexadecimal digits long (which you may or may not like), and UCS-2 does not support characters outside the Basic Multilingual Plane (BMP), i.e. above code point FFFF.
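If characters above FFFF matter, one possible workaround is to go through UTF-32BE instead, which stores every code point as a single 4-byte big-endian integer. The following is only a sketch: the helper name utf8_code_point is made up for illustration, and it assumes the iconv extension is available.

```php
<?php
// Hypothetical helper (not part of PHP): returns the Unicode code point
// of the first character in a UTF-8 string.
function utf8_code_point(string $char): int
{
    // UTF-32BE encodes each code point as one 4-byte big-endian integer.
    $utf32 = @iconv('UTF-8', 'UTF-32BE', $char);
    if ($utf32 === false || strlen($utf32) < 4) {
        throw new InvalidArgumentException('Expected a non-empty, valid UTF-8 string');
    }
    // 'N' unpacks an unsigned 32-bit big-endian integer; the result is at index 1.
    return unpack('N', substr($utf32, 0, 4))[1];
}

printf("U+%04X\n", utf8_code_point('あ'));  // U+3042
printf("U+%04X\n", utf8_code_point('😀')); // U+1F600
```

On PHP 7.2 or later with mbstring loaded, mb_ord($char, 'UTF-8') should give the same integer without the conversion step.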



Answer 2:

There is nothing magic about UTF-8 in PHP. When you read the file, you get raw byte values, not parsed characters. Iterate over the data you've read and use ord() to get the decimal value of each byte.

If you want to do this per UTF-8 character, you can use either mb_substr or iconv_substr to extract each character, and then use ord() to print the value of each byte that makes up the character.

Update: To expand with a complete solution:

utf8.test: fooÆØÅござ

$utf8 = file_get_contents("utf8.test");

for ($i = 0; $i < mb_strlen($utf8, "utf-8"); $i++)
{
    $char = mb_substr($utf8, $i, 1, "utf-8");

    print($char);
    print("\n");

    for ($j = 0; $j < strlen($char); $j++)
    {
        print(dechex(ord($char[$j])));
    }

    print("\n\n");
}

Output:

f
66

o
6f

o
6f

Æ
c386

Ø
c398

Å
c385

ご
e38194

ざ
e38196
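Note that the hex values printed above are the raw UTF-8 bytes, not the code points themselves (for example, ご is e3 81 94 in UTF-8 but U+3054 in Unicode). As a rough sketch of how the bytes could be combined into a code point using only ord() and bit operations: the helper name utf8_bytes_to_code_point is invented for this example, and it assumes the input is a single well-formed UTF-8 character (continuation bytes are not validated).

```php
<?php
// Hypothetical helper: decodes ONE well-formed UTF-8 character to its code point.
// It does not check that continuation bytes have the 10xxxxxx form.
function utf8_bytes_to_code_point(string $char): int
{
    if ($char === '') {
        throw new InvalidArgumentException('Empty string');
    }
    $lead = ord($char[0]);
    if ($lead < 0x80) {
        return $lead;                       // 0xxxxxxx: plain ASCII, one byte
    } elseif (($lead & 0xE0) === 0xC0) {
        $cp = $lead & 0x1F; $extra = 1;     // 110xxxxx: two-byte sequence
    } elseif (($lead & 0xF0) === 0xE0) {
        $cp = $lead & 0x0F; $extra = 2;     // 1110xxxx: three-byte sequence
    } elseif (($lead & 0xF8) === 0xF0) {
        $cp = $lead & 0x07; $extra = 3;     // 11110xxx: four-byte sequence
    } else {
        throw new InvalidArgumentException('Invalid UTF-8 lead byte');
    }
    if (strlen($char) < $extra + 1) {
        throw new InvalidArgumentException('Truncated UTF-8 sequence');
    }
    // Each continuation byte contributes its low 6 bits.
    for ($i = 1; $i <= $extra; $i++) {
        $cp = ($cp << 6) | (ord($char[$i]) & 0x3F);
    }
    return $cp;
}

foreach (['f', 'Æ', 'ご', 'ざ'] as $c) {
    printf("%s U+%04X\n", $c, utf8_bytes_to_code_point($c));
}
// f U+0066
// Æ U+00C6
// ご U+3054
// ざ U+3056
```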

Hope that helps.