Convert UCS-2 file to UTF-8 with PHP

Posted 2019-06-25 07:32

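I have a CSV file, provided by a client, that has to be parsed and inserted into a database using PHP.

Before inserting the data into the database I want to convert it to UTF-8, but I can't seem to find out how.

This is what I get when I try to detect the file's encoding: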

$ enca -d -L zh ./artigos.txt 
    ./artigos.txt: Universal character set 2 bytes; UCS-2; BMP
    CRLF line terminators
    Byte order reversed in pairs (1,2 -> 2,1)

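I tried using the iconv function, but it garbles the conversion and the result shows different characters from the original.

The first line of the file (base64-encoded):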

IgAwADMAMQAxADkAIgAsACIANwAzADEAMwA0ADYAMgA2ADQAMAAwADEANQAiACwAIgBBAGcAcgBhAGYAYQBkAG8AcgAgAFIAYQBwAGkAZAAgADkAIABIAGUAYQB2AHkAIABEAHUAdAB5ACIALAAiAEEAZwByAGEAZgBvACAAOQAvADgALAAgADkALwAxADAALAAgADkALwAxADIALAAgADkALwAxADQAIgAsACIAMQAxADAAZgBsAHMAIgAsACIAIgAsACIAIgAsACIAIgAsACIAMAAzADEAMQA5AC4AagBwAGcAIgAsACIAIgAsACIAMQAsADIAMAAiACwAIgA1ADkALAA5ADAAIgAsACIAMgAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIARgBhAGwAcwBlACIADQAK

Answer 1:

This seems to work (little-endian), although you didn't include any non-ASCII characters:

$s='IgAwADMAMQAxADkAIgAsACIANwAzADEAMwA0ADYAMgA2ADQAMAAwADEANQAiACwAIgBBAGcAcgBhAGYAYQBkAG8AcgAgAFIAYQBwAGkAZAAgADkAIABIAGUAYQB2AHkAIABEAHUAdAB5ACIALAAiAEEAZwByAGEAZgBvACAAOQAvADgALAAgADkALwAxADAALAAgADkALwAxADIALAAgADkALwAxADQAIgAsACIAMQAxADAAZgBsAHMAIgAsACIAIgAsACIAIgAsACIAIgAsACIAMAAzADEAMQA5AC4AagBwAGcAIgAsACIAIgAsACIAMQAsADIAMAAiACwAIgA1ADkALAA5ADAAIgAsACIAMgAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIARgBhAGwAcwBlACIADQAK';
$t = base64_decode($s);
// The sample ends with a lone 0x0A byte (half of the 2-byte line feed), so drop it before converting.
echo iconv('UCS-2LE', 'UTF-8', substr($t, 0, -1));
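
The call above works on the sample because the dangling byte is removed by hand. A minimal sketch of how that could be generalized, assuming the input really is little-endian UCS-2 as enca reports; the helper name `ucs2le_to_utf8` is made up for illustration and is not part of the original answer:

```php
<?php
// Hypothetical helper: convert a UCS-2LE byte string to UTF-8, tolerating
// a leading byte order mark and a dangling odd byte at the end.
function ucs2le_to_utf8(string $bytes): string
{
    // Drop a little-endian byte order mark (FF FE) if present.
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        $bytes = (string) substr($bytes, 2);
    }
    // UCS-2 code units are 2 bytes wide; an odd length means the input was
    // cut in the middle of a unit (for example by splitting on the 0x0A byte).
    if (strlen($bytes) % 2 === 1) {
        $bytes = (string) substr($bytes, 0, -1);
    }
    $utf8 = iconv('UCS-2LE', 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException('iconv could not convert the input');
    }
    return $utf8;
}

// "Olá" followed by CR and a truncated LF, mimicking the end of the sample line.
$sample = "O\x00l\x00\xE1\x00\r\x00\n";
echo ucs2le_to_utf8($sample);
```

Reading the file in one go (rather than line by line with a byte-oriented reader) avoids the truncated line feed in the first place, so the odd-length check is only a safety net.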


Answer 2:

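Microsoft Excel's CSV encoding is generally little-endian (it took me a long time to find that out). If you want to use such files with something like fgetcsv, you should convert the file to UTF-8 first. I do the following: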

        $str = file_get_contents($file);
        // Name the byte order explicitly: the source is little-endian UCS-2.
        $str = mb_convert_encoding($str, 'UTF-8', 'UCS-2LE');
        file_put_contents("converted_" . $file, $str);
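
Once the text is UTF-8 it can be handed to fgetcsv. A rough sketch of that step, assuming the file fits in memory and is little-endian UCS-2; the function name `read_ucs2le_csv` is hypothetical, and the delimiter, enclosure and escape characters are simply the fgetcsv defaults spelled out:

```php
<?php
// Hypothetical helper: read a little-endian UCS-2 CSV file and return its
// rows as arrays of UTF-8 strings.
function read_ucs2le_csv(string $path): array
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $path");
    }
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UCS-2LE');
    // A byte order mark, if the file had one, may survive as U+FEFF;
    // strip it so it does not end up inside the first field.
    if (strncmp($utf8, "\xEF\xBB\xBF", 3) === 0) {
        $utf8 = (string) substr($utf8, 3);
    }
    // fgetcsv needs a stream, so park the converted text in a temp stream.
    $fh = fopen('php://temp', 'r+');
    fwrite($fh, $utf8);
    rewind($fh);
    $rows = [];
    while (($row = fgetcsv($fh, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($fh);
    return $rows;
}
```

Calling `read_ucs2le_csv('./artigos.txt')` would then yield one array per CSV line, ready to be bound to a prepared INSERT statement. For very large files a streaming approach would be preferable to loading everything at once.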


Answer 3:

Python:

One way to encode is:

text -> UTF-16BE -> hex

To convert back:

hex to binary, then from UTF-16BE to text

Note: UCS-2BE has been deprecated; move to UTF-16BE.

Decoder

import binascii
code = '098 ... '
decoded_text = binascii.unhexlify(code).decode('utf-16-be')
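
Since the question is about PHP, the same round trip can be sketched there as well. This is an illustrative equivalent rather than part of the original answer; the sample text and variable names are assumptions:

```php
<?php
// Encode: UTF-8 text -> UTF-16BE bytes -> hex string.
$text = "Ol\u{E1}"; // "Olá", written with an escape so the source file encoding does not matter
$hex = bin2hex(mb_convert_encoding($text, 'UTF-16BE', 'UTF-8'));

// Decode: hex string -> raw bytes -> UTF-8 text.
$decoded = mb_convert_encoding(hex2bin($hex), 'UTF-8', 'UTF-16BE');

echo $hex, "\n";     // 004f006c00e1
echo $decoded, "\n"; // Olá
```

Each character in the Basic Multilingual Plane becomes four hex digits, with the high byte first because of the big-endian byte order.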


Source: Convert UCS-2 file to UTF-8 with PHP
Tags: php encoding