Error: "Input is not proper UTF-8, indicate encoding !"

I'm getting the error:

parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xED 0x6E 0x2C 0x20

This happens when I try to process an XML response from a 3rd party source using simplexml_load_string. The raw XML response does declare its encoding:

<?xml version="1.0" encoding="UTF-8"?>

Yet it seems that the XML is not really UTF-8. The language of the XML content is Spanish, and it contains words like Dublín.

I'm unable to get the 3rd party to sort out their XML.

How can I pre-process the XML and fix the encoding incompatibilities?

Is there a way to detect the correct encoding for an XML file?

Tags: php xml encoding character-encoding simplexml

10 answers

beautiful°

#2 · 2019-01-04 09:36

I just had this problem. It turned out that the XML file itself (not its contents) was encoded in ISO-8859-1, not UTF-8. You can check this on a Mac with file -I xml_filename.

I used Sublime to change the file encoding to UTF-8, and lxml imported it with no issues.


我命由我不由天

#3 · 2019-01-04 09:39

Your 0xED 0x6E 0x2C 0x20 bytes correspond to "ín, " in ISO-8859-1, so it looks like your content is in ISO-8859-1, not UTF-8. Tell your data provider about it and ask them to fix it, because if it doesn't work for you it probably doesn't work for other people either.
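To check the byte-level claim above yourself, here is a minimal sketch. It assumes the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// The four bytes quoted in the libxml error message.
$bytes = "\xED\x6E\x2C\x20";

// As UTF-8 they are malformed: 0xED opens a three-byte sequence,
// but 0x6E ("n") is not a continuation byte.
$isValidUtf8 = mb_check_encoding($bytes, 'UTF-8');

// Read as ISO-8859-1, the same bytes are ordinary text.
$decoded = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

var_dump($isValidUtf8);  // bool(false)
echo $decoded, "\n";     // prints "ín, " (as UTF-8)
```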

There are a few ways to work around it, which you should only use if you cannot load the XML normally. One of them is utf8_encode(). The downside is that if the XML contains both valid UTF-8 and some ISO-8859-1, the result will contain mojibake. Alternatively, you can try converting the string from UTF-8 to UTF-8 using iconv() or mbstring and hope they fix it for you. (They won't, but you can at least drop the invalid characters so that the XML loads.)

Or you can take the long, long road and validate and fix the sequences yourself. That will take a while, depending on how familiar you are with UTF-8. Perhaps there are libraries out there that do this, although I don't know of any.
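A minimal sketch of the "UTF-8 to UTF-8" route with mbstring. The sample payload is an assumption made up for illustration. The conversion is lossy: each malformed sequence is replaced by mbstring's substitute character ("?" unless changed with mb_substitute_character()), so the original letter is gone, but the document becomes loadable.

```php
<?php
// Sample payload (an assumption): labelled UTF-8, but the "í" is the
// single ISO-8859-1 byte 0xED.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><city>Dubl\xEDn</city>";

// Converting UTF-8 to UTF-8 forces mbstring to re-validate the input and
// replace every malformed sequence with its substitute character.
$clean = mb_convert_encoding($raw, 'UTF-8', 'UTF-8');

// Collect parser errors instead of emitting warnings.
libxml_use_internal_errors(true);
$xml = simplexml_load_string($clean);

var_dump($xml !== false);
```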
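For the validate-it-yourself route, a rough sketch could look like the following. It is not a library function: repair_mixed_utf8 is a hypothetical name, it needs mbstring, and it assumes that every byte which is not part of a well-formed UTF-8 sequence was meant as ISO-8859-1. If the stray bytes come from some other encoding, the output will be wrong.

```php
<?php
// Hypothetical helper: keep well-formed UTF-8 sequences untouched and
// re-encode every other byte as if it were ISO-8859-1 (an assumption).
function repair_mixed_utf8(string $bytes): string
{
    // Group 1: exactly one well-formed UTF-8 character (1 to 4 bytes).
    // Group 2: any single byte that did not fit the alternatives above.
    // No "u" modifier, so PCRE works on raw bytes.
    $pattern = '/
        (
            [\x00-\x7F]
          | [\xC2-\xDF][\x80-\xBF]
          | \xE0[\xA0-\xBF][\x80-\xBF]
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
          | \xED[\x80-\x9F][\x80-\xBF]
          | \xF0[\x90-\xBF][\x80-\xBF]{2}
          | [\xF1-\xF3][\x80-\xBF]{3}
          | \xF4[\x80-\x8F][\x80-\xBF]{2}
        )
      | (.)
    /xs';

    return preg_replace_callback($pattern, function (array $m): string {
        if ($m[1] !== '') {
            return $m[1]; // already valid UTF-8, keep as is
        }
        // Stray byte: treat it as ISO-8859-1 and convert it to UTF-8.
        return mb_convert_encoding($m[2], 'UTF-8', 'ISO-8859-1');
    }, $bytes);
}

// "Dubl\xEDn" carries a stray ISO-8859-1 byte, "caf\xC3\xA9" is valid UTF-8.
$fixed = repair_mixed_utf8("Dubl\xEDn, caf\xC3\xA9");
var_dump(mb_check_encoding($fixed, 'UTF-8')); // bool(true)
```

Matching one character per callback is slow on large documents, but it keeps the pattern simple and avoids long backtracking chains.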

Either way, notify your data provider that they're sending invalid data so that they can fix it.

Here's a partial fix. It will definitely not fix everything, but it will fix some of it. Hopefully enough for you to get by until your provider fixes their stuff.

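// Finds every byte in 0xA1-0xFF that is not followed by two or more
// UTF-8 continuation bytes (0x80-0xBF) and passes it to the callback.
function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
    return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}

// Callback: re-encodes the matched byte from ISO-8859-1 to UTF-8.
function utf8_encode_callback($m)
{
    return utf8_encode($m[0]);
}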


混吃等死

#4 · 2019-01-04 09:39

If you download the XML file and open it in, for example, Notepad++, you'll see that the encoding is set to something other than UTF-8. I've had the same problem with XML I made myself, and it was just the encoding in the editor :)

The string <?xml version="1.0" encoding="UTF-8"?> doesn't change how the document's bytes are encoded; it only tells a parser, validator or other consumer which encoding to expect.
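Since the declaration is only a label, one possible workaround is to correct the label rather than the bytes. The sketch below assumes the payload really is ISO-8859-1 throughout (as in the question) and that the declaration sits at the very start of the string. The sample document is made up.

```php
<?php
// Sample payload (an assumption): ISO-8859-1 bytes mislabelled as UTF-8.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><city>Dubl\xEDn</city>";

// Rewrite only the encoding label inside the XML declaration.
$relabelled = preg_replace(
    '/^(<\?xml[^>]*encoding=["\'])UTF-8(["\'])/i',
    '${1}ISO-8859-1${2}',
    $raw,
    1
);

// libxml now decodes the bytes as ISO-8859-1; SimpleXML hands back UTF-8.
$xml = simplexml_load_string($relabelled);
echo (string) $xml, "\n"; // prints "Dublín" (as UTF-8)
```

This breaks in the opposite direction if the provider ever starts sending real UTF-8, so it is only worth doing after checking the bytes first.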


SAY GOODBYE

#5 · 2019-01-04 09:42

I solved this using

$content = utf8_encode(file_get_contents('http://example.com/rss.xml'));
$xml = simplexml_load_string($content);
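One caveat with converting unconditionally: if the feed is (or later becomes) valid UTF-8, the conversion double-encodes it. A guarded variant might look like this. The function name is hypothetical, mbstring is assumed, and ISO-8859-1 is assumed as the fallback encoding. It uses mb_convert_encoding() because utf8_encode() is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: convert only when the payload is not valid UTF-8.
// Returns a SimpleXMLElement, or false if parsing still fails.
function load_xml_with_latin1_fallback(string $raw)
{
    if (!mb_check_encoding($raw, 'UTF-8')) {
        // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
        $raw = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
    }
    return simplexml_load_string($raw);
}

// Made-up samples: the same document as ISO-8859-1 bytes and as real UTF-8.
$head   = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
$latin1 = load_xml_with_latin1_fallback($head . "<city>Dubl\xEDn</city>");
$utf8   = load_xml_with_latin1_fallback($head . "<city>Dubl\xC3\xADn</city>");

var_dump((string) $latin1 === (string) $utf8); // bool(true)
```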

