How to convert any character encoding to UTF8 on P

I'm working on a web crawler that grabs data from sites all over the world, so it has to deal with many different languages and encodings.

Currently I'm using the following function, and it works in 99% of cases. But the remaining 1% is giving me headaches.

function convertEncoding($str) {
    return iconv(mb_detect_encoding($str), "UTF-8", $str);
}

Tags: php encoding utf-8

3 answers

一夜七次

Reply #2 · 2020-03-04 08:19

It's not possible to detect the character set of a string with 100% accuracy, since some character sets are subsets of others. Set the source character set explicitly whenever possible, and avoid mixing iconv and mbstring functions. I recommend using a function like this and supplying the source charset whenever you know it:

function convertEncoding($str, $from = 'auto', $to = 'UTF-8') {
    // Guess the source encoding only when the caller did not supply one.
    if ($from === 'auto') $from = mb_detect_encoding($str);
    return mb_convert_encoding($str, $to, $from);
}
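
A minimal sketch of how the detection fallback could be made stricter, assuming the crawler can name a short list of candidate encodings. The name `convertEncodingStrict` and the candidate list are illustrative and not part of the answer above. Passing `true` as the third argument to `mb_detect_encoding()` turns on strict mode, and the function returns `false` when no candidate fits.

```php
<?php
// Hypothetical helper: convert $str to $to, detecting the source
// encoding only when the caller did not supply one.
function convertEncodingStrict(string $str, ?string $from = null, string $to = 'UTF-8'): ?string {
    if ($from === null) {
        // The candidate list is an assumption; adjust it to the sites being crawled.
        $from = mb_detect_encoding($str, ['UTF-8', 'SJIS', 'EUC-JP'], true);
        if ($from === false) {
            return null; // no candidate matched; let the caller decide what to do
        }
    }
    return mb_convert_encoding($str, $to, $from);
}

// Usage: the source charset is known, so no guessing is needed.
$latin1 = "caf\xE9"; // "café" encoded as ISO-8859-1
echo convertEncodingStrict($latin1, 'ISO-8859-1'), "\n";
```

Returning `null` instead of guessing keeps undetectable pages visible to the caller, so they can be logged or retried rather than silently mangled.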


小情绪 Triste *

Reply #3 · 2020-03-04 08:32

Rather than blindly trying to detect the encoding, you should first check if the page that you downloaded has a listed character set. The character set may be set in the HTTP response header, for example:

Content-Type: text/html; charset=utf-8

Or in the HTML as a meta tag, for example:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Only if neither is available should you try to guess the encoding with mb_detect_encoding() or other methods.
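
The order of checks described above could be sketched roughly as follows. The function names `findDeclaredCharset` and `toUtf8` are hypothetical, the regular expressions are a simplification rather than a full HTML parser, and the sketch assumes the declared charset is a name that mbstring recognizes (an unknown name makes `mb_convert_encoding()` fail).

```php
<?php
// Hypothetical helper: return the charset a page declares, checking the
// HTTP Content-Type header first and the HTML meta tags second.
function findDeclaredCharset(?string $contentTypeHeader, string $html): ?string {
    $pattern = 'charset\s*=\s*["\']?([\w.:-]+)';
    // 1. HTTP header value, e.g. "text/html; charset=utf-8"
    if ($contentTypeHeader !== null
        && preg_match('/' . $pattern . '/i', $contentTypeHeader, $m)) {
        return strtoupper($m[1]);
    }
    // 2. <meta http-equiv="Content-Type" content="...; charset=..."> or <meta charset="...">
    if (preg_match('/<meta[^>]+' . $pattern . '/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return null; // nothing declared
}

// Hypothetical helper: convert a downloaded page to UTF-8.
function toUtf8(?string $contentTypeHeader, string $html): string {
    $from = findDeclaredCharset($contentTypeHeader, $html);
    if ($from === null) {
        // 3. Last resort: guess in strict mode; treating undetectable input
        //    as UTF-8 is an assumption of this sketch.
        $from = mb_detect_encoding($html, mb_detect_order(), true) ?: 'UTF-8';
    }
    return mb_convert_encoding($html, 'UTF-8', $from);
}
```

With cURL, the header value would typically come from `curl_getinfo($ch, CURLINFO_CONTENT_TYPE)`. Declared charsets are sometimes wrong, so a production crawler may still want to validate the converted result.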


爱情/是我丢掉的垃圾

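Reply #4 · 2020-03-04 08:39

You can try utf8_encode($str). Note that it only converts from ISO-8859-1 to UTF-8, so it helps only when the source is known to be ISO-8859-1.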

http://www.php.net/manual/en/function.utf8-encode.php#89789

Or you can replace the content-type meta tag in the head of the crawled content with:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
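
A rough sketch of how that meta tag rewrite might look, assuming the page bytes have already been converted to UTF-8 by some other step. The name `declareUtf8` is hypothetical, and the regular expression is a simplification that only handles an existing charset declaration. Rewriting the tag does not change the encoding of the content itself.

```php
<?php
// Hypothetical helper: point an existing charset declaration at UTF-8.
// It only edits the markup; the bytes must already be UTF-8.
function declareUtf8(string $html): string {
    $result = preg_replace(
        '/(<meta[^>]+charset\s*=\s*["\']?)[\w.:-]+/i',
        '${1}utf-8',
        $html,
        1 // rewrite only the first declaration
    );
    return $result ?? $html; // preg_replace() returns null on a regex error
}

// Usage
echo declareUtf8('<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />'), "\n";
```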

