PHP: Convert any string to UTF-8 without knowing t

Posted 2019-01-01 01:47

I have an application that deals with clients from all over the world, and, naturally, I want everything going into my databases to be UTF-8 encoded.

The main problem for me is that I don't know what encoding the source of any string is going to be. It could be from a text box (using <form accept-charset="utf-8"> is only useful if the user actually submitted the form), or it could be from an uploaded text file, so I really have no control over the input.

What I need is a function or class that makes sure the stuff going into my database is, as far as is possible, UTF-8 encoded. I've tried iconv(mb_detect_encoding($text), "UTF-8", $text); but that has problems (if the input is 'fiancée' it returns 'fianc'). I've tried a lot of things =/

For file uploads, I like the idea of asking the end user to specify the encoding they use, and show them previews of what the output will look like, but this doesn't help against nasty hackers (in fact, it could make their life a little easier).

I've read the other SO questions on the subject, but they seem to all have subtle differences like "I need to parse RSS feeds" or "I scrape data from websites" (or, indeed, "You can't").

But there must be something that at least has a good try!

10 answers
几人难应
#2 · 2019-01-01 02:14

There are some really good answers and attempts to answer your question here. I am not an encoding master, but I understand your desire to have a pure UTF-8 stack all the way through to your database. I have been using MySQL's utf8mb4 encoding for tables, fields, and connections.

My situation boiled down to "I just want my sanitizers, validators, business logic, and prepared statements to deal with UTF-8 when data comes from HTML forms, or e-mail registration links." So, in my simple way, I started off with this idea:

  1. Attempt to detect encoding: $encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];
  2. If encoding cannot be detected, throw new RuntimeException
  3. If input is UTF-8, carry on.
  4. Else, if it is ISO-8859-1 or ASCII

    a. Attempt conversion to UTF-8 (wait, not finished)

    b. Detect the encoding of the converted value

    c. If the reported encoding and converted value are both UTF-8, carry on.

    d. Else, throw new RuntimeException

From my abstract class Sanitizer

    // True only if detection reported UTF-8 and the value survives a
    // UTF-8 -> ISO-8859-1 -> UTF-8 round trip unchanged.
    private function isUTF8($encoding, $value)
    {
        return (($encoding === 'UTF-8') && (utf8_encode(utf8_decode($value)) === $value));
    }

    // Converts $value to UTF-8 in place, or throws if that is not possible.
    private function utf8tify(&$value)
    {
        // The only encodings this sanitizer accepts.
        $encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];

        mb_internal_encoding('UTF-8');
        mb_substitute_character(0xfffd); //REPLACEMENT CHARACTER
        mb_detect_order($encodings);

        // Strict mode: returns false if no listed encoding fits the bytes.
        $stringEncoding = mb_detect_encoding($value, $encodings, true);

        if (!$stringEncoding) {
            $value = null;
            throw new \RuntimeException("Unable to identify character encoding in sanitizer.");
        }

        // Already UTF-8: nothing to do.
        if ($this->isUTF8($stringEncoding, $value)) {
            return;
        }

        // Convert from the detected encoding, then verify the result.
        $value = mb_convert_encoding($value, 'UTF-8', $stringEncoding);
        $stringEncoding = mb_detect_encoding($value, $encodings, true);

        if (!$this->isUTF8($stringEncoding, $value)) {
            $value = null;
            throw new \RuntimeException("Unable to convert character encoding from ISO-8859-1, or ASCII, to UTF-8 in Sanitizer.");
        }
    }

One could make an argument that I should separate encoding concerns from my abstract Sanitizer class and simply inject an Encoder object into a concrete child instance of Sanitizer. However, the main problem with my approach is that, without more knowledge, I simply reject encoding types that I do not want (and I am relying on PHP mb_* functions). Without further study, I cannot know whether that hurts some populations (or whether I am losing important information). So, I need to learn more. I found this article:

What every programmer absolutely, positively needs to know about encodings and character sets to work with text

Moreover, what happens when encrypted data is added to my email registration links (using OpenSSL or mcrypt)? Could this interfere with decoding? What about Windows-1252? What about security implications? The use of utf8_decode() and utf8_encode() in Sanitizer::isUTF8 is dubious.
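
On that last point: the round trip through utf8_decode() only succeeds for text that fits in ISO-8859-1, so valid UTF-8 such as "€" fails it. One possible alternative is mb_check_encoding(), which validates the bytes directly. This is only a sketch; the function name isValidUtf8 is hypothetical and not part of the Sanitizer class.

```php
<?php
// Hypothetical helper, not part of the original Sanitizer class.
function isValidUtf8(string $value): bool
{
    // True when $value is a well-formed UTF-8 byte sequence.
    return mb_check_encoding($value, 'UTF-8');
}

var_dump(isValidUtf8("fianc\xC3\xA9e")); // bool(true): é encoded as UTF-8
var_dump(isValidUtf8("\xE2\x82\xAC"));   // bool(true): € is outside ISO-8859-1 but valid UTF-8
var_dump(isValidUtf8("fianc\xE9e"));     // bool(false): é as a single ISO-8859-1 byte
```

This only answers "is it well-formed UTF-8?"; it says nothing about what a non-UTF-8 string actually is.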

People have pointed out shortcomings in the PHP mb_* functions. I never took the time to investigate iconv, but if it works better than the mb_* functions, let me know.

余生无你
#3 · 2019-01-01 02:17
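public function convertToUtf8($text) {
    // Fetch the page once and cache it. cURL() is a custom wrapper (not shown).
    if(!$this->html)
        $this->html = cURL('http://'.$this->url, array('timeout' => 15));

    $html = $this->html;

    // Look for charset=... inside a <meta> tag, with or without quotes.
    if(!preg_match('/<meta[^>]*charset=["\']?([^"\'\s\/>]+)/i', $html, $matches))
        return $text; // no charset declared: leave the text untouched

    return mb_convert_encoding($text, 'UTF-8', $matches[1]);
}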

cURL default options:

curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

I tried something like this, and it helped me. If the page declares a charset in a meta tag, I convert from it; otherwise I do nothing.

刘海飞了
#4 · 2019-01-01 02:18

There is no way to identify the charset of a string that is completely accurate. There are ways to try to guess the charset. One of these ways, and probably/currently the best in PHP, is mb_detect_encoding(). This will scan your string and look for occurrences of stuff unique to certain charsets. Depending on your string, there may not be such distinguishable occurrences.

Take ISO-8859-1 versus ISO-8859-15, for example ( http://en.wikipedia.org/wiki/ISO/IEC_8859-15#Changes_from_ISO-8859-1 ).

There's only a handful of different characters, and to make it worse, they're represented by the same bytes. Given a string without knowing its encoding, there is no way to detect whether byte 0xA4 is supposed to signify ¤ or €, so there is no way to know its exact charset.
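
To make that concrete, here is a small sketch (assuming the mbstring extension is loaded) that decodes the same byte under both charsets:

```php
<?php
// The same byte decodes to different characters depending on the assumed charset.
$byte = "\xA4";

$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');  // U+00A4 CURRENCY SIGN
$asLatin9 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-15'); // U+20AC EURO SIGN

echo bin2hex($asLatin1), "\n"; // c2a4
echo bin2hex($asLatin9), "\n"; // e282ac
```

Both conversions succeed without any error, which is exactly why the byte alone cannot tell you which one was meant.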

(Note: you could add a human factor, or an even more advanced scanning technique (e.g. what Oroboros102 suggests), to try to figure out based upon the surrounding context, if the character should be ¤ or €, though this seems like a bridge too far)

There are more distinguishable differences between e.g. UTF-8 and ISO-8859-1, so it's still worth trying to figure it out when you're unsure, though you cannot, and should never, rely on the result being correct.

Interesting read: http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html#how-do-i-determine-the-charset-encoding-of-a-string

There are other ways of ensuring the correct charset though. Concerning forms, try to enforce UTF-8 as much as possible (check out snowman to make sure your submission will be UTF-8 in every browser: http://intertwingly.net/blog/2010/07/29/Rails-and-Snowmen ). That being done, at least you can be sure that every text submitted through your forms is UTF-8. Concerning uploaded files, try running the unix 'file -i' command on them through e.g. exec() (if possible on your server) to aid the detection (it uses, among other things, the document's BOM). Concerning scraping data, you could read the HTTP headers, which usually specify the charset. When parsing XML files, see if the XML meta-data contain a charset definition.
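
For the HTTP header case, a minimal sketch of pulling the charset out of a Content-Type value. The function name charsetFromContentType is hypothetical; the header string itself could come from, say, curl_getinfo($ch, CURLINFO_CONTENT_TYPE).

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type header value.
// Returns null when the header does not declare one.
function charsetFromContentType(string $contentType): ?string
{
    // Matches "; charset=utf-8" as well as "; charset="utf-8"".
    if (preg_match('/;\s*charset\s*=\s*"?([^\s;"]+)/i', $contentType, $m)) {
        return $m[1];
    }
    return null;
}

var_dump(charsetFromContentType('text/html; charset=ISO-8859-1')); // string(10) "ISO-8859-1"
var_dump(charsetFromContentType('text/html'));                     // NULL
```

A declared charset can still be wrong, so treat it as a strong hint rather than a guarantee.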

Rather than trying to automagically guess the charset, you should first try to ensure a certain charset yourself where possible, or try to grab a definition from the source you're getting it from (if applicable), before resorting to detection.
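
That order of preference could be sketched roughly as follows. The function name toUtf8 is hypothetical, and the Windows-1252 default for the last-resort step is purely an assumption about where legacy input tends to come from; pick whatever fits your users.

```php
<?php
// Hypothetical helper; $fallback is an assumption, not a detected value.
function toUtf8(string $bytes, ?string $declared = null, string $fallback = 'Windows-1252'): string
{
    // 1. Trust an explicit declaration (form, header, XML prolog) if mbstring knows the name.
    if ($declared !== null) {
        $known = array_map('strtolower', mb_list_encodings());
        if (in_array(strtolower($declared), $known, true)) {
            return mb_convert_encoding($bytes, 'UTF-8', $declared);
        }
    }

    // 2. Well-formed UTF-8 is kept as-is.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // 3. Last resort: assume the fallback charset.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

echo toUtf8("fianc\xE9e"), "\n";          // single-byte é, converted via the fallback
echo toUtf8("\xA4", 'ISO-8859-15'), "\n"; // declared charset wins, giving €
```

Step 1 only recognises the canonical names returned by mb_list_encodings(), so aliases such as "latin1" fall through to steps 2 and 3.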

呛了眼睛熬了心
#5 · 2019-01-01 02:22

What you're asking for is extremely hard. If possible, getting the user to specify the encoding is the best option. Preventing an attack shouldn't be much easier or harder that way.

However, you could try doing this:

iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);

Setting it to strict might help you get a better result.
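
Without strict mode, detection can name an encoding the bytes are not actually valid in, and iconv then fails at the first illegal byte, which is likely how 'fiancée' became 'fianc'. Even in strict mode, mb_detect_encoding() can return false, so it is worth guarding the call. A sketch, with the hypothetical name convertDetected:

```php
<?php
// Hypothetical wrapper around the one-liner above; returns null instead of a broken string.
function convertDetected(string $text, ?array $candidates = null): ?string
{
    // Default to the configured detection order, as in the one-liner.
    $candidates ??= mb_detect_order();

    // Strict mode: only encodings in which every byte of $text is valid are considered.
    $from = mb_detect_encoding($text, $candidates, true);
    if ($from === false) {
        return null; // nothing in the candidate list fits
    }

    // iconv may raise a notice and fail on bytes that are illegal in $from.
    $converted = @iconv($from, 'UTF-8', $text);
    return $converted === false ? null : $converted;
}

var_dump(convertDetected("fianc\xE9e", ['ASCII', 'UTF-8']));               // NULL
var_dump(convertDetected("fianc\xE9e", ['ASCII', 'UTF-8', 'ISO-8859-1'])); // "fiancée" in UTF-8
```

The result still depends entirely on which encodings are in the candidate list, and on their order.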

长期被迫恋爱
#6 · 2019-01-01 02:22

If you're willing to "take this to the console", I'd recommend enca. Unlike the rather simplistic mb_detect_encoding, it uses "a mixture of parsing, statistical analysis, guessing and black magic to determine their encodings" (lol - see man page). However, you usually have to pass the language of the input file if you want to detect such country-specific encodings. (However, mb_detect_encoding essentially has the same requirement, as the encoding would have to appear "in the right place" in the list of passed encodings for it to be detectable at all.)

enca also came up here: How to find encoding of a file in Unix via script(s)

情到深处是孤独
#7 · 2019-01-01 02:28

It seems that your question has been answered quite thoroughly, but I have an approach that may simplify your case:

I had a similar issue when trying to return string data from MySQL, even after configuring both the database and PHP to return UTF-8 strings. The error only appeared when the strings were actually returned from the database.

Finally, browsing the web, I found a really easy way to deal with it:

Given that you can save all those types of string data in MySQL in different formats and collations, all you need to do is set the connection character set to UTF-8, right in your PHP connection file, like this:

$connection = new mysqli($server, $user, $pass, $db);
$connection->set_charset("utf8");

Which means that you first save the data in any format or collation, and it is converted only when it is returned to your PHP file.

Hope it was helpful!
