Check if csv file is in UTF-8 with PHP

Posted 2019-03-05 18:15

Question:

Is there a way to check whether a CSV file is encoded as UTF-8 without a BOM? I want to check the whole file, not just a single string.

My idea is to put a special character in the first line, then read that line and check whether it matches the same string hard-coded in my script. But I don't know if this is a good idea.

Google only showed me this. But the link in the last post isn't available.

Answer 1:

if (mb_check_encoding(file_get_contents($file), 'UTF-8')) {
    // yup, all UTF-8
}

You can also go through it line by line with fgets, if the file is large and you don't want to store it all in memory at once. Not sure what you mean by the second part of your question.
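The line-by-line variant could look roughly like the sketch below. The function name isUtf8FileWithoutBom is made up for illustration, and the BOM test on the first line is an addition meant to cover the "without BOM" part of the question. Validating each line separately should be equivalent to validating the whole file, because the newline byte 0x0A never occurs inside a multi-byte UTF-8 sequence. It assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper (not from the original answer): returns true when the
// file at $path is valid UTF-8 and does not start with a byte order mark.
function isUtf8FileWithoutBom(string $path): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false; // assumption: an unreadable file counts as "not OK"
    }

    $ok = true;
    $firstLine = true;
    // fgets() without a length argument returns one whole line per call.
    while (($line = fgets($handle)) !== false) {
        // A UTF-8 BOM is the byte sequence EF BB BF at the very start.
        if ($firstLine && strncmp($line, "\xEF\xBB\xBF", 3) === 0) {
            $ok = false;
            break;
        }
        $firstLine = false;

        if (!mb_check_encoding($line, 'UTF-8')) {
            $ok = false;
            break;
        }
    }

    fclose($handle);
    return $ok;
}
```

Only one line is held in memory at a time, and the loop stops at the first invalid line, so a badly encoded file is usually rejected early.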



Answer 2:

I recommend this function (from the symfony toolkit):

<?php
  /**
   * Checks whether a string is UTF-8 (method excerpt; it belongs inside a class).
   *
   * Yi Stone Li<yili@yahoo-inc.com>
   * Copyright (c) 2007 Yahoo! Inc. All rights reserved.
   * Licensed under the BSD open source license
   *
   * @param string
   *
   * @return bool true if $string is valid UTF-8 and false otherwise.
   */
  public static function isUTF8($string)
  {
    for ($idx = 0, $strlen = strlen($string); $idx < $strlen; $idx++)
    {
      $byte = ord($string[$idx]);

      if ($byte & 0x80)
      {
        if (($byte & 0xE0) == 0xC0)
        {
          // 2 byte char
          $bytes_remaining = 1;
        }
        else if (($byte & 0xF0) == 0xE0)
        {
          // 3 byte char
          $bytes_remaining = 2;
        }
        else if (($byte & 0xF8) == 0xF0)
        {
          // 4 byte char
          $bytes_remaining = 3;
        }
        else
        {
          return false;
        }

        if ($idx + $bytes_remaining >= $strlen)
        {
          return false;
        }

        while ($bytes_remaining--)
        {
          if ((ord($string[++$idx]) & 0xC0) != 0x80)
          {
            return false;
          }
        }
      }
    }

    return true;
  }
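One thing to be aware of: the function above only checks the lead-byte and continuation-byte pattern, so an overlong sequence such as "\xC0\x80" passes even though it is not valid UTF-8. If that matters, a stricter check can lean on PCRE, since preg_match() with the /u modifier fails on an invalid UTF-8 subject. The wrapper name isStrictUtf8 below is hypothetical and only a sketch.

```php
<?php
// Hypothetical wrapper (not part of the symfony code above).
// With the /u modifier PCRE validates the subject as UTF-8 first;
// preg_match() returns false on invalid input and 1 when the empty
// pattern matches a valid string.
function isStrictUtf8(string $string): bool
{
    return preg_match('//u', $string) === 1;
}

var_dump(isStrictUtf8("Jos\xC3\xA9")); // valid two-byte sequence
var_dump(isStrictUtf8("\xC0\x80"));    // overlong encoding of U+0000
```

The first call should report true and the second false.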

But as it checks every byte of the string, I don't recommend using it on a large file. Just check the first 10 lines, e.g.:

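<?php
$handle = fopen("mycsv.csv", "r");
$check_string = "";
$line = 1;
if ($handle) {
    // Read at most the first 10 lines. No length limit is passed to fgets(),
    // so a long line is never cut in the middle of a multi-byte character.
    while ($line < 11 && ($buffer = fgets($handle)) !== false) {
        $check_string .= $buffer;
        $line++;
    }
    // Only an fgets() failure before line 10 and before EOF is an error;
    // stopping early on a longer file is expected.
    if ($line < 11 && !feof($handle)) {
        echo "Error: unexpected fgets() fail\n";
    }
    fclose($handle);

    // isUTF8() is the static method shown above, so this call has to run
    // inside the class that defines it.
    var_dump( self::isUTF8($check_string) );
}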


Tags: php csv utf-8