How can you parse excel CSV data that contains lin

Posted 2020-03-02 07:09

Question:

I'm attempting to parse a set of CSV data using PHP, but I'm having a major issue. One of the fields is a long description field, which itself contains linebreaks within the enclosures.

My primary issue is writing a piece of code that can split the data line by line, but also recognize when linebreaks within the data should not be used. The linebreaks within this field are not properly escaped, making them hard to distinguish from legitimate linebreaks.

I've tried to come up with a regular expression that can properly handle it, but have had no luck so far. Any ideas?

CSV format:

"####","text data here", "text data \n with linebreaks \n here"\n
"####","more text data", "more data \n with \n linebreaks \n here"\n

Answer 1:

According to aleske, a commenter in the documentation for PHP's fgetcsv function:

The PHP's CSV handling stuff is non-standard and contradicts with RFC4180, thus fgetcsv() cannot properly deal with files [that contain line breaks] ...

And he offered up the following function to get around this limitation:

function csvstring_to_array(&$string, $CSV_SEPARATOR = ';', $CSV_ENCLOSURE = '"', $CSV_LINEBREAK = "\n") {
    $o = array();

    $cnt = strlen($string);
    $esc = false;
    $escesc = false;
    $num = 0;
    $i = 0;
    while ($i < $cnt) {
        $s = $string[$i];

        if ($s == $CSV_LINEBREAK) {
            if ($esc) {
                $o[$num] .= $s;
            } else {
                $i++;
                break;
            }
        } elseif ($s == $CSV_SEPARATOR) {
            if ($esc) {
                $o[$num] .= $s;
            } else {
                $num++;
                $esc = false;
                $escesc = false;
            }
        } elseif ($s == $CSV_ENCLOSURE) {
            if ($escesc) {
                $o[$num] .= $CSV_ENCLOSURE;
                $escesc = false;
            }

            if ($esc) {
                $esc = false;
                $escesc = true;
            } else {
                $esc = true;
                $escesc = false;
            }
        } else {
            if ($escesc) {
                $o[$num] .= $CSV_ENCLOSURE;
                $escesc = false;
            }

            $o[$num] .= $s;
        }

        $i++;
    }

    //  $string = substr($string, $i);

    return $o;
}

That looks like it will do the trick.



Answer 2:

I found that you can use a normal CSV parser after you convert the CSV to unix format.

Here is a function that did the trick for me.

function dos2unix($s) {
    $s = str_replace("\r\n", "\n", $s);
    $s = str_replace("\r", "\n", $s);
    $s = preg_replace("/\n{2,}/", "\n\n", $s);
    return $s;
}

And a parsing function:

function csvstring_to_array($string, $separatorChar = ',', $enclosureChar = '"', $newlineChar = "\n") {
    // @author: Klemen Nagode
    $string = dos2unix($string);
    $array = array();
    $size = strlen($string);
    $columnIndex = 0;
    $rowIndex = 0;
    $fieldValue="";
    $isEnclosured = false;
    for($i=0; $i<$size;$i++) {

        $char = $string[$i];
        $addChar = "";

        if($isEnclosured) {
            if($char==$enclosureChar) {

                if($i+1<$size && $string[$i+1]==$enclosureChar){
                    // escaped char
                    $addChar=$char;
                    $i++; // dont check next char
                }else{
                    $isEnclosured = false;
                }
            }else {
                $addChar=$char;
            }
        }else {
            if($char==$enclosureChar) {
                $isEnclosured = true;
            }else {

                if($char==$separatorChar) {

                    $array[$rowIndex][$columnIndex] = $fieldValue;
                    $fieldValue="";

                    $columnIndex++;
                }elseif($char==$newlineChar) {
                    $array[$rowIndex][$columnIndex] = $fieldValue;
                    $fieldValue="";
                    $columnIndex=0;
                    $rowIndex++;
                }else {
                    $addChar=$char;
                }
            }
        }
        if($addChar!=""){
            $fieldValue.=$addChar;

        }
    }

    if($fieldValue !== "" || $columnIndex > 0) { // save last field, including "0" or an empty trailing field
        $array[$rowIndex][$columnIndex] = $fieldValue;
    }
    return $array;
}


Answer 3:

The problem is that the "\n" escape sequence doesn't evaluate to the same character that Excel uses for its row delimiter. The character Excel uses here is ASCII 13 (carriage return, "\r"). The following code parses a .csv file that is read in with file_get_contents(). Note that it splits on every carriage return and every comma, so it only suits data whose fields contain neither.

<?php

//variable to store filename of file
$filename = $_SERVER['DOCUMENT_ROOT'] . "/site/docs/boothmap.csv";

//read file in as string
$file = file_get_contents($filename);

//convert csv to array
//first to single dimensional array
$array1D = explode(chr(13),$file);

//create new array to hold 2d array
$array2D = array();

//iterate through 1 dimensional array and explode each value to the new array
foreach ($array1D as $row)
{
    $array2D[] = explode(',', $row);
}

//pop off empty last row of array2D
array_pop($array2D);

//iterate through $array2D building table of data
//start table with column headers
echo "<table border=\"1\">\n<tr>\n<th>Company</th>\n<th>Booth #</th>\n<th>Location</th>\n</tr>\n";

foreach ($array2D as $row)
{
    echo "<tr>\n";
    foreach ($row as $cell)
    {
        // escape cell text so CSV content cannot inject markup
        echo "<td>" . htmlspecialchars($cell) . "</td>\n";
    }
    echo "</tr>\n";
}

//close table
echo "</table>";
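
As a hedged alternative for files whose quoted fields contain commas or line breaks, here is a minimal sketch that normalizes the row delimiters first and then lets fgetcsv() do the field splitting. The sample string and the variable names are made up for illustration, and it assumes that bare carriage returns can safely be turned into "\n" everywhere, including inside fields.

```php
<?php
// Hypothetical sample: CR-only row delimiters, with a comma and a
// line break inside quoted fields.
$file = "Company,Booth #,Location\r\"Acme, Inc.\",12,\"Hall A\rnear the entrance\"\r";

// Normalize CRLF and bare CR to LF so the CSV reader sees ordinary line endings.
$normalized = str_replace(["\r\n", "\r"], "\n", $file);

// fgetcsv() reads from a stream, so copy the string into an in-memory one.
$handle = fopen('php://memory', 'r+');
fwrite($handle, $normalized);
rewind($handle);

$rows = [];
// Separator, enclosure and escape character are passed explicitly.
while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
    if ($row === [null]) {
        continue; // fgetcsv() reports a blank line as array(null)
    }
    $rows[] = $row;
}
fclose($handle);

print_r($rows);
```

The quoted "Acme, Inc." stays one field, and the line break inside the last field survives as "\n" instead of starting a new row.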


Answer 4:

I ended up being able to modify a regular expression with certain special flags to work for my needs. I used the following function call:

preg_match_all('/"\d+",".*",".*"\n/sU', $csv_data, $matches);

This seems to work for a few reasons:

1) The 's' flag lets the dot match newlines, which it normally doesn't. The unfortunate side effect is that legitimate newline characters are also matched by the dot, which could theoretically match the entire CSV as one result, so

2) I added the 'U' flag. This makes quantifiers ungreedy by default, so each match now covers only one record.
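
To see the two flags working together, here is a small sketch on made-up data. The sample records and the three capture groups are my additions for illustration, not part of the original call, and the pattern still assumes that no field contains the sequence `","` or a quote directly followed by a newline.

```php
<?php
// Hypothetical sample data: two records, each with line breaks inside the third field.
$csv_data = "\"1001\",\"text data here\",\"text data\nwith linebreaks\nhere\"\n"
          . "\"1002\",\"more text data\",\"more data\nwith\nlinebreaks\nhere\"\n";

// s: the dot also matches newlines; U: quantifiers are ungreedy by default.
// PREG_SET_ORDER groups the results per record instead of per capture group.
preg_match_all('/"(\d+)","(.*)","(.*)"\n/sU', $csv_data, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the ID, $m[2] the short text, $m[3] the multi-line description.
    printf("%s: %d line(s) in description\n", $m[1], substr_count($m[3], "\n") + 1);
}
```

Without the 'U' flag the first `.*` would run to the last `","` in the whole input, which is the "entire CSV as one result" problem described in point 1.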



Answer 5:

It's an old thread, but I encountered this problem and solved it with a regex, so you can avoid a library just for that. The code here is in PHP, but it can be adapted to other languages.

$parsedCSV = preg_replace('/(,|\n|^)"(?:([^\n"]*)\n([^\n"]*))*"/', '$1"$2 $3"', $parsedCSV);

It might not be efficient if the content is too large, but it can help in many cases, and the idea can be reused, perhaps optimized by processing smaller chunks (though you'd then need to handle fields cut off at the boundaries of fixed-size buffers). This solution assumes that fields containing a linebreak are enclosed in double quotes, which seems to be a valid assumption, at least from what I have seen so far. Also, the opening double quote should follow a comma or be placed at the start of a line.

Example:

field1,"field2-part1\nfield2-part2",field3

Here the \n is replaced by a whitespace so the result would be:

field1,"field2-part1 field2-part2",field3

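Note that the regex does not handle multiple linebreaks in one field correctly: the capture groups sit inside a repeated group, so only the last repetition survives and earlier text in the field is dropped. Doing the replacement within each quoted field through preg_replace_callback avoids that.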
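
For fields with more than one linebreak, a callback-based variant of the same idea might look like the sketch below. This is a swapped-in technique rather than the regex above: it matches each double-quoted field as a whole and replaces the linebreaks inside the match. The input string and the variable name $rawCSV are made up, and it assumes quotes inside a field are escaped by doubling them.

```php
<?php
// Hypothetical input: one record whose second field contains two line breaks.
$rawCSV = "field1,\"field2-part1\nfield2-part2\nfield2-part3\",field3\n";

// Match each double-quoted field ("" counts as an escaped quote inside it)
// and replace every line break within the match by a single space.
$parsedCSV = preg_replace_callback(
    '/"[^"]*(?:""[^"]*)*"/',
    function (array $m) {
        return str_replace(["\r\n", "\r", "\n"], ' ', $m[0]);
    },
    $rawCSV
);

echo $parsedCSV;
```

Linebreaks outside quotes are left alone, so the result can then be split into rows in the usual way.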



Answer 6:

I created this PHP function to parse a CSV into a 2D array. It can handle data that contains commas, quotes or line breaks. This runs faster than some other working solutions.

/**
 * copyright 2018 Frank Forte
 * Free for personal, non-commercial use
 * contact me for inexpensive licenses to use and create derivative works
 */
protected static function parse_csv_forte (&$str, $delimiter = ",", $enclosure = '"', $escape = '"', $skip_empty_lines = true, $trim_fields = false)
{
    // use linux line endings
    $str = str_replace("\r\n","\n",$str);
    $str = str_replace("\r","\n",$str);

    // substitute line endings that are part of data
    $num = strlen($str);
    $quoted = false;
    $last = null;
    $escape = false;
    for($i = 0; $i < $num; $i++)
    {
        if($str[$i] == $enclosure)
        {
            if($last == $enclosure)
            {
                $escape = !$escape;
                if($escape)
                {
                    $quoted = !$quoted;
                }
            }
            else
            {
                if(!$escape)
                {
                    $quoted = !$quoted;
                }
            }
        }
        if($str[$i] != $enclosure || $escape)
        {
            $escape = false;
        }
        if($quoted && $str[$i] == "\n")
        {
            $str[$i] = "\r";
        }
        $last = $str[$i];
    }

    if($skip_empty_lines)
    {
        $str = preg_replace("/\n+/","\n",$str);
        $str = trim($str,"\n");
    }

    $str = explode("\n",$str);

    $csv = [];
    foreach($str as $e)
    {
        $e = str_getcsv($e, $delimiter, $enclosure, $escape);
        foreach($e as $k => $f)
        {
            $e[$k] = str_replace("\r","\n",$f);
            if($trim_fields)
            {
                $e[$k] = trim($e[$k]);
            }
        }
        $csv[] = $e;
    }

    return $csv;
}

Use it like this:

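$data = file_get_contents('file.csv'); // $str is taken by reference, so pass a variable
$csv = self::parse_csv_forte($data);   // called from inside the declaring class, since the method is protected static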


Answer 7:

This will work: https://github.com/synappnz/php-csv

include "csv.php";
$csv = new csv(file_get_contents("filename.csv"));
$rows = $csv->rows();
foreach ($rows as $row)
{
  // do something with $row
}


Answer 8:

This is a fix for @Stephen's answer. It handles multiple lines and preserves empty cells in the array:

function csvstring_to_array(&$string, $CSV_SEPARATOR = ';', $CSV_ENCLOSURE = '"', $CSV_LINEBREAK = "\n") {
    $o = array();

    $cnt = strlen($string);
    $esc = false;
    $escesc = false;
    $num = 0;
    $i = 0;
    $line = 0;
    while ($i < $cnt) {
        $s = $string[$i];

        if ($s == $CSV_LINEBREAK) {
            if ($esc) {
                $o[$line][$num] .= $s;
            } else {
                $i++;
                $line++;
                $num = 0;
                continue;
            }
        } elseif ($s == $CSV_SEPARATOR) {
            if ($esc) {
                $o[$line][$num] .= $s;
            } else {
                $num++;

                $o[$line][$num] .= '';
                $esc = false;
                $escesc = false;
            }
        } elseif ($s == $CSV_ENCLOSURE) {
            if ($escesc) {
                $o[$line][$num] .= $CSV_ENCLOSURE;
                $escesc = false;
            }

            if ($esc) {
                $esc = false;
                $escesc = true;
            } else {
                $esc = true;
                $escesc = false;
            }
        } else {
            if ($escesc) {
                $o[$line][$num] .= $CSV_ENCLOSURE;
                $escesc = false;
            }

            $o[$line][$num] .= $s;
        }

        $i++;
    }

    return $o;
}


Answer 9:

You can use fgetcsv or str_getcsv to parse a CSV. See the examples in the PHP documentation.
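
As a quick illustration, here is a minimal sketch with str_getcsv on a single record whose last field spans several lines. The record is made up, and the separator, enclosure and escape arguments are passed explicitly rather than relying on defaults.

```php
<?php
// Hypothetical record: the third field spans three lines inside its quotes.
$record = "\"1001\",\"text data here\",\"text data\nwith linebreaks\nhere\"";

// str_getcsv() parses one record; line breaks inside the enclosure
// stay part of the field value.
$fields = str_getcsv($record, ',', '"', '\\');

var_dump($fields);
```

str_getcsv works on one record at a time, so for a whole file you would typically loop over fgetcsv on an open file handle instead.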