How to split csv whose columns may contain ,

2018-12-31 09:11发布

Given

2,1016,7/31/2008 14:22,Geoff Dalgas,6/5/2011 22:21,http://stackoverflow.com,"Corvallis, OR",7679,351,81,b437f461b3fd27387c5d8ab47a293d35,34

How to use C# to split the above information into strings as follows:

2
1016
7/31/2008 14:22
Geoff Dalgas
6/5/2011 22:21
http://stackoverflow.com
Corvallis, OR
7679
351
81
b437f461b3fd27387c5d8ab47a293d35
34

As you can see one of the column contains , <= (Corvallis, OR)

// update // Based on C# Regex Split - commas outside quotes

string[] result = Regex.Split(samplestring, ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

标签: c# .net csv
8条回答
查无此人
2楼-- · 2018-12-31 09:45

It is a tricky matter to parse .csv files when the .csv file could be either comma separated strings, comma separated quoted strings, or a chaotic combination of the two. The solution I came up with allows for any of the three possibilities.

I created a method, ParseCsvRow() which returns an array from a csv string. I first deal with double quotes in the string by splitting the string on double quotes into an array called quotesArray. Quoted string .csv files are only valid if there is an even number of double quotes. Double quotes in a column value should be replaced with a pair of double quotes (This is Excel's approach). As long as the .csv file meets these requirements, you can expect the delimiter commas to appear only outside of pairs of double quotes. Commas inside of pairs of double quotes are part of the column value and should be ignored when splitting the .csv into an array.

My method will test for commas outside of double quote pairs by looking only at even indexes of the quotesArray. It also removes double quotes from the start and end of column values.

    public static string[] ParseCsvRow(string csvrow)
    {
        const string obscureCharacter = "ᖳ";
        if (csvrow.Contains(obscureCharacter)) throw new Exception("Error: csv row may not contain the " + obscureCharacter + " character");

        var unicodeSeparatedString = "";

        var quotesArray = csvrow.Split('"');  // Split string on double quote character
        if (quotesArray.Length > 1)
        {
            for (var i = 0; i < quotesArray.Length; i++)
            {
                // CSV must use double quotes to represent a quote inside a quoted cell
                // Quotes must be paired up
                // Test if a comma lays outside a pair of quotes.  If so, replace the comma with an obscure unicode character
                if (Math.Round(Math.Round((decimal) i/2)*2) == i)
                {
                    var s = quotesArray[i].Trim();
                    switch (s)
                    {
                        case ",":
                            quotesArray[i] = obscureCharacter;  // Change quoted comma seperated string to quoted "obscure character" seperated string
                            break;
                    }
                }
                // Build string and Replace quotes where quotes were expected.
                unicodeSeparatedString += (i > 0 ? "\"" : "") + quotesArray[i].Trim();
            }
        }
        else
        {
            // String does not have any pairs of double quotes.  It should be safe to just replace the commas with the obscure character
            unicodeSeparatedString = csvrow.Replace(",", obscureCharacter);
        }

        var csvRowArray = unicodeSeparatedString.Split(obscureCharacter[0]); 

        for (var i = 0; i < csvRowArray.Length; i++)
        {
            var s = csvRowArray[i].Trim();
            if (s.StartsWith("\"") && s.EndsWith("\""))
            {
                csvRowArray[i] = s.Length > 2 ? s.Substring(1, s.Length - 2) : "";  // Remove start and end quotes.
            }
        }

        return csvRowArray;
    }

One downside of my approach is the way I temporarily replace delimiter commas with an obscure unicode character. This character needs to be so obscure, it would never show up in your .csv file. You may want to put more handling around this.

查看更多
伤终究还是伤i
3楼-- · 2018-12-31 09:49

Use the Microsoft.VisualBasic.FileIO.TextFieldParser class. This will handle parsing a delimited file, TextReader or Stream where some fields are enclosed in quotes and some are not.

For example:

using Microsoft.VisualBasic.FileIO;

string csv = "2,1016,7/31/2008 14:22,Geoff Dalgas,6/5/2011 22:21,http://stackoverflow.com,\"Corvallis, OR\",7679,351,81,b437f461b3fd27387c5d8ab47a293d35,34";

TextFieldParser parser = new TextFieldParser(new StringReader(csv));

// You can also read from a file
// TextFieldParser parser = new TextFieldParser("mycsvfile.csv");

parser.HasFieldsEnclosedInQuotes = true;
parser.SetDelimiters(",");

string[] fields;

while (!parser.EndOfData)
{
    fields = parser.ReadFields();
    foreach (string field in fields)
    {
        Console.WriteLine(field);
    }
} 

parser.Close();

This should result in the following output:

2
1016
7/31/2008 14:22
Geoff Dalgas
6/5/2011 22:21
http://stackoverflow.com
Corvallis, OR
7679
351
81
b437f461b3fd27387c5d8ab47a293d35
34

See Microsoft.VisualBasic.FileIO.TextFieldParser for more information.

You need to add a reference to Microsoft.VisualBasic in the Add References .NET tab.

查看更多
情到深处是孤独
4楼-- · 2018-12-31 09:49

Use a library like LumenWorks to do your CSV reading. It'll handle fields with quotes in them and will likely overall be more robust than your custom solution by virtue of having been around for a long time.

查看更多
忆尘夕之涩
5楼-- · 2018-12-31 09:52

You could split on all commas that do have an even number of quotes following them.

You would also like to view at the specf for CSV format about handling comma's.

Useful Link : C# Regex Split - commas outside quotes

查看更多
无色无味的生活
6楼-- · 2018-12-31 09:52

I had a problem with a CSV that contains fields with a quote character in them, so using the TextFieldParser, I came up with the following:

private static string[] parseCSVLine(string csvLine)
{
  using (TextFieldParser TFP = new TextFieldParser(new MemoryStream(Encoding.UTF8.GetBytes(csvLine))))
  {
    TFP.HasFieldsEnclosedInQuotes = true;
    TFP.SetDelimiters(",");

    try 
    {           
      return TFP.ReadFields();
    }
    catch (MalformedLineException)
    {
      StringBuilder m_sbLine = new StringBuilder();

      for (int i = 0; i < TFP.ErrorLine.Length; i++)
      {
        if (i > 0 && TFP.ErrorLine[i]== '"' &&(TFP.ErrorLine[i + 1] != ',' && TFP.ErrorLine[i - 1] != ','))
          m_sbLine.Append("\"\"");
        else
          m_sbLine.Append(TFP.ErrorLine[i]);
      }

      return parseCSVLine(m_sbLine.ToString());
    }
  }
}

A StreamReader is still used to read the CSV line by line, as follows:

using(StreamReader SR = new StreamReader(FileName))
{
  while (SR.Peek() >-1)
    myStringArray = parseCSVLine(SR.ReadLine());
}
查看更多
柔情千种
7楼-- · 2018-12-31 10:03

With Cinchoo ETL - an open source library, it can automatically handles columns values containing separators.

string csv = @"2,1016,7/31/2008 14:22,Geoff Dalgas,6/5/2011 22:21,http://stackoverflow.com,""Corvallis, OR"",7679,351,81,b437f461b3fd27387c5d8ab47a293d35,34";

using (var p = ChoCSVReader.LoadText(csv)
    )
{
    Console.WriteLine(p.Dump());
}

Output:

Key: Column1 [Type: String]
Value: 2
Key: Column2 [Type: String]
Value: 1016
Key: Column3 [Type: String]
Value: 7/31/2008 14:22
Key: Column4 [Type: String]
Value: Geoff Dalgas
Key: Column5 [Type: String]
Value: 6/5/2011 22:21
Key: Column6 [Type: String]
Value: http://stackoverflow.com
Key: Column7 [Type: String]
Value: Corvallis, OR
Key: Column8 [Type: String]
Value: 7679
Key: Column9 [Type: String]
Value: 351
Key: Column10 [Type: String]
Value: 81
Key: Column11 [Type: String]
Value: b437f461b3fd27387c5d8ab47a293d35
Key: Column12 [Type: String]
Value: 34

For more information, please visit codeproject article.

Hope it helps.

查看更多
登录 后发表回答