Dealing with commas in a CSV file

Posted 2018-12-31 01:21

I am looking for suggestions on how to handle a csv file that is being created, then uploaded by our customers, and that may have a comma in a value, like a company name.

Some of the ideas we are looking at are: quoted Identifiers (value "," values ","etc) or using a | instead of a comma. The biggest problem is that we have to make it easy, or the customer won't do it.

Tags: csv
23 answers
倾城一夜雪
#2 · 2018-12-31 02:02

In Europe we ran into this problem much earlier than this question. Here we use a comma as the decimal separator. See the numbers below:

| American      | Europe        |
| ------------- | ------------- |
| 0.5           | 0,5           |
| 3.14159265359 | 3,14159265359 |
| 17.54         | 17,54         |
| 175,186.15    | 175.186,15    |

So a comma can't safely be used as the separator for these CSV files. For that reason, CSV files in Europe are usually separated by a semicolon (;).

Programs like Microsoft Excel can read semicolon-separated files, and it's possible to change the separator. You could even use a tab (\t) as the separator. See this answer from Super User.
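
As a rough illustration of the semicolon approach, here is a minimal Go sketch using the standard encoding/csv package, whose Reader has a Comma field for the delimiter. The helper name parseDelimited and the sample data are made up for this example.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseDelimited (hypothetical helper) reads all records from text,
// splitting fields on delim instead of the default comma.
func parseDelimited(text string, delim rune) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(text))
	r.Comma = delim // e.g. ';' or '\t'
	return r.ReadAll()
}

func main() {
	// Made-up sample: decimal commas survive because ';' separates the fields.
	input := "name;price\nWidget;17,54\nGadget;0,5\n"
	records, err := parseDelimited(input, ';')
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, rec := range records {
		fmt.Println(rec)
	}
}
```

Passing '\t' as the delimiter would handle tab-separated files the same way.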

君临天下
#3 · 2018-12-31 02:03

As this is about general practices, let's start with some rules of thumb:

  1. Don't use CSV; use XML with a library to read and write the XML file instead.

  2. If you must use CSV, do it properly and use a free library to parse and store the CSV files.

To justify 1): most CSV parsers aren't encoding-aware, so if you aren't dealing with US-ASCII you are asking for trouble. For example, Excel 2002 stores CSV in the local encoding without any note about the encoding. The CSV standard isn't widely adopted :(. On the other hand, the XML standard is well adopted and handles encodings pretty well.

To justify 2): there are tons of CSV parsers around for almost every language, so there is no need to reinvent the wheel, even if the solution looks pretty simple.

To name a few:

  • for Python, use the built-in csv module

  • for Perl, check CPAN and Text::CSV

  • for PHP, use the built-in fgetcsv/fputcsv functions

  • for Java, check the Super CSV library

Really, there is no need to implement this by hand unless you are going to parse it on an embedded device.
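
To show what "use a library" buys you, here is a small Go sketch (Go is not on the list above, but its standard encoding/csv package plays the same role). The writer quotes a field only when it needs to, so a company name containing a comma round-trips without any manual handling. The helper name encodeRecords and the sample rows are invented for illustration.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// encodeRecords (hypothetical helper) serializes rows as CSV text.
// csv.Writer adds quotes around fields that contain commas, quotes or newlines.
func encodeRecords(rows [][]string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.WriteAll(rows); err != nil { // WriteAll also flushes the writer
		return "", err
	}
	return buf.String(), nil
}

func main() {
	rows := [][]string{
		{"company", "city"},
		{"Acme, Inc.", "Springfield"}, // made-up customer data with a comma
	}
	out, err := encodeRecords(rows)
	if err != nil {
		fmt.Println("write error:", err)
		return
	}
	fmt.Print(out)
}
```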

泪湿衣
#4 · 2018-12-31 02:05

First, let's ask ourselves, "Why do we feel the need to handle commas differently for CSV files?"

For me, the answer is, "Because when I export data into a CSV file, the commas in a field disappear and my field gets separated into multiple fields where the commas appear in the original data." (That is because the comma is the CSV field separator character.)

Depending on your situation, semicolons may also be used as CSV field separators.

Given my requirements, I can use a character, e.g., single low-9 quotation mark, that looks like a comma.

So, here's how you can do it in Go:

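import (
    "fmt"
    "strings"
)

// Scrub replaces the CSV separator characters (comma and semicolon)
// with a single low-9 quotation mark (U+201A), which looks like a comma.
func Scrub(a interface{}) string {
    s := fmt.Sprint(a)
    s = strings.Replace(s, ",", "‚", -1)
    s = strings.Replace(s, ";", "‚", -1)
    return s
}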

The second, comma-looking character in each Replace call is code point 8218 decimal (U+201A).

Be aware that if you have clients with ASCII-only text readers, this decimal 8218 character will not look like a comma. If that is your case, then I'd recommend surrounding the field containing the comma (or semicolon) with double quotes per RFC 4180: https://tools.ietf.org/html/rfc4180

不再属于我。
#5 · 2018-12-31 02:07

There is a library available through NuGet for dealing with pretty much any well-formed CSV in .NET: CsvHelper.

Example to map to a class:

var csv = new CsvReader( textReader );
var records = csv.GetRecords<MyClass>();

Example to read individual fields:

var csv = new CsvReader( textReader );
while( csv.Read() )
{
    var intField = csv.GetField<int>( 0 );
    var stringField = csv.GetField<string>( 1 );
    var boolField = csv.GetField<bool>( "HeaderName" );
}

Letting the client drive the file format:
The standard field delimiter is , and the standard character used to quote fields that contain a delimiter, quote, or line ending is ".

To use (for example) # for fields and ' for escaping:

var csv = new CsvReader( textReader );
csv.Configuration.Delimiter = "#";
csv.Configuration.Quote = '\'';
// read the file however meets your needs

More Documentation

怪性笑人.
#6 · 2018-12-31 02:08

As of 2017, CSV is fully specified: RFC 4180.

It is a very common specification, and is completely covered by many libraries (example).

Simply use any easily available CSV library that implements RFC 4180.


There's actually a spec for CSV format and how to handle commas:

Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

http://tools.ietf.org/html/rfc4180

So, to have values foo and bar,baz, you do this:

foo,"bar,baz"

Another important requirement to consider (also from the spec):

If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:

"aaa","b""bb","ccc"
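
For illustration only, the two rules quoted above can be sketched by hand in Go as below. The function names quoteField and joinRecord are made up, and in practice a CSV library already does this for you. Note that this sketch quotes a field only when it has to, whereas the spec's last example quotes every field; both forms are valid.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteField (hypothetical helper) encloses a field in double quotes if it
// contains a comma, a double quote, or a line break, and doubles any
// double quote inside it.
func quoteField(s string) string {
	if !strings.ContainsAny(s, ",\"\r\n") {
		return s // nothing special, leave the field bare
	}
	return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
}

// joinRecord (hypothetical helper) builds one CSV line from its fields.
func joinRecord(fields []string) string {
	quoted := make([]string, len(fields))
	for i, f := range fields {
		quoted[i] = quoteField(f)
	}
	return strings.Join(quoted, ",")
}

func main() {
	fmt.Println(joinRecord([]string{"foo", "bar,baz"}))        // foo,"bar,baz"
	fmt.Println(joinRecord([]string{"aaa", `b"bb`, "ccc"})) // aaa,"b""bb",ccc
}
```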
梦该遗忘
#7 · 2018-12-31 02:11

As mentioned in my comment on harpo's answer, his solution is good and works in most cases; however, in some scenarios, when commas are directly adjacent to each other, it fails to split on the commas.

This is because the regex string behaves unexpectedly as a verbatim string. To get it to behave correctly, all " characters in the regex string need to be escaped manually rather than with the verbatim escape.

I.e., the regex should be this, using manual escapes:

",(?=(?:[^\"\"]*\"\"[^\"\"]*\"\")*(?![^\"\"]*\"\"))"

which translates into ",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))"

When using a verbatim string @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" it behaves as follows, as you can see if you debug the regex:

",(?=(?:[^"]*"[^"]*")*(?![^"]*"))"

So in summary, I recommend harpo's solution, but watch out for this little gotcha!

I've added a small optional failsafe to the CsvReader to notify you if this error occurs (if you know the number of columns in advance):

if (_expectedDataLength > 0 && values.Length != _expectedDataLength)
    throw new DataLengthException(string.Format("Expected {0} columns when splitting csv, got {1}", _expectedDataLength, values.Length));

This can be injected via the constructor:

public CsvReader(string fileName, int expectedDataLength = 0) : this(new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
    _expectedDataLength = expectedDataLength;
}