Split using delimiter except when delimiter is esc

2019-01-19 21:08发布

问题:

I'm reading clipboard data coming from excel using

var stream = (System.IO.Stream) ( Forms.Clipboard.GetDataObject() ).GetData( Forms.DataFormats.CommaSeparatedValue );,

but unfortunately, excel is passing cell text instead of cell values. When the cells are using special formatting (such as the thousands seperator), the clipboard data for a series of cells in columns that looks like this:

 1,234,123.00    2,345.00    342.00      12,345.00

is stored as this:

\" 1,234,123.00 \",\" 2,345.00 \", 342.00 ,\" 12,345.00 \"

when what I really want is this:

 1234123.00, 2345.00, 342.00, 12345.00

I had been previously using the clipData.Split(new string[] { "," }, StringSllitOptions.None)) function to turn my CSV clipboard data into a series of cells, but this fails when there is escaped formatted text containing commas.


I'm asking if anyone can think of a way to split this string into a set of cells, ignoring the commas escaped within the \" bits, since this is how Excel is choosing to escape cells containing commas.

In short, how can I turn a single string containing this:

\" 1,234,123.00 \",\" 2,345.00 \", 342.00 ,\" 12,345.00 \"

into an array of strings containing this:

{ "1,234,123.00", "2,345.00", "342.00", "12,345.00" }

Without ruining my ability to parse a simple comma delimited string.

*****edit***

Follow up question (formulated as a DFA) here: Split a string based on each time a Deterministic Finite Automata reaches a final state?

回答1:

First off I've dealt with data from Excel before and what you typically see is comma separated values and if the value is considered to be a string it will have double quotes around it (and can contain commas and double quotes). If it is considered to be numeric then there are not double quotes. Additionally if the data contains a double quote that will be delimited by a double quote like "". So assuming all of that here's how I've dealt with this in the past

public static IEnumerable<string> SplitExcelRow(this string value)
{
    value = value.Replace("\"\"", "&quot;");
    bool quoted = false;
    int currStartIndex = 0;
    for (int i = 0; i < value.Length; i++)
    {
        char currChar = value[i];
        if (currChar == '"')
        {
            quoted = !quoted;       
        }
        else if (currChar == ',')
        {
            if (!quoted)
            {
                yield return value.Substring(currStartIndex, i - currStartIndex)
                    .Trim()
                    .Replace("\"","")
                    .Replace("&quot;","\"");
                currStartIndex = i + 1;
            }
        }
    }
    yield return value.Substring(currStartIndex, value.Length - currStartIndex)
        .Trim()
        .Replace("\"", "")
        .Replace("&quot;", "\"");
}

Of course this assumes the data coming in is valid so if you have something like "fo,o"b,ar","bar""foo" this will not work. Additionally if your data contains &quot; then it will be turned into a " which may or may not be desirable.



回答2:

There are a lot of ways to do this. One inelegant way that would work is:

  1. Convert \",\" to tab or some other delimiter (I assume you left out a few \" in your example because otherwise the string is not consistent
  2. Strip all remaining commas
  3. Strip all remaining \"
  4. Convert your delimiter (e.g. tab) back into a comma

Now you have what you wanted in first place



回答3:

I agree with Kyle regarding your string probably not being consistent.

Instead of Kyle's first step you could use

string[] vals = Regex.Split(value, @"\s*\"",\s*");


回答4:

From your input example, we can see that there are three "unwanted" sequences of characters:

\"
\",
,\"

So, add all these sequences to the input array for the Split method:

string[] result = clipData.Split(new[] { @",\""", @"\"",", @"\""" }, 
    StringSplitOptions.None);

This will give you an array containing a few empty elements. If that is a problem, use StringSplitOptions.RemoveEmptyEntries instead of StringSplitOptions.None:

string[] result = clipData.Split(new[] { @",\""", @"\"",", @"\""" }, 
    StringSplitOptions.RemoveEmptyEntries);


回答5:

You could try to use a bit of LINQ:

string excelData = "\\\" 1,234,123.00 \\\",\\\" 2,345.00 \\\", 342.00 ,\\\" 12,345.00 \\\"";

IEnumerable<string> cells = from x in excelData.Split(new string[] { "\\\"" }, StringSplitOptions.RemoveEmptyEntries)
                            let y = x.Trim(',').Trim()
                            where !string.IsNullOrWhiteSpace(y)
                            select y;

Alternatively, if you don't like this suggestion, try to implement a similar pattern with RegEx.