Regex to split a CSV

Posted 2019-01-03 02:35

I know this (or similar) has been asked many times, but having tried out numerous possibilities I've not been able to find a regex that works 100%.

I've got a CSV file and I'm trying to split it into an array, but I'm running into two problems: quoted commas and empty elements.

The CSV looks like:

123,2.99,AMO024,Title,"Description, more info",,123987564

The regex I've tried to use is:

thisLine.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)

The only problem is that in my output array the 5th element comes out as 123987564 and not an empty string.

17 answers
Emotional °昔
Reply #2 · 2019-01-03 02:53

I have personally tried many regular expressions without finding one that handles every case.

I think regular expressions are hard to configure so that they handle every case properly. Although some people will not like the namespace (and I used to be one of them), I suggest something that is part of the .NET Framework and gives me correct results every time, in all cases (in particular, it handles every double-quote case very well):

Microsoft.VisualBasic.FileIO.TextFieldParser

Found it here: StackOverflow

Example of usage:

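using System.IO;
using Microsoft.VisualBasic.FileIO;   // requires a reference to the Microsoft.VisualBasic assembly

// Any TextReader works; the source string below is specific to the original project.
TextReader textReader = new StringReader(simBaseCaseScenario.GetSimStudy().Study.FilesToDeleteWhenComplete);
TextFieldParser textFieldParser = new TextFieldParser(textReader);
textFieldParser.SetDelimiters(new string[] { ";" });   // use "," for a comma-separated file
string[] fields = textFieldParser.ReadFields();         // parses one line into its fields
foreach (string path in fields)
{
    ...
}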

Hope this helps.
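
For anyone not on .NET, the same idea (let a small parser walk the line instead of relying on a regex) can be sketched in JavaScript. This is only a minimal sketch, not what TextFieldParser does internally: splitCsvLine is a made-up helper name, and it assumes RFC 4180-style quoting (a literal quote inside a quoted field is written as "") and a single line of input.

```javascript
// Hypothetical helper: splits ONE CSV line into fields without a regex.
// Assumes fields may be wrapped in double quotes and that "" inside a
// quoted field stands for one literal quote.
function splitCsvLine(line, delimiter = ",") {
  const fields = [];
  let current = "";
  let inQuotes = false;

  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;            // skip the second quote of the pair
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === delimiter) {
      fields.push(current); // end of field (possibly empty)
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current); // last field (may be empty)
  return fields;
}

const row = splitCsvLine('123,2.99,AMO024,Title,"Description, more info",,123987564');
console.log(row);
```

With the line from the question this should give seven fields, with the quoted comma kept inside the fifth and an empty string in the sixth.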

太酷不给撩
Reply #3 · 2019-01-03 02:55

I'm late to the party, but the following is the Regular Expression I use:

(?:,"|^")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?)(?=$|,)|(\r\n|\n)

This pattern has three capturing groups:

  1. Contents of a quoted cell
  2. Contents of an unquoted cell
  3. A new line

This pattern handles all of the following:

  • Normal cell contents without any special features: one,2,three
  • Cell containing a double quote (" is escaped to ""): no quote,"a ""quoted"" thing",end
  • Cell contains a newline character: one,two\nthree,four
  • Normal cell contents which have an internal quote: one,two"three,four
  • Cell contains quotation mark followed by comma: one,"two ""three"", four",five

See this pattern in use.
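
As a rough illustration of how the three groups can be consumed, here is a minimal JavaScript sketch. cellsOf is a made-up helper name, and it assumes one line is passed in at a time (so ^ and $ mean the start and end of that line) and that "" inside a quoted cell should be turned back into a single quote.

```javascript
// The three-group pattern from above, with the g flag so matchAll can walk the line.
const cellPattern = /(?:,"|^")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?)(?=$|,)|(\r\n|\n)/g;

// Hypothetical helper: collects the cells of one line.
function cellsOf(line) {
  const cells = [];
  for (const m of line.matchAll(cellPattern)) {
    if (m[1] !== undefined) {
      cells.push(m[1].replace(/""/g, '"')); // group 1: quoted cell, unescape ""
    } else if (m[2] !== undefined) {
      cells.push(m[2]); // group 2: unquoted cell
    }
    // group 3 (a newline) would mark the end of a record; ignored in this sketch
  }
  return cells;
}

console.log(cellsOf('no quote,"a ""quoted"" thing",end'));
// → [ 'no quote', 'a "quoted" thing', 'end' ]
console.log(cellsOf('123,2.99,AMO024,Title,"Description, more info",,123987564'));
```

The second call uses the line from the question; the empty cell should come back as an empty string because its match still consumes the leading comma.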

If you are using a more capable flavor of regex with named groups and lookbehinds, I prefer the following:

(?<quoted>(?<=,"|^")(?:""|[\w\W]*?)*(?=",|"$))|(?<normal>(?<=,(?!")|^(?!"))[^,]*?(?=(?<!")$|(?<!"),))|(?<eol>\r\n|\n)

See this pattern in use.

Edit

(?:^"|,")(""|[\w\W]*?)(?=",|"$)|(?:^(?!")|,(?!"))([^,]*?)(?=$|,)|(\r\n|\n)

This slightly modified pattern handles lines where the first column is empty, as long as you are not using JavaScript. For some reason JavaScript omits the second column with this pattern, and I was unable to handle this edge case correctly.
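
One plausible explanation for the JavaScript behaviour (not confirmed by this answer): when the first column is empty, the first match is zero-length at index 0. After an empty match, matchAll moves the search position forward by one character, which steps over the comma that the second column's match would need to start on. The sketch below only demonstrates the symptom; the variable names are made up for illustration.

```javascript
// The edited pattern from above, with the g flag.
const editedPattern = /(?:^"|,")(""|[\w\W]*?)(?=",|"$)|(?:^(?!")|,(?!"))([^,]*?)(?=$|,)|(\r\n|\n)/g;

// First column empty: the first (and only) match is zero-length at index 0.
const emptyFirst = [...",b".matchAll(editedPattern)];
console.log(emptyFirst.map((m) => ({ index: m.index, text: m[0] })));

// First column non-empty: both columns are found.
const filledFirst = [..."a,b".matchAll(editedPattern)];
console.log(filledFirst.map((m) => m[2]));
```

In the first case the search resumes at index 1 (the "b"), where no alternative can start, so the second column never shows up.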

等我变得足够好
Reply #4 · 2019-01-03 02:55
,?\s*'.+?'|,?\s*".+?"|[^"']+?(?=,)|[^"']+  

This regex works with single and double quotes and also for one quote inside another!

叛逆
Reply #5 · 2019-01-03 02:57

Aaaand another answer here. :) Since I couldn't make the others quite work.

My solution handles escaped quotes (doubled occurrences) and does not include the delimiters in the match.

Note that I have been matching against ' instead of " as that was my scenario, but simply replace them in the pattern for the same effect.

Here goes (remember to use the "ignore whitespace" flag /x if you use the commented version below) :

# Only include if previous char was start of string or delimiter
(?<=^|,)
(?:
  # 1st option: empty quoted string (,'',)
  '{2}
  |
  # 2nd option: nothing (,,)
  (?:)
  |
  # 3rd option: all but quoted strings (,123,)
  # (line breaks are excluded so that multi-line input can be matched)
  [^,'\r\n]+
  |
  # 4th option: quoted strings (,'123''321',)
  # start pling
  ' 
    (?:
      # double quote
      '{2}
      |
      # or anything but quotes
      [^']+
    # at least one occurrence - greedy
    )+
  # end pling
  '
)
# Only include if next char is delimiter or end of string
(?=,|$)

Single line version:

(?<=^|,)(?:'{2}|(?:)|[^,'\r\n]+|'(?:'{2}|[^']+)+')(?=,|$)

Regular expression visualization (if it works; Debuggex seems to have issues right now, otherwise follow the next link)

Debuggex Demo

regex101 example
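
A minimal JavaScript sketch of using the single-line version, keeping the single quote as the quote character. fieldsOf is a made-up helper name; the quote stripping and the '' to ' unescaping are an assumption about what you want to do with the raw matches, and the lookbehind needs a reasonably recent JavaScript engine.

```javascript
// Single-line version of the pattern above, with the g flag for matchAll.
const fieldPattern = /(?<=^|,)(?:'{2}|(?:)|[^,'\r\n]+|'(?:'{2}|[^']+)+')(?=,|$)/g;

// Hypothetical helper: returns the fields of one line, stripping the outer
// quotes of quoted fields and turning '' back into '.
function fieldsOf(line) {
  return Array.from(line.matchAll(fieldPattern), (m) => {
    const raw = m[0];
    return raw.startsWith("'") ? raw.slice(1, -1).replace(/''/g, "'") : raw;
  });
}

console.log(fieldsOf("a,'b,c',,'d''e'"));
// → [ 'a', 'b,c', '', "d'e" ]
```

Because the delimiters are only looked at and never consumed, empty fields at the start, middle, or end of the line each produce their own zero-length match.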

贼婆χ
Reply #6 · 2019-01-03 02:58

In Java this pattern ",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))" almost works for me:

String text = "\",\",\",,\",,\",asdasd a,sd s,ds ds,dasda,sds,ds,\"";
String regex = ",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))";
Pattern p = Pattern.compile(regex);
String[] split = p.split(text);
for(String s:split) {
    System.out.println(s);
}

output:

","
",,"

",asdasd a,sd s,ds ds,dasda,sds,ds,"

Disadvantage: it does not work when a column has an odd number of quotes :(
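
To make that limitation concrete, here is a small JavaScript sketch of the same idea (split on commas that are followed by an even number of double quotes). It uses a non-capturing group, because in JavaScript a capturing group inside the split pattern would add extra entries to the result; the variable names are made up for illustration.

```javascript
// Split on commas that have an even number of quotes ahead of them,
// i.e. commas that are outside a quoted field.
const outsideQuotes = /,(?=(?:[^"]*"[^"]*")*(?![^"]*"))/;

// Balanced quotes: the quoted comma is kept and the empty field survives.
const balanced = 'a,"b,c",,d'.split(outsideQuotes);
console.log(balanced);
// → [ 'a', '"b,c"', '', 'd' ]

// Odd number of quotes: the comma before the stray quote now sees an odd
// count ahead of it and is no longer treated as a delimiter.
const unbalanced = 'a,b"c,d'.split(outsideQuotes);
console.log(unbalanced);
// → [ 'a,b"c', 'd' ]
```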

霸刀☆藐视天下
Reply #7 · 2019-01-03 02:59

Yet another answer with a few extra features like support for quoted values that contain escaped quotes and CR/LF characters (single values that span multiple lines).

NOTE: Though the solution below can likely be adapted for other regex engines, using it as-is requires that your regex engine treat multiple named capture groups with the same name as one single capture group. (.NET does this by default.)


When multiple lines/records of a CSV file/stream (conforming to RFC 4180) are passed to the regular expression below, it returns a match for each non-empty line/record. Each match contains a capture group named Value that holds the captured values in that line/record (and potentially an OpenValue capture group if there was an open quote at the end of the line/record).

Here's the commented pattern (test it on Regexstorm.net):

(?<=\r|\n|^)(?!\r|\n|$)                       // Records start at the beginning of line (line must not be empty)
(?:                                           // Group for each value and a following comma or end of line (EOL) - required for quantifier (+?)
  (?:                                         // Group for matching one of the value formats before a comma or EOL
    "(?<Value>(?:[^"]|"")*)"|                 // Quoted value -or-
    (?<Value>(?!")[^,\r\n]+)|                 // Unquoted value -or-
    "(?<OpenValue>(?:[^"]|"")*)(?=\r|\n|$)|   // Open ended quoted value -or-
    (?<Value>)                                // Empty value before comma (before EOL is excluded by "+?" quantifier later)
  )
  (?:,|(?=\r|\n|$))                           // The value format matched must be followed by a comma or EOL
)+?                                           // Quantifier to match one or more values (non-greedy/as few as possible to prevent infinite empty values)
(?:(?<=,)(?<Value>))?                         // If the group of values above ended in a comma then add an empty value to the group of matched values
(?:\r\n|\r|\n|$)                              // Records end at EOL


Here's the raw pattern without all the comments or whitespace.

(?<=\r|\n|^)(?!\r|\n|$)(?:(?:"(?<Value>(?:[^"]|"")*)"|(?<Value>(?!")[^,\r\n]+)|"(?<OpenValue>(?:[^"]|"")*)(?=\r|\n|$)|(?<Value>))(?:,|(?=\r|\n|$)))+?(?:(?<=,)(?<Value>))?(?:\r\n|\r|\n|$)


Here is a visualization from Debuggex.com (capture groups named for clarity): Debuggex.com visualization

Examples on how to use the regex pattern can be found on my answer to a similar question here, or on C# pad here, or here.
