I know this (or similar) has been asked many times, but having tried out numerous possibilities I've not been able to find a regex that works 100%.
I've got a CSV file and I'm trying to split it into an array, but I'm encountering two problems: quoted commas and empty elements.
The CSV looks like:
123,2.99,AMO024,Title,"Description, more info",,123987564
The regex I've tried to use is:
thisLine.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)
The only problem is that in my output array the 5th element comes out as 123987564 and not an empty string.
I personally tried many regular expressions without finding one that matches all cases.
I think regular expressions are hard to configure properly to match every case. Although some people will not like the namespace (and I was one of them), I propose something that is part of the .NET Framework and gives me correct results every time, in all cases (mainly, it handles every double-quote case very well):
Microsoft.VisualBasic.FileIO.TextFieldParser
Found it here: StackOverflow
Example of usage:
Hope this helps.
I'm late to the party, but the following is the Regular Expression I use:
This pattern has three capturing groups:
This pattern handles all of the following:
See this pattern in use.
If you are using a more capable flavor of regex with named groups and lookbehinds, I prefer the following:
See this pattern in use.
Edit
This slightly modified pattern handles lines where the first column is empty, as long as you are not using JavaScript. For some reason JavaScript will omit the second column with this pattern. I was unable to handle this edge case correctly.
This regex works with single and double quotes and also for one quote inside another!
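For readers who want to experiment with the idea of accepting either quote character, here is a minimal JavaScript sketch. It is not the pattern from this answer: it uses a backreference so that a field opened with ' must close with ' (and likewise for "), which lets the other quote character appear inside. The helper name splitQuoted is made up for illustration, and escaped quotes of the same kind are not handled.

```javascript
// Hypothetical helper (not from the answer above): splits one line on commas,
// accepting fields wrapped in either single or double quotes.
function splitQuoted(line) {
  // Group 1: the opening quote character (' or ").
  // Group 2: everything up to the next occurrence of that same quote.
  // Group 3: an unquoted field (may be empty).
  const field = /(["'])((?:(?!\1).)*)\1|([^,]*)/y;
  const out = [];
  let pos = 0;
  for (;;) {
    field.lastIndex = pos;            // sticky flag: match exactly at pos
    const m = field.exec(line);       // never null: group 3 can match ''
    out.push(m[1] !== undefined ? m[2] : m[3]);
    pos = field.lastIndex;
    if (line[pos] !== ',') break;     // no delimiter follows: stop
    pos += 1;                         // skip the comma
  }
  if (pos < line.length) {
    throw new SyntaxError(`Unexpected character at offset ${pos}`);
  }
  return out;
}

console.log(splitQuoted(`a,"it's here",'say "hi", ok',`));
// fields: a | it's here | say "hi", ok | (empty string)
```

A field that starts with a quote but never closes it falls through to the unquoted branch, so it is split at the next comma rather than rejected.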
Aaaand another answer here. :) Since I couldn't make the others quite work.
My solution both handles escaped quotes (double occurrences), and it does not include delimiters in the match.
Note that I have been matching against ' instead of " as that was my scenario, but simply replace them in the pattern for the same effect.
Here goes (remember to use the "ignore whitespace" flag /x if you use the commented version below):
Single-line version:
Debuggex Demo
regex101 example
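As an independent illustration of the two properties described in this answer (doubled quotes treated as an escaped quote, and delimiters kept out of the captured value), here is a minimal JavaScript sketch. It is not the pattern from this answer: it matches " rather than ', it reads one field at a time with a sticky regex instead of using a single global pattern, and parseCsvLine is a hypothetical name.

```javascript
// Hypothetical helper: parses a single CSV line into an array of field values.
function parseCsvLine(line) {
  // Group 1: body of a quoted field, where "" stands for a literal quote.
  // Group 2: an unquoted field (may be empty).
  const field = /"((?:[^"]|"")*)"|([^",]*)/y;
  const fields = [];
  let pos = 0;
  for (;;) {
    field.lastIndex = pos;              // sticky flag: match exactly at pos
    const m = field.exec(line);         // never null: group 2 can match ''
    fields.push(m[1] !== undefined ? m[1].replace(/""/g, '"') : m[2]);
    pos = field.lastIndex;
    if (line[pos] !== ',') break;       // no delimiter follows: stop
    pos += 1;                           // skip the comma, it is never captured
  }
  if (pos < line.length) {
    // Typically an unclosed quote or text after a closing quote.
    throw new SyntaxError(`Unexpected character at offset ${pos}`);
  }
  return fields;
}

console.log(parseCsvLine('123,2.99,AMO024,Title,"Description, more info",,123987564'));
console.log(parseCsvLine('"a ""b""",c'));
```

Because the comma is consumed by the loop rather than by the pattern, an empty first column (a line starting with a comma) is read like any other empty field.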
In Java, this pattern
",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))"
almost works for me. Output:
Disadvantage: it does not work when a column has an odd number of quotes :(
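To make the lookahead idea and its limitation concrete, here is a small JavaScript sketch (JavaScript rather than Java, to match the question). It uses the non-capturing form of the group, since in JavaScript split() would otherwise add the captured text to the result. The function name splitOutsideQuotes and the sample inputs are made up for illustration.

```javascript
// A comma is treated as a delimiter only when an even number of double
// quotes follows it, i.e. when it is not inside an open quoted section.
const COMMA_OUTSIDE_QUOTES = /,(?=(?:[^"]*"[^"]*")*(?![^"]*"))/;

function splitOutsideQuotes(line) {
  return line.split(COMMA_OUTSIDE_QUOTES);
}

// Balanced quotes: the quoted comma is kept, the empty field survives,
// and the surrounding quotes stay in the value.
console.log(splitOutsideQuotes('a,"b, c",,d'));
// → [ 'a', '"b, c"', '', 'd' ]

// Odd number of quotes: the first comma is followed by one quote, so it is
// not treated as a delimiter and two columns are merged.
console.log(splitOutsideQuotes('a,b"c,d'));
// → [ 'a,b"c', 'd' ]
```

The second call shows the disadvantage mentioned above: a single stray quote changes how every comma before it is classified.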
Yet another answer with a few extra features like support for quoted values that contain escaped quotes and CR/LF characters (single values that span multiple lines).
NOTE: Though the solution below can likely be adapted for other regex engines, using it as-is will require that your regex engine treats multiple named capture groups using the same name as one single capture group. (.NET does this by default)
When multiple lines/records of a CSV file/stream (matching RFC standard 4180) are passed to the regular expression below, it will return a match for each non-empty line/record. Each match will contain a capture group named Value that contains the captured values in that line/record (and potentially an OpenValue capture group if there was an open quote at the end of the line/record).
Here's the commented pattern (test it on Regexstorm.net):
Here's the raw pattern without all the comments or whitespace.
Here is a visualization from Debuggex.com (capture groups named for clarity):
Examples on how to use the regex pattern can be found on my answer to a similar question here, or on C# pad here, or here.
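For engines without .NET's repeated named groups, the same features (escaped quotes, quoted values spanning several lines, one result per non-empty record) can be approximated with a small loop. The JavaScript sketch below is not a port of the pattern in this answer; parseCsv is a hypothetical name, and instead of reporting an open quote through an OpenValue group it simply throws.

```javascript
// Hypothetical helper: parses CSV text into an array of records,
// each record being an array of field values.
function parseCsv(text) {
  // Group 1: body of a quoted field ("" is an escaped quote; may contain CR/LF).
  // Group 2: an unquoted field, which stops at a comma or a line break.
  const field = /"((?:[^"]|"")*)"|([^",\r\n]*)/y;
  const records = [];
  let record = [];
  let pos = 0;
  for (;;) {
    field.lastIndex = pos;                 // sticky flag: match exactly at pos
    const m = field.exec(text);            // never null: group 2 can match ''
    const quoted = m[1] !== undefined;
    record.push(quoted ? m[1].replace(/""/g, '"') : m[2]);
    pos = field.lastIndex;
    if (text[pos] === ',') { pos += 1; continue; }

    // End of record: keep it unless the line was completely blank.
    const blank = record.length === 1 && !quoted && record[0] === '';
    if (!blank) records.push(record);
    record = [];

    if (pos >= text.length) return records;
    if (text.startsWith('\r\n', pos)) pos += 2;
    else if (text[pos] === '\n' || text[pos] === '\r') pos += 1;
    else throw new SyntaxError(`Unexpected character at offset ${pos}`);
  }
}

const sample = 'id,note\r\n1,"line one\nline two"\r\n2,"say ""hi"""\r\n\r\n3,\r\n';
console.log(parseCsv(sample));
```

With the sample above, the second record keeps the line break inside its quoted value, the blank line produces no record, and the last record ends with an empty field.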