I'm looking at my delimited-file (e.g. CSV, tab-separated, etc.) parsing options on the MS stack in general, and .NET specifically. The only technology I'm excluding is SSIS, because I already know it will not meet my needs.
So my options appear to be:
- Regex.Split
- TextFieldParser
- OLEDB CSV Parser
I have two criteria I must meet. First, given the following file which contains two logical rows of data (and five physical rows altogether):
101, Bob, "Keeps his house ""clean"".
Needs to work on laundry."
102, Amy, "Brilliant.
Driven.
Diligent."
The parsed results must yield two logical "rows," consisting of three strings (or columns) each. The third column of each row must preserve its newlines! Said differently, the parser must recognize when a logical row is "continuing" onto the next physical row because a text qualifier is still "unclosed."
The second criterion is that the delimiter and text qualifier must be configurable, per file. Here are two strings, taken from different files, that I must be able to parse:
var first = @"""This"",""Is,A,Record"",""That """"Cannot"""", they say,"","""",,""be"",rightly,""parsed"",at all";
var second = @"~This~|~Is|A|Record~|~ThatCannot~|~be~|~parsed~|at all";
A proper parsing of string "first" would be:
- This
- Is,A,Record
- That "Cannot", they say,
- _
- _
- be
- rightly
- parsed
- at all
The '_' simply means that a blank was captured - I don't want a literal underbar to appear.
One important assumption can be made about the flat-files to be parsed: there will be a fixed number of columns per file.
Now for a dive into the technical options.
REGEX
First, many responders comment that regex "is not the best way" to achieve the goal. I did, however, find a commenter who offered an excellent CSV regex:
var regex = @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))";
Regex.Split(first, regex).Dump(); // Dump() is LINQPad's output helper
The results, applied to string "first," are quite wonderful:
- "This"
- "Is,A,Record"
- "That ""Cannot"", they say,"
- ""
- _
- "be"
- rightly
- "parsed"
- at all
It would be nice if the quotes were cleaned up, but I can easily deal with that as a post-process step. Otherwise, this approach can be used to parse both sample strings "first" and "second," provided the regex is modified for tilde and pipe symbols accordingly. Excellent!
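A minimal sketch of both ideas from the paragraph above: building the same lookahead pattern for an arbitrary delimiter and qualifier, and stripping the qualifiers as a post-process step. The helper names BuildSplitPattern, Unquote, and SplitFields are hypothetical (they are not from the quoted comment), and the sketch assumes single-character delimiters and qualifiers, with a literal qualifier escaped by doubling it.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: the quoted regex, parameterized on delimiter and qualifier.
// Assumption: a qualifier of ']' would need extra escaping inside the character class.
static string BuildSplitPattern(char delimiter, char qualifier)
{
    string d = Regex.Escape(delimiter.ToString());
    string q = Regex.Escape(qualifier.ToString());
    // Match a delimiter only when an even number of qualifiers follows it,
    // i.e. when the delimiter is not inside a qualified field.
    return $"{d}(?=(?:[^{q}]*{q}[^{q}]*{q})*(?![^{q}]*{q}))";
}

// Hypothetical helper: the "post-process step" that cleans up one field.
static string Unquote(string field, char qualifier)
{
    bool wrapped = field.Length >= 2
                   && field[0] == qualifier
                   && field[field.Length - 1] == qualifier;
    if (!wrapped) return field;

    string q = qualifier.ToString();
    // Drop the outer pair, then collapse each doubled qualifier to a single one.
    return field.Substring(1, field.Length - 2).Replace(q + q, q);
}

static string[] SplitFields(string record, char delimiter, char qualifier) =>
    Regex.Split(record, BuildSplitPattern(delimiter, qualifier))
         .Select(f => Unquote(f, qualifier))
         .ToArray();

var first = @"""This"",""Is,A,Record"",""That """"Cannot"""", they say,"","""",,""be"",rightly,""parsed"",at all";
var second = @"~This~|~Is|A|Record~|~ThatCannot~|~be~|~parsed~|at all";

// Brackets make the two empty fields visible.
foreach (var field in SplitFields(first, ',', '"')) Console.WriteLine($"[{field}]");
foreach (var field in SplitFields(second, '|', '~')) Console.WriteLine($"[{field}]");
```

This still operates on one complete logical row at a time, so it does not address the multi-line criterion by itself.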
But the real problem pertains to the multi-line criterion. Before a regex can be applied to a string, I must read the full logical "row" from the file. Unfortunately, I don't know how many physical rows to read to complete the logical row, unless I've got a regex / state machine.
So this becomes a "chicken and egg" problem. My best option would be to read the entire file into memory as one giant string, and let the regex sort out the multiple lines (I didn't check whether the above regex could handle that). If I've got a 10 gig file, this could be a bit precarious.
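One possible way around the chicken-and-egg problem, sketched under assumptions: read physical lines one at a time and keep appending them to the current logical row while the number of qualifiers seen so far is odd. ReadLogicalRows is a hypothetical helper name, and the sketch assumes a literal qualifier inside a field is always doubled (as in the sample file), so it never changes the odd/even count. Only the current logical row is held in memory, not the whole file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper: yields one logical row at a time from any TextReader.
static IEnumerable<string> ReadLogicalRows(TextReader reader, char qualifier)
{
    var row = new StringBuilder();
    bool insideQualifier = false;

    for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
    {
        // ReadLine strips the line break; restore it (normalized to '\n')
        // when this physical line continues a qualified field.
        if (insideQualifier) row.Append('\n');
        row.Append(line);

        // An odd number of qualifiers on this line flips the open/closed state.
        // Doubled (escaped) qualifiers add two and leave the state unchanged.
        if (line.Count(c => c == qualifier) % 2 == 1)
            insideQualifier = !insideQualifier;

        if (!insideQualifier)
        {
            yield return row.ToString();
            row.Clear();
        }
    }

    // A file that ends inside an unclosed qualifier still yields what was read.
    if (row.Length > 0) yield return row.ToString();
}

var sample =
    "101, Bob, \"Keeps his house \"\"clean\"\".\n" +
    "Needs to work on laundry.\"\n" +
    "102, Amy, \"Brilliant.\n" +
    "Driven.\n" +
    "Diligent.\"\n";

// Five physical lines come back as two logical rows.
foreach (var logicalRow in ReadLogicalRows(new StringReader(sample), '"'))
    Console.WriteLine($"<{logicalRow}>");
```

Each yielded row could then be handed to a per-row splitter such as the regex above. With a real file, a StreamReader would take the place of the StringReader.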
On to the next option.
TextFieldParser
Three lines of code will make the problem with this option apparent:
var reader = new Microsoft.VisualBasic.FileIO.TextFieldParser(stream);
reader.Delimiters = new string[] { @"|" };
reader.HasFieldsEnclosedInQuotes = true;
The Delimiters configuration looks good. However, "HasFieldsEnclosedInQuotes" is "game over." I'm stunned that the delimiters are arbitrarily configurable, yet the only text qualifier on offer is the quotation mark. Remember, I need configurability over the text qualifier. So unless someone knows a TextFieldParser configuration trick, this option is out.
OLEDB
A colleague tells me this option has two major failings. First, it has terrible performance for large (e.g. 10 gig) files. Second, so I'm told, it guesses data types of input data rather than letting you specify. Not good.
HELP
So I'd like to know the facts I got wrong (if any), and the other options that I missed. Perhaps someone knows a way to jury-rig TextFieldParser to use an arbitrary text qualifier. And maybe OLEDB has resolved the stated issues (or perhaps never had them?).
What say ye?
Take a look at the code I posted to this question:
It covers most of your requirements, and it wouldn't take much to update it to support alternate delimiters or text qualifiers.
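The linked code is not reproduced here. As an independent sketch (not the code referred to above) of what a hand-rolled parser with a configurable delimiter and text qualifier might look like, here is a small character-level state machine. ParseDelimited is a hypothetical name, and the sketch assumes single-character delimiters and qualifiers, with a literal qualifier escaped by doubling it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: streams logical rows as string arrays, one char at a time.
static IEnumerable<string[]> ParseDelimited(TextReader reader, char delimiter, char qualifier)
{
    var fields = new List<string>();
    var field = new StringBuilder();
    bool inQualified = false;   // between an opening and a closing qualifier
    bool rowHasContent = false; // something was read since the last row ended

    int next;
    while ((next = reader.Read()) != -1)
    {
        char c = (char)next;
        if (inQualified)
        {
            if (c != qualifier)
                field.Append(c);            // includes delimiters and newlines
            else if (reader.Peek() == qualifier)
            {
                reader.Read();              // doubled qualifier: keep one literal
                field.Append(qualifier);
            }
            else
                inQualified = false;        // closing qualifier
        }
        else if (c == qualifier)
        {
            inQualified = true;
            rowHasContent = true;
        }
        else if (c == delimiter)
        {
            fields.Add(field.ToString());
            field.Clear();
            rowHasContent = true;
        }
        else if (c == '\n')
        {
            // End of a logical row; blank physical lines are skipped.
            if (rowHasContent)
            {
                fields.Add(field.ToString());
                yield return fields.ToArray();
            }
            fields.Clear();
            field.Clear();
            rowHasContent = false;
        }
        else if (c != '\r')                 // drop '\r' outside qualified fields
        {
            field.Append(c);
            rowHasContent = true;
        }
    }

    // Last row when the input does not end with a line break.
    if (rowHasContent)
    {
        fields.Add(field.ToString());
        yield return fields.ToArray();
    }
}

var sample = "101, Bob, \"Keeps his house \"\"clean\"\".\nNeeds to work on laundry.\"\n" +
             "102, Amy, \"Brilliant.\nDriven.\nDiligent.\"\n";

foreach (var columns in ParseDelimited(new StringReader(sample), ',', '"'))
    Console.WriteLine($"{columns.Length} columns, last = <{columns[columns.Length - 1]}>");
```

Whitespace outside the qualifiers is kept as-is, so " Bob" keeps its leading space; trimming would be a separate decision.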
Did you try searching for an already-existing .NET CSV parser? This one claims to handle multi-line records significantly faster than OLEDB.
I wrote this a while back as a lightweight, standalone CSV parser. I believe it meets all of your requirements. Give it a try with the knowledge that it probably isn't bulletproof.
If it does work for you, feel free to change the namespace and use without restriction.