I have a CSV file that has rows resembling this:
1, 4, 2, "PUBLIC, JOHN Q" ,ACTIVE , 1332
I am looking for a regular expression replacement that will match against these rows and spit out something resembling this:
1,4,2,"PUBLIC, JOHN Q",ACTIVE,1332
I thought this would be rather easy: I made the expression ([ \t]+,)
and replaced it with ,
. I made a complement expression (,[ \t]+)
with a replacement of ,
and I thought I had achieved a good means of right-trimming and left-trimming strings.
...but then I noticed that my "PUBLIC, JOHN Q"
was now "PUBLIC,JOHN Q"
which isn't what I wanted. (Note the space following the comma is now gone).
What would be the appropriate expression to trim the white space before and after a comma, but leave quoted text untouched?
UPDATE
To clarify, I am using an application to handle the file. This application allows me to define multiple regular expression replacements; it does not provide a parsing capability. While this may not be the ideal mechanism for this, it would sure beat making another application for this one file.
If the engine used by your tool is the C# regular expression engine, then you can try the following expression:
(?<!,\s*"(?:[^\\"]|\\")*)\s+(?!(?:[^\\"]|\\")*"\s*,)
replace with empty string.
The guys answers assumed the quotes are balanced and used counting to determine if the space is part of a quoted value or not.
My expression looks for all spaces that are not part of a quoted value.
RegexHero Demo
Something like this might do the job:
(?<!(^[^"]*"[^"]*(("[^"]*){2})*))[\t ]*,[ \t]*
Which matches [\t ]*,[ \t]*
, only when not preceded by an odd number of quotes.
Going with some CSV library or parsing the file yourself would be much more easier, and IMO should be preferable option here.
But if you really insist on a regex, you can use this one:
"\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)"
And replace it with empty string - ""
This regex matches one or more whitespaces, followed by an even number of quotes. This will of course work only if you have balanced quote.
(?x) # Ignore Whitespace
\s+ # One or more whitespace characters
(?= # Followed by
( # A group - This group captures even number of quotes
[^\"]* # Zero or more non-quote characters
\" # A quote
[^\"]* # Zero or more non-quote characters
\" # A quote
)* # Zero or more repetition of previous group
[^\"]* # Zero or more non-quote characters
$ # Till the end
) # Look-ahead end
string format(string val)
{
if (val.StartsWith("\"")) val = " " + val;
string[] vals = val.Split('\"');
for (int i = 0; i < vals.Length; i += 2) vals[i] = vals[i].Replace(" ", "").Replace("\t", "");
return string.Join("\t", vals);
}
This will work if you have properly closed quoted strings in between
Forget the regex (See Bart's comment on the question, regular expressions aren't suitable for CSV).
public static string ReduceSpaces( string input )
{
char[] a = input.ToCharArray();
int placeComma = 0, placeOther = 0;
bool inQuotes = false;
bool followedComma = true;
foreach( char c in a ) {
inQuotes ^= (c == '\"');
if (c == ' ') {
if (!followedComma)
a[placeOther++] = c;
}
else if (c == ',') {
a[placeComma++] = c;
placeOther = placeComma;
followedComma = true;
}
else {
a[placeOther++] = c;
placeComma = placeOther;
followedComma = false;
}
}
return new String(a, 0, placeComma);
}
Demo: http://ideone.com/NEKm09