Regular expression to remove whitespace around a c

2019-07-15 08:29发布

问题:

I have a CSV file that has rows resembling this:

1,  4,     2, "PUBLIC, JOHN Q" ,ACTIVE , 1332

I am looking for a regular expression replacement that will match against these rows and spit out something resembling this:

1,4,2,"PUBLIC, JOHN Q",ACTIVE,1332

I thought this would be rather easy: I made the expression ([ \t]+,) and replaced it with ,. I made a complement expression (,[ \t]+) with a replacement of , and I thought I had achieved a good means of right-trimming and left-trimming strings.

...but then I noticed that my "PUBLIC, JOHN Q" was now "PUBLIC,JOHN Q" which isn't what I wanted. (Note the space following the comma is now gone).

What would be the appropriate expression to trim the white space before and after a comma, but leave quoted text untouched?

UPDATE

To clarify, I am using an application to handle the file. This application allows me to define multiple regular expression replacements; it does not provide a parsing capability. While this may not be the ideal mechanism for this, it would sure beat making another application for this one file.

回答1:

If the engine used by your tool is the C# regular expression engine, then you can try the following expression:

(?<!,\s*"(?:[^\\"]|\\")*)\s+(?!(?:[^\\"]|\\")*"\s*,)

replace with empty string.

The guys answers assumed the quotes are balanced and used counting to determine if the space is part of a quoted value or not.

My expression looks for all spaces that are not part of a quoted value.

RegexHero Demo



回答2:

Something like this might do the job:

(?<!(^[^"]*"[^"]*(("[^"]*){2})*))[\t ]*,[ \t]*

Which matches [\t ]*,[ \t]*, only when not preceded by an odd number of quotes.



回答3:

Going with some CSV library or parsing the file yourself would be much more easier, and IMO should be preferable option here.

But if you really insist on a regex, you can use this one:

"\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)"

And replace it with empty string - ""

This regex matches one or more whitespaces, followed by an even number of quotes. This will of course work only if you have balanced quote.

(?x)       # Ignore Whitespace
\s+        # One or more whitespace characters       
(?=        # Followed by 
   (          # A group - This group captures even number of quotes
     [^\"]*     # Zero or more non-quote characters
     \"         # A quote
     [^\"]*     # Zero or more non-quote characters
     \"         # A quote 
   )*         # Zero or more repetition of previous group
   [^\"]*     # Zero or more non-quote characters
   $          # Till the end
)          # Look-ahead end


回答4:

        string format(string val)
        {
            if (val.StartsWith("\"")) val = " " + val;
            string[] vals = val.Split('\"');
            for (int i = 0; i < vals.Length; i += 2) vals[i] = vals[i].Replace(" ", "").Replace("\t", "");
            return string.Join("\t", vals);
        }

This will work if you have properly closed quoted strings in between



回答5:

Forget the regex (See Bart's comment on the question, regular expressions aren't suitable for CSV).

public static string ReduceSpaces( string input )
{
    char[] a = input.ToCharArray();
    int placeComma = 0, placeOther = 0;
    bool inQuotes = false;
    bool followedComma = true;
    foreach( char c in a ) {
        inQuotes ^= (c == '\"');
        if (c == ' ') {
            if (!followedComma)
                a[placeOther++] = c;
        }
        else if (c == ',') {
            a[placeComma++] = c;
            placeOther = placeComma;
            followedComma = true;
        }
        else {
            a[placeOther++] = c;
            placeComma = placeOther;
            followedComma = false;
        }
    }
    return new String(a, 0, placeComma);
}

Demo: http://ideone.com/NEKm09