C# Replace with regex

Posted 2019-07-22 05:44

Question:

I'm new to VB and C#, and am struggling with regex. I think I've got the following code in the right shape to replace each regex match in my file with an empty string.

EDIT: Per comments this code block has been changed.

var fileContents = System.IO.File.ReadAllText(@"C:\path\to\file.csv");

// "regex" stands in for the actual pattern
var regex = new Regex(@"regex");
fileContents = regex.Replace(fileContents, "");
System.IO.File.WriteAllText(@"C:\path\to\file.csv", fileContents);

My files are formatted like this:

"1111111","22222222222","Text that may, have a comma, or two","2014-09-01",,,,,,

So far, I have a regex that finds any string between ," and ", that contains a comma (there are never commas in the first or last cell, so I'm not worried about excluding those two). I'm testing the regex in Expresso:

(?<=,")([^"]+,[^"]+)(?=",)

I'm just not sure how to isolate that comma as what needs to be replaced. What would be the best way to do this?

SOLVED: Combined [^"]+ with look behind/ahead:

(?<=,"[^"]+)(,)(?=[^"]+",)

FINAL EDIT: Here's my final complete solution:

//read file contents
var fileContents = System.IO.File.ReadAllText(@"C:\path\to\file.csv");

//find all commas between double quotes
var regex = new Regex("(?<=,\")([^\"]+,[^\"]+)(?=\",)");

//replace all commas with ""
fileContents = regex.Replace(fileContents, m => m.ToString().Replace(",", ""));

//write result back to file
System.IO.File.WriteAllText(@"C:\path\to\file.csv", fileContents);

Answer 1:

I would probably use the overload of Regex.Replace that takes a delegate to return the replacement text. This is useful when a simple regex identifies the pattern but the replacement itself needs more complex logic.

I find that keeping your regexes simple pays off when you have to maintain them later.

Note: this is similar to the answer by @Florian, but this replace restricts itself to replacement only in the matched text.

string exp = "(?<=,\")([^\"]+,[^\"]+)(?=\",)";
var regex = new Regex(exp);
string replacedText = regex.Replace(fileContents, m => m.ToString().Replace(",", ""));
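
As a minimal sketch of how this delegate overload behaves on the sample line from the question, the following can be run on its own. The class and method names (`CommaStripper`, `StripCommas`) are made up for illustration, and the input is held in memory rather than read from a file.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommaStripper
{
    // Matches the content of a quoted field that contains at least one comma.
    private static readonly Regex QuotedFieldWithComma =
        new Regex("(?<=,\")([^\"]+,[^\"]+)(?=\",)");

    // Hypothetical helper: removes commas only inside each matched field;
    // everything outside the matches is left untouched.
    public static string StripCommas(string text)
    {
        return QuotedFieldWithComma.Replace(text, m => m.Value.Replace(",", ""));
    }

    public static void Main()
    {
        string line = "\"1111111\",\"22222222222\",\"Text that may, have a comma, or two\",\"2014-09-01\",,,,,,";
        Console.WriteLine(StripCommas(line));
        // prints: "1111111","22222222222","Text that may have a comma or two","2014-09-01",,,,,,
    }
}
```

The lambda receives each Match and returns the text that replaces it, so the field separators outside the match are never touched.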


Answer 2:

I figured it out by combining [^"]+ with a lookbehind (?<=...) and a lookahead (?=...). The pattern matches a comma that is preceded by ," plus one or more characters that are not double quotes, and followed by one or more characters that are not double quotes plus ",

(?<=,"[^"]+)(,)(?=[^"]+",)
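
Because this pattern matches only the comma itself, a plain string replacement is enough. Note that it relies on a variable-length lookbehind, which the .NET regex engine supports but many other engines do not. The sketch below assumes an in-memory line; `LookaroundDemo` and `StripCommas` are illustrative names only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical helper. In a verbatim string, "" is an escaped double quote,
    // so the pattern is: (?<=,"[^"]+),(?=[^"]+",)
    public static string StripCommas(string text)
    {
        return Regex.Replace(text, @"(?<=,""[^""]+),(?=[^""]+"",)", "");
    }

    public static void Main()
    {
        string line = "\"1111111\",\"22222222222\",\"Text that may, have a comma, or two\",\"2014-09-01\",,,,,,";
        Console.WriteLine(StripCommas(line));
        // prints: "1111111","22222222222","Text that may have a comma or two","2014-09-01",,,,,,
    }
}
```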



Answer 3:

Try to parse out all your columns with this:

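 // Match each quoted column together with its surrounding quotes, so that
 // the separators between columns are never matched on their own.
 Regex regex = new Regex("\"[^\"]*\"");

Then you can just do:

 foreach (Match match in regex.Matches(fileContents))
 {
      fileContents = fileContents.Replace(match.Value, match.Value.Replace(",", string.Empty));
 }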

Might not be as fast but should work.



Answer 4:

What you have there is an irregular language, because a comma can mean different things depending on where it is in the text stream. Regular expressions are designed to parse regular languages, where a comma would mean the same thing regardless of where it appears. What you need for an irregular language is a parser. In fact, regular expressions are mostly used for tokenizing strings before they are passed to a parser.

While what you are trying to do can be done with regular expressions, it is likely to be very slow. For example, you can use the following (which works even if the comma is the first or last character in the field). However, every time it finds a comma, it has to scan backwards and forwards to check whether it is between two quotation marks.

 (?<=,"[^"]*),(?=[^"]*",)

Note also that there may be a flaw in this approach that you have not yet spotted. I don't know if you have this issue, but CSV files often contain quotation marks in the middle of fields that may also contain a comma. In these cases, applications like MS Excel typically double the quote to show that it is not the end of the field, like this:

"1111111","22222222222","Text that may, have a comma, Quote"" or two","2014-09-01",,,,,,

In this case you are going to be out of luck with a regular expression.

Thankfully the code to deal with CSV files is very simple:

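    // Requires: using System.Collections.Generic; using System.Text;
    public static IList<string> ParseCSVLine(string csvLine)
    {
        List<string> result = new List<string>();    // completed fields
        StringBuilder buffer = new StringBuilder();  // field currently being read

        bool inQuotes = false;  // true while inside a quoted field
        char lastChar = '\0';   // previous character, used to detect doubled quotes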

        foreach (char c in csvLine)
        {
            switch (c)
            {
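                case '"':
                    if (inQuotes)
                    {
                        // Closing quote, or the first half of a doubled quote ("").
                        inQuotes = false;
                    }
                    else
                    {
                        if (lastChar == '"')
                        {
                            // Second half of a doubled quote: emit a literal quote
                            // and carry on inside the field.
                            buffer.Append('"');
                        }
                        inQuotes = true;
                    }
                    break;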

                case ',':
                    if (inQuotes)
                    {
                        buffer.Append(',');
                    }
                    else
                    {
                        result.Add(buffer.ToString());
                        buffer.Clear();
                    }
                    break;

                default:
                    buffer.Append(c);
                    break;
            }

            lastChar = c;
        }
        result.Add(buffer.ToString());
        buffer.Clear();

        return result;
    }

PS. There are a couple of other issues often encountered with CSV files that the code I have given doesn't solve. The first is what happens if a field has an end-of-line character in the middle of it. The second is how to know which character encoding a CSV file uses. The first is easy to solve by modifying my code slightly. The second, however, is nearly impossible to solve without coming to some agreement with the person supplying the file.
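
The same quote-tracking idea can also be applied directly to the original task of dropping commas inside quoted fields. The following is only a sketch under the assumption that every quoted field is well formed (quotes inside a field are doubled); `QuotedCommaRemover` and `RemoveQuotedCommas` are hypothetical names, not part of any library.

```csharp
using System;
using System.Text;

public static class QuotedCommaRemover
{
    // Hypothetical helper: copies the input, skipping commas that fall inside
    // double quotes. Quotes and all other characters are copied unchanged.
    public static string RemoveQuotedCommas(string text)
    {
        var output = new StringBuilder(text.Length);
        bool inQuotes = false;

        foreach (char c in text)
        {
            if (c == '"')
            {
                // Every quote toggles the state. A doubled quote ("") toggles
                // twice, so the scan is still inside the field afterwards.
                inQuotes = !inQuotes;
                output.Append(c);
            }
            else if (c == ',' && inQuotes)
            {
                // Comma inside a quoted field: drop it.
            }
            else
            {
                output.Append(c);
            }
        }

        return output.ToString();
    }

    public static void Main()
    {
        string line = "\"1111111\",\"22222222222\",\"Text that may, have a comma, Quote\"\" or two\",\"2014-09-01\",,,,,,";
        Console.WriteLine(RemoveQuotedCommas(line));
        // prints: "1111111","22222222222","Text that may have a comma Quote"" or two","2014-09-01",,,,,,
    }
}
```

This makes a single pass over the text with no backtracking, and the doubled-quote case needs no special handling.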