How to convert comma-delimited file to pipe-delimi

2019-06-08 23:24发布

问题:

There are lots of search results on the web (and in SO) for something similar to what I need to do, but I haven't run into a solution yet for my particular situation.

I have a comma-delimited file in which only columns that have commas within them have double quotes around them. Other fields that have no comma in them are simply separated by the comma.

Take this example:

123,"box,toy",phone,"red,car,cat,dog","bike,pencil",man,africa,yellow,"jump,rope"

The output for that line needs to be:

123|box,toy|phone|red,car,cat,dog|bike,pencil|man|africa|yellow|jump,rope

I have this code currently:

Using sr As New StreamReader(csvFilePath)
    Dim line As String = ""
    Dim strReplacerQuoteCommaQuote As String = Chr(34) & "," & Chr(34)
    Dim strReplacerQuoteComma As String = Chr(34) & ","
    Dim strReplacerCommaQuote As String = "," & Chr(34)

    Do While sr.Peek <> -1
        line = sr.ReadLine
        line = Replace(line, strReplacerQuoteCommaQuote, "|")
        line = Replace(line, strReplacerQuoteComma, "|")
        line = Replace(line, strReplacerCommaQuote, "|")
        line = Replace(line, Chr(34), "")

        Console.WriteLine("line: " & line)
    Loop
End Using

The problem with that process is when I get to the fourth Replace() line the string looks like this:

123|box,toy|phone|red,car,cat,dog|bike,pencil|man,africa,yellow|jump,rope

So man and africa need pipes after them, but I can't just do a Replace on all commas, obviously.

How can I do this? Is there a RegEx statement that could handle this?

UPDATE With Working Code

The link in Avinash's comment had the answer that I went with. I Imported System.Text.RegularExpressions and used the following:

Using sr As New StreamReader(csvFilePath)
    Dim line As String = ""
    Dim strReplacerQuoteCommaQuote As String = Chr(34) & "," & Chr(34)
    Dim strReplacerQuoteComma As String = Chr(34) & ","
    Dim strReplacerCommaQuote As String = "," & Chr(34)

    Do While sr.Peek <> -1
        line = sr.ReadLine
        Dim pattern As String = "(,)(?=(?:[^""]|""[^""]*"")*$)"
        Dim replacement As String = "|"
        Dim regEx As New Regex(pattern)

        Dim newLine As String = regEx.Replace(line, replacement)
        newLine = newLine.Replace(Chr(34), "")

        Console.WriteLine("newLine: " & newLine)
    Loop
End Using

回答1:

This seems to work for your example:

Dim result = Regex.Replace(input, ",(?=([^""]*""[^""]*"")*[^""]*$)", Function(m) m.Value.Replace(",", "|"))
result = result.Replace(Chr(34), "")

See the accepted answer here for the regex explanation and be sure to upvote @mathematical.coffee's answer there while you're at it as I basically just stole his regex.

Edit: Regarding your performance issue, I created a file with 90k lines of:

abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,yellow,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"

which roughly equals a file size of 35MB, my laptop (nothing special) will parse that in around 6.5 seconds.

Yes, the regex is slow and the TextFieldParser class is also widely reported as not being the quickest but you obviously have some other bottleneck in your code if you are still processing for over 5 minutes. Please note that I am not actually doing anything with the parsed result though.

Edit 2: Ok, I thought I'd have one last go at this (I was bored this morning) but I still can't replicate your extended conversion time.

Time to get brutal, I created an input file with 150k lines of:

abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"

Each line has 1140 characters, total file size ~167MB.

Reading, converting and writing back out to a new file using the following code took 29 seconds.

Dim line, result As String
Dim replace As String = ",(?=([^""]*""[^""]*"")*[^""]*$)"
Using sw As New StreamWriter("d:\output.txt")
    Using sr As New StreamReader("d:\input.txt")
        While Not sr.EndOfStream
            line = sr.ReadLine
            result = Regex.Replace(line, replace, Function(m) m.Value.Replace(",", "|"))
            sw.WriteLine(result.Replace(Chr(34), ""))
        End While
    End Using
End Using

Edit 3: Using @sln's regex and this code cuts the processing time for the same file to 4 seconds.

Dim line, result As String
Dim pattern As String = ",([^,""]*(?:""[^""]*"")?[^,""]*)(?=,|$)"
Dim replacement As String = "|$1"
Dim rgx As New Regex(pattern)
Using sw As New StreamWriter("d:\output.txt")
    Using sr As New StreamReader("d:\input.txt")
        While Not sr.EndOfStream
            line = sr.ReadLine
            result = rgx.Replace(line, replacement)
            sw.WriteLine(result.Replace(Chr(34), ""))
        End While
    End Using
End Using

So there you go, I think you have a winner. As sln states, this is a relative test so machine speed is irrelevant.

,(?=([^"]*"[^"]*")*[^"]*$)          took 29 seconds
,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)  took 4 seconds

Finally (and just for completeness) the solution proposed by @jawood2005 is very workable:

Dim line As String
Dim fields As String()
Using sw As New StreamWriter("d:\output.txt")
    Using tfp As New FileIO.TextFieldParser("d:\input.txt")
        tfp.TextFieldType = FileIO.FieldType.Delimited
        tfp.Delimiters = New String() {","}
        tfp.HasFieldsEnclosedInQuotes = True
        While Not tfp.EndOfData
            fields = tfp.ReadFields
            line = String.Join("|", fields)
            sw.WriteLine(line.Replace(Chr(34), ""))
        End While
    End Using
End Using

Using the same 150k row input file as the regex solutions this completes in 18 seconds, so better than mine but sln wins the prize for the fastest solution to your problem.



回答2:

Bullet proof way.

 # Validate even quotes (one time match):  ^[^"]*(?:"[^"]*"[^"]*)*$   
 # Then ->
 # ----------------------------------------------
 # Find:  /,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)/
 # Replace:  '|$1'

 ,
 (                             # (1 start)
      [^,"]*  
      (?: " [^"]* " )?
      [^,"]*  
 )                             # (1 end)
 (?= , | $ )

Benchmarks

Since @TheBlueDog posted a Benchmark ('Edit 2'), I thought I would post a
benchmark as well.

Its based on his input, and the intent is to demonstrate the evils of using
'to-the-end-of-string' lookahead as a validation technique
(ie. this -> ^[^"]*(?:"[^"]*"[^"]*)*$)

Blue Dog's regex replace method is hampered a bit with an unnecessary callback, so I
imagine that accounts for some of his bad numbers.

Don't know Vb.net so this is done in Perl. The machine speed and language are factored out
because its a relative test.

Summary:

,(?=([^"]*"[^"]*")*[^"]*$)          took 10 seconds
,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)  took 2 seconds  

This represents a 5 fold difference.

Benchmark in Perl, 150K line (167MB file):

use strict;
use warnings;

use Benchmark ':hireswallclock';
my ($t0,$t1);
my ($infile, $outfile);

my $tstr = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"
';

# =================================================
print "\nMaking 150K line (167MB file), csv_data_in.txt ...";

open( $infile, ">", 'csv_data_in.txt' ) or die "can't open 'csv_data_in.txt' for writing $!";
for (1 .. 150_000)
{
   print $infile $tstr;
}
close( $infile );

print "\nDone !\n\n";

# =================================================
print "Converting delimiters, writing to csv_data_out.txt ...";

open( $infile, "<", 'csv_data_in.txt' ) or die "can't open 'csv_data_in.txt' for readimg $!";
open( $outfile, ">", 'csv_data_out.txt' ) or die "can't open 'csv_data_out.txt' for writing $!";

my $line = '';

$t0 = new Benchmark;
while( $line = <$infile> )
{
    # Validation - Uncomment to check line for even quotes, otherwise don't
    # if ( $line =~ /^[^"]*(?:"[^"]*"[^"]*)*$/ )
    # {
        $line =~ s/,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)/|$1/g;
    # }
    print $outfile $line;
}
$t1 = new Benchmark;

close( $infile );
close( $outfile );

print "\nDone !\n";
print "Conversion took: ", timestr(timediff($t1, $t0)), "\n\n";

Output:

Making 150K line (167MB file), csv_data_in.txt ...
Done !

Converting delimiters, writing to csv_data_out.txt ...
Done !
Conversion took: 2.1216 wallclock secs ( 1.87 usr +  0.17 sys =  2.04 CPU)


回答3:

This may not be the best solution, but it should work...

I'm 99% sure you're using StreamReader ("sr") to read the file. Try reading it in with a FileIO.TextFieldParser, which will allow you to split the line into string array.

Dim aFile As FileIO.TextFieldParser = New FileIO.TextFieldParser(filePath)
Dim temp() As String ' this array will hold each line of data
Dim order As doOrder = Nothing
Dim orderID As Integer
Dim myDate As DateTime = Now.ToString

aFile.TextFieldType = FileIO.FieldType.Delimited
aFile.Delimiters = New String() {","}
aFile.HasFieldsEnclosedInQuotes = True

temp = aFile.ReadFields

' parse the actual file
Do While Not aFile.EndOfData...

In the loop, keep using "aFile.ReadFields" to read the next line. Once you have your String array, you can concatenate each field with pipes in between them. Kind of messy, and not Regex (don't know if that's an actual condition or just an idea), but will get the job done.

Also, please note the "aFile.HasFieldsEnclosedInQuotes = True" since that is one of the conditions you listed.

EDIT: I see that The Blue Dog gave a Regex answer while I was trying to type... You may still want to look into using the TextFieldParser regardless since you are reading a delimited file. I'll just go away now.