可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
There are lots of search results on the web (and in SO) for something similar to what I need to do, but I haven't run into a solution yet for my particular situation.
I have a comma-delimited file in which only columns that have commas within them have double quotes around them. Other fields that have no comma in them are simply separated by the comma.
Take this example:
123,"box,toy",phone,"red,car,cat,dog","bike,pencil",man,africa,yellow,"jump,rope"
The output for that line needs to be:
123|box,toy|phone|red,car,cat,dog|bike,pencil|man|africa|yellow|jump,rope
I have this code currently:
Using sr As New StreamReader(csvFilePath)
Dim line As String = ""
Dim strReplacerQuoteCommaQuote As String = Chr(34) & "," & Chr(34)
Dim strReplacerQuoteComma As String = Chr(34) & ","
Dim strReplacerCommaQuote As String = "," & Chr(34)
Do While sr.Peek <> -1
line = sr.ReadLine
line = Replace(line, strReplacerQuoteCommaQuote, "|")
line = Replace(line, strReplacerQuoteComma, "|")
line = Replace(line, strReplacerCommaQuote, "|")
line = Replace(line, Chr(34), "")
Console.WriteLine("line: " & line)
Loop
End Using
The problem with that process is when I get to the fourth Replace() line the string looks like this:
123|box,toy|phone|red,car,cat,dog|bike,pencil|man,africa,yellow|jump,rope
So man and africa need pipes after them, but I can't just do a Replace on all commas, obviously.
How can I do this? Is there a RegEx statement that could handle this?
UPDATE With Working Code
The link in Avinash's comment had the answer that I went with. I Imported System.Text.RegularExpressions and used the following:
Using sr As New StreamReader(csvFilePath)
Dim line As String = ""
Dim strReplacerQuoteCommaQuote As String = Chr(34) & "," & Chr(34)
Dim strReplacerQuoteComma As String = Chr(34) & ","
Dim strReplacerCommaQuote As String = "," & Chr(34)
Do While sr.Peek <> -1
line = sr.ReadLine
Dim pattern As String = "(,)(?=(?:[^""]|""[^""]*"")*$)"
Dim replacement As String = "|"
Dim regEx As New Regex(pattern)
Dim newLine As String = regEx.Replace(line, replacement)
newLine = newLine.Replace(Chr(34), "")
Console.WriteLine("newLine: " & newLine)
Loop
End Using
回答1:
This seems to work for your example:
Dim result = Regex.Replace(input, ",(?=([^""]*""[^""]*"")*[^""]*$)", Function(m) m.Value.Replace(",", "|"))
result = result.Replace(Chr(34), "")
See the accepted answer here for the regex explanation and be sure to upvote @mathematical.coffee's answer there while you're at it as I basically just stole his regex.
Edit:
Regarding your performance issue, I created a file with 90k lines of:
abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,yellow,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"
which roughly equals a file size of 35MB, my laptop (nothing special) will parse that in around 6.5 seconds.
Yes, the regex is slow and the TextFieldParser class is also widely reported as not being the quickest but you obviously have some other bottleneck in your code if you are still processing for over 5 minutes. Please note that I am not actually doing anything with the parsed result though.
Edit 2: Ok, I thought I'd have one last go at this (I was bored this morning) but I still can't replicate your extended conversion time.
Time to get brutal, I created an input file with 150k lines of:
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"
Each line has 1140 characters, total file size ~167MB.
Reading, converting and writing back out to a new file using the following code took 29 seconds.
Dim line, result As String
Dim replace As String = ",(?=([^""]*""[^""]*"")*[^""]*$)"
Using sw As New StreamWriter("d:\output.txt")
Using sr As New StreamReader("d:\input.txt")
While Not sr.EndOfStream
line = sr.ReadLine
result = Regex.Replace(line, replace, Function(m) m.Value.Replace(",", "|"))
sw.WriteLine(result.Replace(Chr(34), ""))
End While
End Using
End Using
Edit 3: Using @sln's regex and this code cuts the processing time for the same file to 4 seconds.
Dim line, result As String
Dim pattern As String = ",([^,""]*(?:""[^""]*"")?[^,""]*)(?=,|$)"
Dim replacement As String = "|$1"
Dim rgx As New Regex(pattern)
Using sw As New StreamWriter("d:\output.txt")
Using sr As New StreamReader("d:\input.txt")
While Not sr.EndOfStream
line = sr.ReadLine
result = rgx.Replace(line, replacement)
sw.WriteLine(result.Replace(Chr(34), ""))
End While
End Using
End Using
So there you go, I think you have a winner. As sln states, this is a relative test so machine speed is irrelevant.
,(?=([^"]*"[^"]*")*[^"]*$) took 29 seconds
,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$) took 4 seconds
Finally (and just for completeness) the solution proposed by @jawood2005 is very workable:
Dim line As String
Dim fields As String()
Using sw As New StreamWriter("d:\output.txt")
Using tfp As New FileIO.TextFieldParser("d:\input.txt")
tfp.TextFieldType = FileIO.FieldType.Delimited
tfp.Delimiters = New String() {","}
tfp.HasFieldsEnclosedInQuotes = True
While Not tfp.EndOfData
fields = tfp.ReadFields
line = String.Join("|", fields)
sw.WriteLine(line.Replace(Chr(34), ""))
End While
End Using
End Using
Using the same 150k row input file as the regex solutions this completes in 18 seconds, so better than mine but sln wins the prize for the fastest solution to your problem.
回答2:
Bullet proof way.
# Validate even quotes (one time match): ^[^"]*(?:"[^"]*"[^"]*)*$
# Then ->
# ----------------------------------------------
# Find: /,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)/
# Replace: '|$1'
,
( # (1 start)
[^,"]*
(?: " [^"]* " )?
[^,"]*
) # (1 end)
(?= , | $ )
Benchmarks
Since @TheBlueDog posted a Benchmark ('Edit 2'), I thought I would post a
benchmark as well.
Its based on his input, and the intent is to demonstrate the evils of using
'to-the-end-of-string' lookahead as a validation technique
(ie. this -> ^[^"]*(?:"[^"]*"[^"]*)*$
)
Blue Dog's regex replace method is hampered a bit with an unnecessary callback, so I
imagine that accounts for some of his bad numbers.
Don't know Vb.net so this is done in Perl. The machine speed and language are factored out
because its a relative test.
Summary:
,(?=([^"]*"[^"]*")*[^"]*$) took 10 seconds
,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$) took 2 seconds
This represents a 5 fold difference.
Benchmark in Perl, 150K line (167MB file):
use strict;
use warnings;
use Benchmark ':hireswallclock';
my ($t0,$t1);
my ($infile, $outfile);
my $tstr = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"
';
# =================================================
print "\nMaking 150K line (167MB file), csv_data_in.txt ...";
open( $infile, ">", 'csv_data_in.txt' ) or die "can't open 'csv_data_in.txt' for writing $!";
for (1 .. 150_000)
{
print $infile $tstr;
}
close( $infile );
print "\nDone !\n\n";
# =================================================
print "Converting delimiters, writing to csv_data_out.txt ...";
open( $infile, "<", 'csv_data_in.txt' ) or die "can't open 'csv_data_in.txt' for readimg $!";
open( $outfile, ">", 'csv_data_out.txt' ) or die "can't open 'csv_data_out.txt' for writing $!";
my $line = '';
$t0 = new Benchmark;
while( $line = <$infile> )
{
# Validation - Uncomment to check line for even quotes, otherwise don't
# if ( $line =~ /^[^"]*(?:"[^"]*"[^"]*)*$/ )
# {
$line =~ s/,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)/|$1/g;
# }
print $outfile $line;
}
$t1 = new Benchmark;
close( $infile );
close( $outfile );
print "\nDone !\n";
print "Conversion took: ", timestr(timediff($t1, $t0)), "\n\n";
Output:
Making 150K line (167MB file), csv_data_in.txt ...
Done !
Converting delimiters, writing to csv_data_out.txt ...
Done !
Conversion took: 2.1216 wallclock secs ( 1.87 usr + 0.17 sys = 2.04 CPU)
回答3:
This may not be the best solution, but it should work...
I'm 99% sure you're using StreamReader ("sr") to read the file. Try reading it in with a FileIO.TextFieldParser, which will allow you to split the line into string array.
Dim aFile As FileIO.TextFieldParser = New FileIO.TextFieldParser(filePath)
Dim temp() As String ' this array will hold each line of data
Dim order As doOrder = Nothing
Dim orderID As Integer
Dim myDate As DateTime = Now.ToString
aFile.TextFieldType = FileIO.FieldType.Delimited
aFile.Delimiters = New String() {","}
aFile.HasFieldsEnclosedInQuotes = True
temp = aFile.ReadFields
' parse the actual file
Do While Not aFile.EndOfData...
In the loop, keep using "aFile.ReadFields" to read the next line. Once you have your String array, you can concatenate each field with pipes in between them. Kind of messy, and not Regex (don't know if that's an actual condition or just an idea), but will get the job done.
Also, please note the "aFile.HasFieldsEnclosedInQuotes = True" since that is one of the conditions you listed.
EDIT: I see that The Blue Dog gave a Regex answer while I was trying to type... You may still want to look into using the TextFieldParser regardless since you are reading a delimited file. I'll just go away now.