I have many large CSV files (1-10 GB each) which I'm importing into databases. For each file, I need to replace the first line so I can format the headers to match the column names. My current solution is:
    using (var reader = new StreamReader(file))
    {
        using (var writer = new StreamWriter(fixed))
        {
            var line = reader.ReadLine();
            var fixedLine = parseHeaders(line);
            writer.WriteLine(fixedLine);
            while ((line = reader.ReadLine()) != null)
                writer.WriteLine(line);
        }
    }
What is a quicker way to only replace line 1 without iterating through every other line of these huge files?
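If you can guarantee that fixedLine is the same length as line (or shorter), you can update the files in place instead of copying them. Length here means bytes in the file's encoding, and a shorter header has to be padded (for example with spaces) so that it fills the old line exactly.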
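A minimal sketch of the in-place overwrite, to make the byte-length condition concrete. HeaderPatcher and TryReplaceHeader are hypothetical names, not part of any library. The sketch assumes the file is UTF-8 or another ASCII-compatible encoding, and that whatever imports the CSV tolerates trailing spaces after the last column name. If the file starts with a UTF-8 byte-order mark, the mark is overwritten along with the old header.

```csharp
using System;
using System.IO;
using System.Text;

static class HeaderPatcher
{
    // Overwrites the first line of the file in place.
    // Returns false and leaves the file untouched if the new header
    // needs more bytes than the old one.
    public static bool TryReplaceHeader(string path, string newHeader)
    {
        byte[] newBytes = Encoding.UTF8.GetBytes(newHeader);

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            // Count the bytes of the old header, stopping before "\r" or "\n"
            // so the existing line terminator is left as it is.
            int oldLength = 0;
            int b;
            while ((b = fs.ReadByte()) != -1 && b != '\n' && b != '\r')
                oldLength++;

            if (newBytes.Length > oldLength)
                return false;

            // New header followed by space padding up to the old length.
            var padded = new byte[oldLength];
            Array.Copy(newBytes, padded, newBytes.Length);
            for (int i = newBytes.Length; i < oldLength; i++)
                padded[i] = (byte)' ';

            fs.Seek(0, SeekOrigin.Begin);
            fs.Write(padded, 0, padded.Length);
            return true;
        }
    }
}
```

Only the first line is read and written, so the cost does not depend on the size of the file.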
If not, you can possibly get a small performance improvement by accessing the .BaseStream of your StreamReader and StreamWriter and doing big block copies (using, say, a 32K byte buffer) to do the copying. That at least eliminates the time spent checking every character to see whether it is an end-of-line character, as happens now with reader.ReadLine().
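A rough sketch of the block-copy idea. One caveat: StreamReader reads ahead into its own buffer, so after reader.ReadLine() the BaseStream position generally does not sit at the start of line 2. This sketch therefore skips the old header by scanning raw bytes on a FileStream and then hands the rest to Stream.CopyTo. HeaderCopier and CopyWithNewHeader are hypothetical names, and the code assumes a UTF-8 (or other ASCII-compatible) file.

```csharp
using System.IO;
using System.Text;

static class HeaderCopier
{
    // Writes newHeader to destPath, then copies everything after the first
    // line of sourcePath in 32 KB blocks without inspecting the contents.
    public static void CopyWithNewHeader(string sourcePath, string destPath, string newHeader)
    {
        using (var input = new FileStream(sourcePath, FileMode.Open, FileAccess.Read))
        using (var output = new FileStream(destPath, FileMode.Create, FileAccess.Write))
        {
            // Skip the old header up to and including '\n', remembering the
            // previous byte so the original line-ending style can be reused.
            int prev = -1;
            int b;
            while ((b = input.ReadByte()) != -1 && b != '\n')
                prev = b;

            string terminator = prev == '\r' ? "\r\n" : "\n";
            byte[] header = new UTF8Encoding(false).GetBytes(newHeader + terminator);
            output.Write(header, 0, header.Length);

            // Bulk copy of the remaining bytes.
            input.CopyTo(output, 32 * 1024);
        }
    }
}
```

This still reads and writes the whole file, so the run time stays proportional to the file size. It only avoids the per-line decoding and string allocation.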
The only thing that can significantly speed this up is actually overwriting the first line in place. If the new first line is no longer than the old one, overwrite the old line carefully, padding with spaces if needed.
Otherwise, you have to create a new file and copy everything after the first line. You may be able to optimize the copying a bit by adjusting buffer sizes, copying explicitly as binary, or pre-allocating the output file size, but that will not change the fact that you need to copy the whole file.
One more cheat, if you are planning to load the CSV data into a DB anyway: if row order does not matter, you can read some lines from the beginning, replace them with the new header, and append the removed lines to the end of the file.
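A sketch of that cheat, under several assumptions: the file is UTF-8 or another ASCII-compatible encoding, no quoted field contains a line break, row order is irrelevant, and the importer tolerates trailing spaces in the header. HeaderSwapper and ReplaceHeaderMovingRows are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

static class HeaderSwapper
{
    // Replaces the header with a possibly longer one by taking over the space
    // of the first few data rows and re-appending those rows at the end.
    public static void ReplaceHeaderMovingRows(string path, string newHeader)
    {
        byte[] header = Encoding.UTF8.GetBytes(newHeader);

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            // Read whole lines until they cover the new header plus a
            // line terminator of up to two bytes.
            var region = new MemoryStream();
            int firstLineEnd = -1; // offset just past the old header's '\n'
            int b;
            while ((b = fs.ReadByte()) != -1)
            {
                region.WriteByte((byte)b);
                if (b != '\n') continue;
                if (firstLineEnd < 0) firstLineEnd = (int)region.Length;
                if (region.Length >= header.Length + 2) break;
            }
            if (b == -1)
                throw new InvalidOperationException("File is too small for this trick.");

            byte[] old = region.ToArray();
            int term = (old.Length >= 2 && old[old.Length - 2] == '\r') ? 2 : 1;

            // New first line: header, space padding, then the original terminator.
            byte[] block = (byte[])old.Clone();
            for (int i = 0; i < block.Length - term; i++)
                block[i] = i < header.Length ? header[i] : (byte)' ';

            // Append the displaced rows first, so an interruption leaves
            // duplicated rows rather than lost ones.
            fs.Seek(-1, SeekOrigin.End);
            bool endsWithNewline = fs.ReadByte() == '\n';
            fs.Seek(0, SeekOrigin.End);
            if (!endsWithNewline) fs.WriteByte((byte)'\n');
            fs.Write(old, firstLineEnd, old.Length - firstLineEnd);

            // Then overwrite the start of the file with the new header line.
            fs.Seek(0, SeekOrigin.Begin);
            fs.Write(block, 0, block.Length);
        }
    }
}
```

Only the first few lines and the tail of the file are touched, so the cost does not grow with the file size.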
Side note: if this is a one-time operation, I'd simply copy the files and be done with it. Debugging code that inserts data into the middle of a text file with a potentially different encoding may not be worth the effort.