Faster way to create tab-delimited text files?

Posted 2020-06-17 14:40

Many of my programs output huge volumes of data for me to review in Excel. The best way to view all these files is to use a tab-delimited text format. Currently I use this chunk of code to get it done:

ofstream output (fileName.c_str());
for (int j = 0; j < dim; j++)
{
    for (int i = 0; i < dim; i++)
        output << arrayPointer[j * dim + i] << " ";
    output << endl;
}

This seems to be a very slow operation. Is there a more efficient way of writing text files like this to the hard drive?

Update:

Taking the two suggestions into mind, the new code is this:

ofstream output (fileName.c_str());
for (int j = 0; j < dim; j++)
{
    for (int i = 0; i < dim; i++)
        output << arrayPointer[j * dim + i] << "\t";
    output << "\n";
}
output.close();

This writes to the hard drive at 500 KB/s.

But this writes to the hard drive at 50 MB/s:

{
    output.open(fileName.c_str(), std::ios::binary | std::ios::out);
    output.write(reinterpret_cast<char*>(arrayPointer), std::streamsize(dim * dim * sizeof(double)));
    output.close();
}
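
A likely reason for the gap is that the text version pays for a separate formatting call and stream insertion per value, while the binary version is a single bulk write. As a minimal sketch of a middle ground (not from the thread; it assumes the array really holds dim * dim doubles as above, and the helper name is just illustrative), one could format the whole table into one string and write it in a single call:

// Hypothetical sketch: format everything into one buffer, then write once.
// Assumes arrayPointer points to dim * dim doubles, as in the question.
#include <cstdio>
#include <fstream>
#include <string>

void writeTabDelimited(const std::string& fileName, const double* arrayPointer, int dim)
{
    std::string buffer;
    buffer.reserve(static_cast<std::size_t>(dim) * dim * 16); // rough guess per value

    char tmp[64];
    for (int j = 0; j < dim; j++)
    {
        for (int i = 0; i < dim; i++)
        {
            std::snprintf(tmp, sizeof tmp, "%g\t", arrayPointer[j * dim + i]);
            buffer += tmp;
        }
        buffer += '\n';
    }

    std::ofstream output(fileName.c_str(), std::ios::binary);
    output.write(buffer.data(), static_cast<std::streamsize>(buffer.size()));
}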

Tags: c++
6 Answers
时光不老,我们不散
#2 · 2020-06-17 15:18

I decided to test JPvdMerwe's claim that C stdio is faster than C++ IO streams. (Spoiler: yes, but not necessarily by much.) To do this, I used the following test programs:

Common wrapper code, omitted from programs below:

#include <iostream>
#include <cstdio>
int main (void) {
  // program code goes here
}

Program 1: normal synchronized C++ IO streams

for (int j = 0; j < ROWS; j++) {
  for (int i = 0; i < COLS; i++) {
    std::cout << (i-j) << "\t";
  }
  std::cout << "\n";
}

Program 2: unsynchronized C++ IO streams

Same as program 1, except with std::cout.sync_with_stdio(false); prepended.
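
Spelled out, that is:

std::cout.sync_with_stdio(false);  // decouple C++ streams from C stdio
for (int j = 0; j < ROWS; j++) {
  for (int i = 0; i < COLS; i++) {
    std::cout << (i-j) << "\t";
  }
  std::cout << "\n";
}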

Program 3: C stdio printf()

for (int j = 0; j < ROWS; j++) {
  for (int i = 0; i < COLS; i++) {
    printf("%d\t", i-j);
  }
  printf("\n");
}

All programs were compiled with GCC 4.8.4 on Ubuntu Linux, using the following command:

g++ -Wall -ansi -pedantic -DROWS=10000 -DCOLS=1000 prog.cpp -o prog

and timed using the command:

time ./prog > /dev/null

Here are the results of the test on my laptop (measured in wall clock time):

  • Program 1 (synchronized C++ IO): 3.350s (= 100%)
  • Program 2 (unsynchronized C++ IO): 3.072s (= 92%)
  • Program 3 (C stdio): 2.592s (= 77%)

I also ran the same test with g++ -O2 to test the effect of optimization, and got the following results:

  • Program 1 (synchronized C++ IO) with -O2: 3.118s (= 100%)
  • Program 2 (unsynchronized C++ IO) with -O2: 2.943s (= 94%)
  • Program 3 (C stdio) with -O2: 2.734s (= 88%)

(The last line is not a fluke; program 3 consistently runs slower for me with -O2 than without it!)

Thus, my conclusion is that, based on this test, C stdio is indeed about 10% to 25% faster for this task than (synchronized) C++ IO. Using unsynchronized C++ IO saves about 5% to 10% over synchronized IO, but is still slower than stdio.


Ps. I tried a few other variations, too:

  • Using std::endl instead of "\n" is, as expected, slightly slower, but the difference is less than 5% for the parameter values given above. However, printing more but shorter output lines (e.g. -DROWS=1000000 -DCOLS=10) makes std::endl more than 30% slower than "\n".

  • Redirecting the output to a normal file instead of /dev/null slows down all the programs by about 0.2s, but makes no qualitative difference to the results.

  • Increasing the line count by a factor of 10 also yields no surprises; the programs all take about 10 times longer to run, as expected.

  • Prepending std::cout.sync_with_stdio(false); to program 3 has no noticeable effect.

  • Using (double)(i-j) (and "%g\t" for printf()) slows down all three programs a lot! Notably, program 3 is still fastest, taking only 9.3s where programs 1 and 2 each took a bit over 14s, a speedup of nearly 40%! (And yes, I checked, the outputs are identical.) Using -O2 makes no significant difference either.
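
For reference, the double variant of program 3 from the last bullet is simply:

for (int j = 0; j < ROWS; j++) {
  for (int i = 0; i < COLS; i++) {
    printf("%g\t", (double)(i-j));
  }
  printf("\n");
}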

做自己的国王
#3 · 2020-06-17 15:21

Does it have to be written in C++? If not, there are many tools already written in C, e.g. (g)awk (usable on both Unix and Windows), that do this kind of file conversion really well, even on big files. The one-liner below re-emits each record with the output field separator set to a tab: the $1=$1 assignment forces awk to rebuild the record using OFS, and the trailing 1 prints it.

awk '{$1=$1}1' OFS="\t" file
Melony?
#4 · 2020-06-17 15:31
ofstream output (fileName.c_str());
for (int j = 0; j < dim; j++)
{
    for (int i = 0; i < dim; i++)
        output << arrayPointer[j * dim + i] << '\t';
    output << endl;
}

Use '\t' instead of " "

仙女界的扛把子
#5 · 2020-06-17 15:32

Use C IO; it's a lot faster than C++ IO. I've heard of people in programming contests timing out purely because they used C++ IO instead of C IO.

#include <cstdio>

FILE* fout = fopen(fileName.c_str(), "w");

for (int j = 0; j < dim; j++) 
{ 
    for (int i = 0; i < dim; i++) 
        fprintf(fout, "%d\t", arrayPointer[j * dim + i]); 
    fprintf(fout, "\n");
} 
fclose(fout);

Just change %d to be the correct type.
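
For the asker's data, which the update dumps as dim * dim doubles, that would presumably mean %g (or %f/%e) instead of %d, e.g.:

// Assuming arrayPointer holds doubles, as in the question's update:
fprintf(fout, "%g\t", arrayPointer[j * dim + i]);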

爱情/是我丢掉的垃圾
#6 · 2020-06-17 15:34

Don't use endl. It flushes the stream buffers, which is potentially very inefficient. Instead:

output << '\n';
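
For context, std::endl writes a newline and then flushes the stream, so in a loop like the one in the question it forces a flush per row; '\n' just writes the character:

output << std::endl;           // equivalent to:
output << '\n' << std::flush;  // newline followed by an explicit flush
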
chillily
#7 · 2020-06-17 15:34

It may be faster to do it this way:

ofstream output (fileName.c_str());
for (int j = 0; j < dim; j++)
{
    for (int i = 0; i < dim; i++)
        output << arrayPointer[j * dim + i] << '\t';
    output << '\n';
}