I am trying to parse a simple CSV file, with data in a format such as:
20.5,20.5,20.5,0.794145,4.05286,0.792519,1
20.5,30.5,20.5,0.753669,3.91888,0.749897,1
20.5,40.5,20.5,0.701055,3.80348,0.695326,1
So, a very simple and fixed format file. I am storing each column of this data into a STL vector. As such I've tried to stay the C++ way using the standard library, and my implementation within a loop looks something like:
string field;
getline(file,line);
stringstream ssline(line);
getline( ssline, field, ',' );
stringstream fs1(field);
fs1 >> cent_x.at(n);
getline( ssline, field, ',' );
stringstream fs2(field);
fs2 >> cent_y.at(n);
getline( ssline, field, ',' );
stringstream fs3(field);
fs3 >> cent_z.at(n);
getline( ssline, field, ',' );
stringstream fs4(field);
fs4 >> u.at(n);
getline( ssline, field, ',' );
stringstream fs5(field);
fs5 >> v.at(n);
getline( ssline, field, ',' );
stringstream fs6(field);
fs6 >> w.at(n);
The problem is, this is extremely slow (there are over 1 million rows per data file), and seems to me to be a bit inelegant. Is there a faster approach using the standard library, or should I just use stdio functions? It seems to me this entire code block would reduce to a single fscanf call.
Thanks in advance!
Using 7 string streams when you can do it with just one sure doesn't help wrt. performance.
Try this instead:
string line;
getline(file, line);
istringstream ss(line); // note we use istringstream, we don't need the o part of stringstream
char c1, c2, c3, c4, c5; // to eat the commas
ss >> cent_x.at(n) >> c1 >>
cent_y.at(n) >> c2 >>
cent_z.at(n) >> c3 >>
u.at(n) >> c4 >>
v.at(n) >> c5 >>
w.at(n);
If you know the number of lines in the file, you can resize the vectors prior to reading and then use operator[]
instead of at()
. This way you avoid bounds checking and thus gain a little performance.
I believe the major bottleneck (put aside the getline()-based non-buffered I/O) is the string parsing. Since you have the "," symbol as a delimiter, you may perform a linear scan over the string and replace all "," by "\0" (the end-of-string marker, zero-terminator).
Something like this:
// tmp array for the line part values
double parts[MAX_PARTS];
while(getline(file, line))
{
size_t len = line.length();
size_t j;
if(line.empty()) { continue; }
const char* last_start = &line[0];
int num_parts = 0;
while(j < len)
{
if(line[j] == ',')
{
line[j] = '\0';
if(num_parts == MAX_PARTS) { break; }
parts[num_parts] = atof(last_start);
j++;
num_parts++;
last_start = &line[j];
}
j++;
}
/// do whatever you need with the parts[] array
}
I don't know if this will be quicker than the accepted answer, but I might as well post it anyway in case you wish to try it.
You can load in the entire contents of the file using a single read call by knowing the size of the file using some fseek magic. This will be much faster than multiple read calls.
You could then do something like this to parse your string:
//Delimited string to vector
vector<string> dstov(string& str, string delimiter)
{
//Vector to populate
vector<string> ret;
//Current position in str
size_t pos = 0;
//While the the string from point pos contains the delimiter
while(str.substr(pos).find(delimiter) != string::npos)
{
//Insert the substring from pos to the start of the found delimiter to the vector
ret.push_back(str.substr(pos, str.substr(pos).find(delimiter)));
//Move the pos past this found section and the found delimiter so the search can continue
pos += str.substr(pos).find(delimiter) + delimiter.size();
}
//Push back the final element in str when str contains no more delimiters
ret.push_back(str.substr(pos));
return ret;
}
string rawfiledata;
//This call will parse the raw data into a vector containing lines of
//20.5,30.5,20.5,0.753669,3.91888,0.749897,1 by treating the newline
//as the delimiter
vector<string> lines = dstov(rawfiledata, "\n");
//You can then iterate over the lines and parse them into variables and do whatever you need with them.
for(size_t itr = 0; itr < lines.size(); ++itr)
vector<string> line_variables = dstov(lines[itr], ",");