I have a C++ program that will read in data from a binary file and originally I stored data in std::vector<char*> data
. I have changed my code so that I am now using strings instead of char*, so that std::vector<std::string> data
. Some changes I had to make was to change from strcmp
to compare
for example.
However I have seen my execution time dramatically increase. For a sample file, when I used char* it took 0.38s and after the conversion to string it took 1.72s on my Linux machine. I observed a similar problem on my Windows machine with execution time increasing from 0.59s to 1.05s.
I believe this function is causing the slow down. It is part of the converter class, note private variables designated with_
at the end of variable name. I clearly am having memory problems here and stuck in between C and C++ code. I want this to be C++ code, so I updated the code at the bottom.
I access ids_
and names_
many times in another function too, so access speed is very important. Through the use of creating a map
instead of two separate vectors, I have been able to achieve faster speeds with more stable C++ code. Thanks to everyone!
Example NewList.Txt
2515 ABC 23.5 32 -99 1875.7 1
1676 XYZ 12.5 31 -97 530.82 2
279 FOO 45.5 31 -96 530.8 3
OLD Code:
void converter::updateNewList(){
FILE* NewList;
char lineBuffer[100];
char* id = 0;
char* name = 0;
int l = 0;
int n;
NewList = fopen("NewList.txt","r");
if (NewList == NULL){
std::cerr << "Error in reading NewList.txt\n";
exit(EXIT_FAILURE);
}
while(!feof(NewList)){
fgets (lineBuffer , 100 , NewList); // Read line
l = 0;
while (!isspace(lineBuffer[l])){
l = l + 1;
}
id = new char[l];
switch (l){
case 1:
n = sprintf (id, "%c", lineBuffer[0]);
break;
case 2:
n = sprintf (id, "%c%c", lineBuffer[0], lineBuffer[1]);
break;
case 3:
n = sprintf (id, "%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2]);
break;
case 4:
n = sprintf (id, "%c%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2],lineBuffer[3]);
break;
default:
n = -1;
break;
}
if (n < 0){
std::cerr << "Error in processing ids from NewList.txt\n";
exit(EXIT_FAILURE);
}
l = l + 1;
int s = l;
while (!isspace(lineBuffer[l])){
l = l + 1;
}
name = new char[l-s];
switch (l-s){
case 2:
n = sprintf (name, "%c%c", lineBuffer[s+0], lineBuffer[s+1]);
break;
case 3:
n = sprintf (name, "%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2]);
break;
case 4:
n = sprintf (name, "%c%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2],lineBuffer[s+3]);
break;
default:
n = -1;
break;
}
if (n < 0){
std::cerr << "Error in processing short name from NewList.txt\n";
exit(EXIT_FAILURE);
}
ids_.push_back ( std::string(id) );
names_.push_back(std::string(name));
}
bool isFound = false;
for (unsigned int i = 0; i < siteNames_.size(); i ++) {
isFound = false;
for (unsigned int j = 0; j < names_.size(); j ++) {
if (siteNames_[i].compare(names_[j]) == 0){
isFound = true;
}
}
}
fclose(NewList);
delete [] id;
delete [] name;
}
C++ CODE
void converter::updateNewList(){
std::ifstream NewList ("NewList.txt");
while(NewList.good()){
unsigned int id (0);
std::string name;
// get the ID and name
NewList >> id >> name;
// ignore the rest of the line
NewList.ignore( std::numeric_limits<std::streamsize>::max(), '\n');
info_.insert(std::pair<std::string, unsigned int>(name,id));
}
NewList.close();
}
UPDATE: Follow up question: Bottleneck from comparing strings and thanks for the very useful help! I will not be making these mistakes in the future!
Resize vector to large enough size before you start populating it. Or, use pointers to strings instead of strings.
The thing is that the strings are being copied each time the vector is being auto-resized. For small objects such as pointers this cost nearly nothing, but for strings the whole string is copied in full.
And id and name should be
string
instead ofchar*
, and be initialized like this (assuming that you still usestring
instead ofstring*
):Except for std::string, this is a C program.
Try using fstream, and use the profiler to detect the bottle neck.
You can try to
reserve
a number ofvector
values in order to reduce the number of allocations (which are costly), as said Dialecticus (probably from the ancient Roma?).But there is something that may deserve some observation: how do you store the strings from the file, do you perform concatenations etc...
In C, strings (which do not exist per say - they don't have a container from a library like the STL) need more work to deal with, but at least we know what happens clearly when dealing with them. In the STL, each convenient operation (meaning requiring less work from the programmer) may actually require a lot of operations behind the scene, within the
string
class, depending on how you use it.So, while the allocations / freeings are a costly process, the rest of the logic, especially the strings process, may / should probably be looked at as well.
streams take care of a lot of the heavy lifting for you. Stop doing it all yourself, and let the library help you:
There's no need to do the whitespace-tokenizing manually.
Also, you may find this site a helpful reference: http://www.cplusplus.com/reference/iostream/ifstream/
My guess it that it should be tied to the vector<string>'s performance
About the vector
A
std::vector
works with an internal contiguous array, meaning that once the array is full, it needs to create another, larger array, and copy the strings one by one, which means a copy-construction and a destruction of string which had the same contents, which is counter-productive...To confirm this easily, then use a
std::vector<std::string *>
and see if there is a difference in performance.If this is the case, they you can do one of those four things:
reserve()
to reserve enough space in the internal array, to avoid useless reallocations.std::deque
, which works almost like a vectorstd::list
(which doesn't give you random access to its items)About the string
Note: I'm assuming that your strings\char * are created once, and not modified (through a realloc, an append, etc.).
If the ideas above are not enough, then...
The allocation of the string object's internal buffer is similar to a malloc of a
char *
, so you should see little or no differences between the two.Now, if your
char *
are in truthchar[SOME_CONSTANT_SIZE]
, then you avoid the malloc (and thus, will go faster than a std::string).Edit
After reading the updated code, I see the following problems.
reserve()
on ids_ and and names_Other comments:
int
for sizes and lengths in STL. At very least, usesize_t
. In 64-bit, size_t will become 64-bit, while int will remain 32-bits, so your code is not 64-bit ready (in the other hand, I see few cases of incoming 8 Go strings... but still, better be correct...)Edit 2
The two (so called C and C++) codes are different. The "C code" expects ids and names of length lesser than 5, or the program exists with an error. The "C++ code" has no such limitation. Still, this limitation is ground for massive optimization, if you confirm names and ids are always less then 5 characters.
Before fixing something make sure that it is bottleneck. Otherwise you are wasting your time. Plus this sort of optimization is microoptimization. If you are doing microoptimization in C++ then consider using bare C.