I am currently trying to create a C++ function to join two pipe-delimited files with over 10,000,000 records on one or two key fields.
The files look like
P2347|John Doe|C1234
P7634|Peter Parker|D2344
P522|Toni Stark|T288
and
P2347|Bruce Wayne|C1234
P1111|Captain America|D534
P522|Terminator|T288
To join on field 1 and 3, the expected output should show:
P2347|C1234|John Doe|Bruce Wayne
P522|T288|Toni Stark|Terminator
What I am currently thinking about is using a set/array/vector to read in the files and create something like:
P2347|C1234>>John Doe
P522|T288>>Toni Stark
and
P2347|C1234>>Bruce Wayne
P522|T288>>Terminator
And then split off the first part as the key and match it against the second set/vector/array.
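A rough sketch of the splitting step I have in mind (assuming every line has exactly three pipe-delimited fields; real input would need validation):

#include <string>

// Build "field1|field3" as the key and keep field2 as the value from a
// line such as "P2347|John Doe|C1234". Assumes exactly three fields.
void splitKeyValue(const std::string &line, std::string &key, std::string &value)
{
    auto p1 = line.find('|');
    auto p2 = line.find('|', p1 + 1);
    key = line.substr(0, p1 + 1) + line.substr(p2 + 1); // "P2347|C1234"
    value = line.substr(p1 + 1, p2 - p1 - 1);           // "John Doe"
}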
What I currently have is: read in the first file and match the second file against it line by line. At the moment it takes the whole line as the key and matches complete lines:
#include <iostream>
#include <fstream>
#include <string>
#include <set>
#include <ctime>
using namespace std;

int main()
{
    clock_t startTime = clock();

    // read the whole first file into a set of lines
    ifstream inf("test.txt");
    set<string> lines;
    string line;
    for (unsigned int i = 1; std::getline(inf, line); ++i)
        lines.insert(line);

    ifstream inf2("test2.txt");
    clock_t midTime = clock();

    // write out every line of the second file that also occurs in the first
    ofstream outputFile("output.txt");
    while (getline(inf2, line))
    {
        if (lines.find(line) != lines.end())
            outputFile << line << "\n";
    }
    return 0;
}
I would be very happy about any suggestions. I am also happy to change the whole concept if there is a better (faster) way. Speed is critical, as there might be even more than 10 million records.
EDIT: Another idea would be to use a map and have the key fields as the map key - but this might be a little slower. Any suggestions?
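A minimal sketch of that map idea (my assumptions: std::unordered_map instead of std::map for average O(1) lookups, the three-field layout and file names from the example above, and the splitKeyValue helper from the earlier sketch):

#include <fstream>
#include <string>
#include <unordered_map>

// Same three-field helper as sketched above.
void splitKeyValue(const std::string &line, std::string &key, std::string &value)
{
    auto p1 = line.find('|');
    auto p2 = line.find('|', p1 + 1);
    key = line.substr(0, p1 + 1) + line.substr(p2 + 1);
    value = line.substr(p1 + 1, p2 - p1 - 1);
}

int main()
{
    // key ("P2347|C1234") -> name from the first file
    std::unordered_map<std::string, std::string> names;
    std::string line, key, value;

    std::ifstream first("test.txt");
    while (std::getline(first, line)) {
        splitKeyValue(line, key, value);
        names.emplace(key, value);
    }

    std::ifstream second("test2.txt");
    std::ofstream out("output.txt");
    while (std::getline(second, line)) {
        splitKeyValue(line, key, value);
        auto it = names.find(key);
        if (it != names.end())
            out << key << '|' << it->second << '|' << value << '\n';
    }
    return 0;
}

With the sample data this writes "P2347|C1234|John Doe|Bruce Wayne" and "P522|T288|Toni Stark|Terminator", and each lookup avoids the O(log n) comparisons a set or map would do.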
Thanks a lot for any help!
I have tried multiple ways to get this task completed; none of them has been efficient so far:
Read everything into a set and parse the key fields into the format keys>>values, simulating an array-like set. Parsing took a long time, but memory usage stays relatively low. Not fully developed code:
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <set>
#include <vector>
#include <ctime>

std::vector<std::string> &split(const std::string &s, char delim, std::vector<std::string> &elems) {
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, delim)) {
        elems.push_back(item);
    }
    return elems;
}

std::vector<std::string> split(const std::string &s, char delim) {
    std::vector<std::string> elems;
    split(s, delim, elems);
    return elems;
}

std::string getSelectedRecords(std::string record, int position){
    std::string values;
    std::vector<std::string> tokens = split(record, ' ');
    // get position in vector
    for (auto &s : tokens)
        // pick last one or depending on number, not developed
        values = s;
    return values;
}

int main()
{
    clock_t startTime = clock();

    std::ifstream secondaryFile("C:/Users/Batman/Desktop/test/secondary.txt");
    std::set<std::string> secondarySet;
    std::string record;
    for (unsigned int i = 1; std::getline(secondaryFile, record); ++i){
        std::string keys = getSelectedRecords(record, 2);
        std::string values = getSelectedRecords(record, 1);
        secondarySet.insert(keys + ">>" + values);
    }

    clock_t midTime = clock();

    std::ifstream primaryFile("C:/Users/Batman/Desktop/test/primary.txt");
    std::ofstream outputFile("C:/Users/Batman/Desktop/test/output.txt");
    while (getline(primaryFile, record))
    {
        // rewrite find() to search the set for all entries whose key part
        // (everything before ">>") matches, and output the values
        std::string keys = getSelectedRecords(record, 2);
        if (secondarySet.find(keys) != secondarySet.end())
            outputFile << record << "\n";
    }
    return 0;
}
Instead of pipe-delimited, the code currently uses space-delimited files, but that should not be a problem. Reading the data is very quick, but parsing it takes an awful lot of time.
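One thing I noticed is that split() constructs a std::stringstream for every line, which is expensive. A stream-free split along these lines (same interface; single-character delimiter assumed) should parse much faster:

#include <string>
#include <vector>

// Same behavior as split() above, but using find()/substr() on the raw
// string instead of building a std::stringstream per line, which is
// much cheaper for bulk parsing.
std::vector<std::string> fastSplit(const std::string &s, char delim)
{
    std::vector<std::string> elems;
    std::string::size_type start = 0, pos;
    while ((pos = s.find(delim, start)) != std::string::npos) {
        elems.push_back(s.substr(start, pos - start));
        start = pos + 1;
    }
    elems.push_back(s.substr(start));
    return elems;
}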
The other option was taking a multimap. Similar concept, with key fields pointing to values, but this one is very slow and memory-intensive.
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <map>
#include <vector>
#include <iterator>
#include <ctime>

int main()
{
    std::clock_t startTime = clock();

    std::ifstream inf("C:/Users/Batman/Desktop/test/test.txt");
    typedef std::multimap<std::string, std::string> Map;
    Map map;
    std::string line;
    for (unsigned int i = 1; std::getline(inf, line); ++i){
        // load tokens into vector
        std::istringstream buffer(line);
        std::istream_iterator<std::string> beg(buffer), end;
        std::vector<std::string> tokens(beg, end);
        // get keys and values, insert into the multimap (not fully
        // developed - currently last token as key, first as value)
        if (!tokens.empty())
            map.insert(Map::value_type(tokens.back(), tokens.front()));
    }

    std::ofstream outputFile("C:/Users/Batman/Desktop/test/output.txt");
    for (const auto &entry : map)
        //std::cout << entry.first << " >> " << entry.second << "\n";
        outputFile << entry.first << " >> " << entry.second << "\n";
    return 0;
}
Further thoughts: splitting the pipe-delimited files into separate files with one column each right when importing the data. With that I would not have to parse anything, but could read in each column individually.
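A quick sketch of that import step (file names are made up; assumes exactly three columns per line):

#include <fstream>
#include <string>

// Write each of the three pipe-delimited columns of the input to its
// own file, so a later pass can read one column without re-parsing
// whole lines.
int main()
{
    std::ifstream in("input.txt");
    std::ofstream col1("col1.txt"), col2("col2.txt"), col3("col3.txt");
    std::string line;
    while (std::getline(in, line)) {
        auto p1 = line.find('|');
        auto p2 = line.find('|', p1 + 1);
        col1 << line.substr(0, p1) << '\n';
        col2 << line.substr(p1 + 1, p2 - p1 - 1) << '\n';
        col3 << line.substr(p2 + 1) << '\n';
    }
    return 0;
}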
EDIT: optimized the first example with a recursive split function. Still >30 seconds for 100,000 records. I would like to see that faster, plus the actual find() function is still missing.
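For the missing find(), one option (untested sketch) is to exploit that std::set is sorted, so all entries sharing the "keys>>" prefix are contiguous and can be collected starting from lower_bound():

#include <set>
#include <string>
#include <vector>

// Collect all values stored under `key` in a set of "key>>value"
// strings. Entries sharing the "key>>" prefix sort next to each other,
// so we can start at lower_bound() and stop at the first mismatch.
std::vector<std::string> findValues(const std::set<std::string> &s,
                                    const std::string &key)
{
    std::vector<std::string> out;
    const std::string prefix = key + ">>";
    for (auto it = s.lower_bound(prefix);
         it != s.end() && it->compare(0, prefix.size(), prefix) == 0;
         ++it)
        out.push_back(it->substr(prefix.size()));
    return out;
}

Given entries like "P2347|C1234>>John Doe", findValues(secondarySet, "P2347|C1234") would return {"John Doe"}; each lookup is O(log n) plus the number of matches.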
Any thoughts?
Thanks!