I have a data set with 6 columns and 4.5 million rows. I want to iterate through the whole data set and, for every row, append to it all rows whose last-column value matches that row's first-column value.
The first solution that came to mind was using .iter from pandas, but I think it is too slow for a data set this large.

Let's assume this is my data set:
x = [['2', 'Jack', '8'], ['1', 'Ali', '2'], ['4', 'sgee', '1'],
     ['5', 'gabe', '2'], ['100', 'Jack', '6'],
     ['7', 'Ali', '2'], ['8', 'nobody', '20'], ['9', 'Al', '10']]
The result should look something like this:
[['2', 'Jack', '8', '1', 'Ali', '2', '5', 'gabe', '2', '7', 'Ali', '2'],
 ['1', 'Ali', '2', '4', 'sgee', '1'],
 ['8', 'nobody', '20', '2', 'Jack', '8']]
The code I have tried is:
table = []                        # collected output rows
for line in x:
    arow = line
    for row in x:
        brow = row
        if line[2] == row[0]:     # last column of `line` matches first column of `row`
            brow.extend(arow)
            table.append(brow)
print(table)
But the result contains repeated rows:
[['8', 'nobody', '20', '2', 'Jack', '8'],
['2', 'Jack', '8', '1', 'Ali', '2', '5', 'gabe', '2', '7', 'Ali', '2'],
['1', 'Ali', '2', '4', 'sgee', '1'],
['2', 'Jack', '8', '1', 'Ali', '2', '5', 'gabe', '2', '7', 'Ali', '2'],
['2', 'Jack', '8', '1', 'Ali', '2', '5', 'gabe', '2', '7', 'Ali', '2']]
You could try using defaultdict:
You could try using NumPy, but that will still take on the order of tens of minutes.
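One possible reading of that NumPy suggestion is sketched below (my own illustration, not code from the original post, and it assumes the same x as above): a boolean mask is computed for every row, so the total work is still quadratic in the number of rows, which is why it would stay slow at 4.5 million rows.

import numpy as np

arr = np.array(x)                 # (n_rows, 3) array of strings
last_col = arr[:, -1]

table = []
for row in arr:
    # One full vectorized scan per row: fast per scan, but O(n^2) overall.
    mask = last_col == row[0]
    if mask.any():
        combined = row.tolist()
        for match in arr[mask]:
            combined.extend(match.tolist())
        table.append(combined)

print(table)

This produces the same output as the defaultdict version; the difference is that the defaultdict version replaces the per-row scan with a single dictionary lookup.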