How to compare two columns from the same data set?

2019-08-25 07:58发布

I have a data set with 6 columns and 4.5 million rows, and I want to iterate through all the data set to compare the value of the last column with the value of the 1st column for each and every row in my data set and append the rows whose last column value matches the value of first column of a row to that row.

The first solution that came to my mind was using .iter from pandas, but I think it is too slow for large data sets.

let's assume this is my data set:

x = [['2', 'Jack', '8'],['1', 'Ali', '2'],['4' , 'sgee' , '1'],
['5' , 'gabe' , '2'],['100' , 'Jack' , '6'],
['7' , 'Ali' , '2'],['8' , 'nobody' , '20'],['9' , 'Al', '10']] 

the result should look something like this:

[['2', 'Jack', '8', '1', 'Ali', '2', '5' , 'gabe' , '2','7' , 'Ali' , '2'],
 ['1', 'Ali', '2', '4' , 'sgee' , '1'],
['8' , 'nobody' , '20', '2', 'Jack', '8']]

The code I have tried is:

for line in x:
    arow=line
    for row in x:
        brow=row
        if line[2]==row[0]:
            brow.extend(arow)
            table.append(brow)

print(table)

but the results seem to repeat:

[['8', 'nobody', '20', '2', 'Jack', '8'],
 ['2', 'Jack', '8', '1', 'Ali', '2', '5', 'gabe', '2', '7', 'Ali', '2'],
 ['1', 'Ali', '2', '4', 'sgee', '1'],
 ['2', 'Jack', '8', '1', 'Ali', '2', '5', 'gabe', '2', '7', 'Ali', '2'], 
['2', 'Jack', '8', '1', 'Ali', '2', '5', 'gabe', '2', '7', 'Ali', '2']]

标签: python
2条回答
Ridiculous、
2楼-- · 2019-08-25 08:26

You could try using defaultdict:

from collections import defaultdict
from pprint import pprint

x = [['2', 'Jack', '8'],['1', 'Ali', '2'],['4' , 'sgee' , '1'],
['5' , 'gabe' , '2'],['100' , 'Jack' , '6'],
['7' , 'Ali' , '2'],['8' , 'nobody' , '20'],['9' , 'Al', '10']]

d = defaultdict(list)

for v in x:
    d[v[0]] += v
    d[v[-1]] += v

pprint([v for v in d.values() if len(v) > 3])

Prints:

[['2', 'Jack', '8', '1', 'Ali', '2', '5', 'gabe', '2', '7', 'Ali', '2'],
 ['2', 'Jack', '8', '8', 'nobody', '20'],
 ['1', 'Ali', '2', '4', 'sgee', '1']]
查看更多
唯我独甜
3楼-- · 2019-08-25 08:31

you could try using numpy, but this will take on the order of 10s of minutes.

import numpy as np
import time


x = [['2', 'Jack', '8'],['1', 'Ali', '2'],['4' , 'sgee' , '1'],
     ['5' , 'gabe' , '2'],['100' , 'Jack' , '6'],
     ['7' , 'Ali' , '2'],['8' , 'nobody' , '20'],['9' , 'Al', '10']] 

xArr = np.array(x)
st = time.time()
newList = []
for kk,i in enumerate(xArr):

    matches = np.where(xArr[:,-1]==i[0])[0]
    if len(matches)!=0:
        newList.append(np.concatenate([i,xArr[matches].flatten()]))

print('Runtime',time.time() - st)
查看更多
登录 后发表回答