How to compare two columns from the same data set?

I have a data set with 6 columns and 4.5 million rows. For each row, I want to find every row whose last-column value equals that row's first-column value, and append those matching rows to it.

The first solution that came to mind was iterating over the rows with pandas, but I think that is too slow for a data set this large.

Let's assume this is my data set:

x = [['2', 'Jack', '8'],['1', 'Ali', '2'],['4' , 'sgee' , '1'],
['5' , 'gabe' , '2'],['100' , 'Jack' , '6'],
['7' , 'Ali' , '2'],['8' , 'nobody' , '20'],['9' , 'Al', '10']]

The result should look something like this:

[['2', 'Jack', '8', '1', 'Ali', '2', '5' , 'gabe' , '2','7' , 'Ali' , '2'],
 ['1', 'Ali', '2', '4' , 'sgee' , '1'],
['8' , 'nobody' , '20', '2', 'Jack', '8']]

The code I have tried is:

table = []
for line in x:
    arow=line
    for row in x:
        brow=row
        if line[2]==row[0]:
            brow.extend(arow)
            table.append(brow)

print(table)

However, some rows are repeated in the result:

[['8', 'nobody', '20', '2', 'Jack', '8'],
 ['2', 'Jack', '8', '1', 'Ali', '2', '5', 'gabe', '2', '7', 'Ali', '2'],
 ['1', 'Ali', '2', '4', 'sgee', '1'],
 ['2', 'Jack', '8', '1', 'Ali', '2', '5', 'gabe', '2', '7', 'Ali', '2'], 
['2', 'Jack', '8', '1', 'Ali', '2', '5', 'gabe', '2', '7', 'Ali', '2']]

Tags: python

2 answers

Ridiculous、

#2 · 2019-08-25 08:26

You could try using defaultdict:

from collections import defaultdict
from pprint import pprint

x = [['2', 'Jack', '8'],['1', 'Ali', '2'],['4' , 'sgee' , '1'],
['5' , 'gabe' , '2'],['100' , 'Jack' , '6'],
['7' , 'Ali' , '2'],['8' , 'nobody' , '20'],['9' , 'Al', '10']]

d = defaultdict(list)

for v in x:
    d[v[0]] += v
    d[v[-1]] += v

pprint([v for v in d.values() if len(v) > 3])

Prints:

[['2', 'Jack', '8', '1', 'Ali', '2', '5', 'gabe', '2', '7', 'Ali', '2'],
 ['2', 'Jack', '8', '8', 'nobody', '20'],
 ['1', 'Ali', '2', '4', 'sgee', '1']]
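Note that the group for '8' above starts with the Jack row, whereas the result requested in the question starts with the row whose first column is '8'. A minimal sketch of a variant that keeps each row at the front of its own group, assuming rows are matched on exact string equality and that the first and last columns are the keys (the names `by_last` and `result` are illustrative and not part of the answer above):

```python
from collections import defaultdict

x = [['2', 'Jack', '8'], ['1', 'Ali', '2'], ['4', 'sgee', '1'],
     ['5', 'gabe', '2'], ['100', 'Jack', '6'],
     ['7', 'Ali', '2'], ['8', 'nobody', '20'], ['9', 'Al', '10']]

# Index the rows by their last-column value in a single pass.
by_last = defaultdict(list)
for row in x:
    by_last[row[-1]].append(row)

# For each row, look up the rows whose last column equals its first column.
result = []
for row in x:
    matches = by_last.get(row[0])
    if matches:
        combined = list(row)  # copy, so the rows in x are not mutated
        for match in matches:
            combined.extend(match)
        result.append(combined)

print(result)
```

Copying each row before extending it also avoids the repeated entries seen in the question, where `brow.extend(arow)` modified the rows of `x` in place and the same list object was appended to `table` several times. Both passes are linear in the number of rows.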


唯我独甜

#3 · 2019-08-25 08:31

You could try using NumPy, but this will take on the order of tens of minutes.

import numpy as np
import time


x = [['2', 'Jack', '8'],['1', 'Ali', '2'],['4' , 'sgee' , '1'],
     ['5' , 'gabe' , '2'],['100' , 'Jack' , '6'],
     ['7' , 'Ali' , '2'],['8' , 'nobody' , '20'],['9' , 'Al', '10']] 

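xArr = np.array(x)
st = time.time()
newList = []
for i in xArr:
    # Indices of the rows whose last column equals this row's first column.
    matches = np.where(xArr[:, -1] == i[0])[0]
    if len(matches) != 0:
        # Append the matching rows, flattened, to the current row.
        newList.append(np.concatenate([i, xArr[matches].flatten()]))

print('Runtime', time.time() - st)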
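The loop above scans the whole last column once per row, so the work grows quadratically with the number of rows. A possible way to avoid that, sketched here under the assumption that all values are strings compared for exact equality, is to sort the last column once and find each row's matches with a binary search (`order`, `lo`, `hi` and `result` are illustrative names, not from the code above):

```python
import numpy as np

x = [['2', 'Jack', '8'], ['1', 'Ali', '2'], ['4', 'sgee', '1'],
     ['5', 'gabe', '2'], ['100', 'Jack', '6'],
     ['7', 'Ali', '2'], ['8', 'nobody', '20'], ['9', 'Al', '10']]

arr = np.array(x)
first, last = arr[:, 0], arr[:, -1]

# Sort the last column once; a stable sort keeps equal keys in input order.
order = np.argsort(last, kind="stable")
sorted_last = last[order]

# For each first-column value, find the slice of sorted_last that equals it.
lo = np.searchsorted(sorted_last, first, side="left")
hi = np.searchsorted(sorted_last, first, side="right")

# Build the combined rows only where at least one match exists.
result = [
    np.concatenate([arr[i], arr[order[lo[i]:hi[i]]].ravel()]).tolist()
    for i in np.flatnonzero(hi > lo)
]

print(result)
```

The sort and the searches take roughly n log n steps in total instead of one full column scan per row. The final list comprehension is still a Python-level loop, but it only visits rows that have at least one match.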

