Optimising iteration and substitution over large d

2019-08-18 19:15发布

问题:

I've made a post here, yet as I got no answer as per now I thought maybe to try it also here as I've found it relevant.

I have the following code:

import pandas as pd
import numpy as np
import itertools 
from pprint import pprint

# Importing the data
df=pd.read_csv('./GPr.csv', sep=',',header=None)
data=df.values
res = np.array([[i for i in row if i == i] for row in data.tolist()], dtype=object)

# This function will make the subsets of a list 
def subsets(m,n):
    z = []
    for i in m:
        z.append(list(itertools.combinations(i, n)))
    return(z)

# Make the subsets of size 2 
l=subsets(res,2)
l=[val for sublist in l for val in sublist]
Pairs=list(dict.fromkeys(l)) 

# Modify the pairs: 
mod=[':'.join(x) for x in Pairs]

# Define new lists
t0=res.tolist()
t0=map(tuple,t0)
t1=Pairs
t2=mod

# Make substitions
result = []
for v1, v2 in zip(t1, t2):
    out = []
    for i in t0:
        common = set(v1).intersection(i)
        if set(v1) == common:
            out.append(tuple(list(set(i) - common) + [v2]))
        else:
            out.append(tuple(i))
    result.append(out)

pprint(result, width=200)  

# Delete duplicates
d = {tuple(x): x for x in result} 
remain= list(d.values())  

What it does is as follows: First, we import the csv file we want to work with in here. You can see that it is a list of elements, for each element we find the subsets of size two. We then write a modification to the subsets and call it mod. What it does is to take say ('a','b') and convert it to 'a:b'. We then, for each pair, go through the original data and where ever we find the pairs we substitute them. Finally we delete all the duplicates as it is given.

The code works fine for small set of data. Yet the problem is that the file I have, has 30082 pairs where for each the list of ~49000 list should be scanned and pairs being replaced. I run this in Jupyter and after some time the Kernel dies. I wonder how one can optimise this?

回答1:

Tested on entire file.

Here You go:

=^..^=

import pandas as pd
import numpy as np
import itertools

# Importing the data
df=pd.read_csv('./GPr_test.csv', sep=',',header=None)

# set new data frame
df2 = pd.DataFrame()
pd.options.display.max_colwidth = 200


for index, row in df.iterrows():
    # clean data
    clean_list = [x for x in list(row.values) if str(x) != 'nan']
    # create combinations
    items_combinations = list(itertools.combinations(clean_list, 2))
    # create set combinations
    joint_items_combinations = [':'.join(x) for x in items_combinations]

    # collect rest of item names
    # handle firs element
    if index == 0:
        additional_names = list(df.loc[1].values)
        additional_names = [x for x in additional_names if str(x) != 'nan']
    else:
        additional_names = list(df.loc[index-1].values)
        additional_names = [x for x in additional_names if str(x) != 'nan']

    # get set data
    result = []
    for combination, joint_combination in zip(items_combinations, joint_items_combinations):
        set_data = [item for item in clean_list if item not in combination] + [joint_combination]
        result.append((set_data, additional_names))

    # add data to data frame
    data = pd.DataFrame({"result": result})
    df2 = df2.append(data)


df2 = df2.reset_index().drop(columns=['index'])

For rows:

chicken cinnamon    ginger  onion   soy_sauce
cardamom    coconut pumpkin

Output:

                                                                      result
0   ([ginger, onion, soy_sauce, chicken:cinnamon], [cardamom, coconut, pumpkin])
1   ([cinnamon, onion, soy_sauce, chicken:ginger], [cardamom, coconut, pumpkin])
2   ([cinnamon, ginger, soy_sauce, chicken:onion], [cardamom, coconut, pumpkin])
3   ([cinnamon, ginger, onion, chicken:soy_sauce], [cardamom, coconut, pumpkin])
4   ([chicken, onion, soy_sauce, cinnamon:ginger], [cardamom, coconut, pumpkin])
5   ([chicken, ginger, soy_sauce, cinnamon:onion], [cardamom, coconut, pumpkin])
6   ([chicken, ginger, onion, cinnamon:soy_sauce], [cardamom, coconut, pumpkin])
7   ([chicken, cinnamon, soy_sauce, ginger:onion], [cardamom, coconut, pumpkin])
8   ([chicken, cinnamon, onion, ginger:soy_sauce], [cardamom, coconut, pumpkin])
9   ([chicken, cinnamon, ginger, onion:soy_sauce], [cardamom, coconut, pumpkin])
10  ([pumpkin, cardamom:coconut], [chicken, cinnamon, ginger, onion, soy_sauce])
11  ([coconut, cardamom:pumpkin], [chicken, cinnamon, ginger, onion, soy_sauce])
12  ([cardamom, coconut:pumpkin], [chicken, cinnamon, ginger, onion, soy_sauce])