I've posted this question here, but as I haven't received an answer so far, I thought I'd try it here as well, since it seems relevant.
I have the following code:
import pandas as pd
import numpy as np
import itertools
from pprint import pprint
# Importing the data
df=pd.read_csv('./GPr.csv', sep=',',header=None)
data=df.values
res = np.array([[i for i in row if i == i] for row in data.tolist()], dtype=object)
# This function will make the subsets of a list
def subsets(m,n):
    z = []
    for i in m:
        z.append(list(itertools.combinations(i, n)))
    return z
# Make the subsets of size 2
l=subsets(res,2)
l=[val for sublist in l for val in sublist]
Pairs=list(dict.fromkeys(l))
# Modify the pairs:
mod=[':'.join(x) for x in Pairs]
# Define new lists
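t0=res.tolist()
# Materialise as a list: in Python 3, map() returns a one-shot iterator
# that would be exhausted after the first pass of the loop below
t0=list(map(tuple,t0))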
t1=Pairs
t2=mod
# Make substitutions
result = []
for v1, v2 in zip(t1, t2):
    out = []
    for i in t0:
        common = set(v1).intersection(i)
        if set(v1) == common:
            out.append(tuple(list(set(i) - common) + [v2]))
        else:
            out.append(tuple(i))
    result.append(out)
pprint(result, width=200)
# Delete duplicates
d = {tuple(x): x for x in result}
remain= list(d.values())
What it does is as follows. First, we import the CSV file we want to work with. Each row is a list of elements, and for each row we find all subsets of size two. We then build a modified form of each pair and call it mod: for example, ('a','b') becomes 'a:b'. Then, for each pair, we go through the original data and, wherever a row contains both elements of the pair, we substitute the joined form for them. Finally, we delete all the duplicates.
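To make the description above concrete, here is a minimal sketch on two made-up rows (the `rows` data and the `substitute` helper are hypothetical, not part of the original code). Unlike the original, which goes through `set(i) - common` and so does not guarantee element order, this version keeps the remaining elements in their original order.

```python
from itertools import combinations

# Hypothetical in-memory rows standing in for the CSV contents.
rows = [("a", "b", "c"), ("b", "c")]

# Unique pairs across all rows, in first-seen order.
pairs = list(dict.fromkeys(p for row in rows for p in combinations(row, 2)))

def substitute(row, pair):
    """Replace both members of `pair` in `row` with 'x:y' if both are present."""
    if set(pair) <= set(row):
        rest = [item for item in row if item not in pair]
        return tuple(rest + [":".join(pair)])
    return tuple(row)

# One full copy of the data per pair, as in the original code.
result = [[substitute(row, pair) for row in rows] for pair in pairs]

for pair, out in zip(pairs, result):
    print(pair, out)
```

For the pair `('b', 'c')`, for instance, both rows contain the pair, so they become `('a', 'b:c')` and `('b:c',)`.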
The code works fine for a small data set. The problem is that my file yields 30082 pairs, and for each pair the list of ~49000 lists has to be scanned and the pair replaced. I run this in Jupyter and after some time the kernel dies. How can I optimise this?
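A rough back-of-the-envelope estimate suggests why memory, rather than speed alone, is likely the issue: `result` stores one full copy of the data per pair. The sketch below uses only the two figures quoted above; the 8 bytes per reference is an assumption for a 64-bit CPython build, and it counts only the list slots, not the tuples themselves.

```python
n_pairs = 30082   # number of unique pairs, from the question
n_rows = 49000    # approximate number of rows, from the question

# `result` holds one entry per (pair, row) combination.
n_entries = n_pairs * n_rows

# Assumption: 8 bytes per reference on 64-bit CPython; tuple objects come on top.
pointer_bytes = n_entries * 8

print(f"{n_entries:,} entries, at least {pointer_bytes / 1e9:.1f} GB of references")
```

That is about 1.47 billion entries before a single tuple is counted, so avoiding the full per-pair copy matters more than speeding up the inner loop.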
Tested on entire file.
Here you go:
=^..^=
import pandas as pd
import numpy as np
import itertools
# Importing the data
df=pd.read_csv('./GPr_test.csv', sep=',',header=None)
# set new data frame
df2 = pd.DataFrame()
pd.options.display.max_colwidth = 200
for index, row in df.iterrows():
    # clean data
    clean_list = [x for x in list(row.values) if str(x) != 'nan']
    # create combinations
    items_combinations = list(itertools.combinations(clean_list, 2))
    # create set combinations
    joint_items_combinations = [':'.join(x) for x in items_combinations]
    # collect rest of item names
    # handle first element
    if index == 0:
        additional_names = list(df.loc[1].values)
        additional_names = [x for x in additional_names if str(x) != 'nan']
    else:
        additional_names = list(df.loc[index-1].values)
        additional_names = [x for x in additional_names if str(x) != 'nan']
    # get set data
    result = []
    for combination, joint_combination in zip(items_combinations, joint_items_combinations):
        set_data = [item for item in clean_list if item not in combination] + [joint_combination]
        result.append((set_data, additional_names))
    # add data to data frame
    data = pd.DataFrame({"result": result})
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df2 = pd.concat([df2, data])
df2 = df2.reset_index().drop(columns=['index'])
For rows:
chicken cinnamon ginger onion soy_sauce
cardamom coconut pumpkin
Output:
result
0 ([ginger, onion, soy_sauce, chicken:cinnamon], [cardamom, coconut, pumpkin])
1 ([cinnamon, onion, soy_sauce, chicken:ginger], [cardamom, coconut, pumpkin])
2 ([cinnamon, ginger, soy_sauce, chicken:onion], [cardamom, coconut, pumpkin])
3 ([cinnamon, ginger, onion, chicken:soy_sauce], [cardamom, coconut, pumpkin])
4 ([chicken, onion, soy_sauce, cinnamon:ginger], [cardamom, coconut, pumpkin])
5 ([chicken, ginger, soy_sauce, cinnamon:onion], [cardamom, coconut, pumpkin])
6 ([chicken, ginger, onion, cinnamon:soy_sauce], [cardamom, coconut, pumpkin])
7 ([chicken, cinnamon, soy_sauce, ginger:onion], [cardamom, coconut, pumpkin])
8 ([chicken, cinnamon, onion, ginger:soy_sauce], [cardamom, coconut, pumpkin])
9 ([chicken, cinnamon, ginger, onion:soy_sauce], [cardamom, coconut, pumpkin])
10 ([pumpkin, cardamom:coconut], [chicken, cinnamon, ginger, onion, soy_sauce])
11 ([coconut, cardamom:pumpkin], [chicken, cinnamon, ginger, onion, soy_sauce])
12 ([cardamom, coconut:pumpkin], [chicken, cinnamon, ginger, onion, soy_sauce])
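The first element of each output tuple can also be produced lazily, one row at a time, without building a DataFrame. The following is only a sketch under that assumption; `merged_rows` is a hypothetical helper name, not something from the code above.

```python
from itertools import combinations

def merged_rows(row):
    """Yield one copy of `row` per pair, with that pair joined as 'x:y'."""
    for pair in combinations(row, 2):
        rest = [item for item in row if item not in pair]
        yield rest + [":".join(pair)]

row = ["cardamom", "coconut", "pumpkin"]
out = list(merged_rows(row))

for merged in out:
    print(merged)
```

Because it is a generator, each merged row can be written to disk or processed as it is produced instead of being accumulated in memory.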