I have a dataframe of the following format.
df

   A  B  Target
   5  4       3
   1  3       4
I am finding the correlation of each column (except Target) with the Target column using pd.DataFrame(df.corr().iloc[:-1,-1]).
The issue is that my actual DataFrame has shape (216, 72391), which takes at least 30 minutes to process on my system. Is there any way to parallelize this, for example on a GPU? I need to compute values of this kind multiple times, so I can't wait 30 minutes for each run.
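For reference, a minimal sketch of the current approach, with random dummy data of the same shape as my real DataFrame:

import numpy as np
import pandas as pd

# Dummy data matching the real shape: 216 rows, 72391 columns (incl. Target)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(216, 72390)))
df['Target'] = rng.integers(0, 100, size=216)

# df.corr() builds the full 72391 x 72391 correlation matrix,
# then only its last column (each column vs. Target) is kept,
# which is why this is so slow at this column count
correlations = pd.DataFrame(df.corr().iloc[:-1, -1])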
Here, I have tried to implement your operation using numba:
import numpy as np
import pandas as pd
from numba import jit, prange

#
# ------------You can ignore the code starting from here---------
#
# Create a random df with cols_size = 72391 and row_size = 300
df_dict = {}
for i in range(0, 72391):
    df_dict[i] = np.random.randint(100, size=300)
df = pd.DataFrame(df_dict)
# Attach a dummy target column so df mirrors the real DataFrame
df['target'] = np.random.randint(100, size=300)
# ----------Ignore code till here. This is just to generate dummy data-------

# Assume df is your original DataFrame
target_array = df['target'].values
# You can choose to restore this column later,
# but for now we remove it, since we will
# call df.values and find the correlation of each
# column with the target
df.drop(['target'], inplace=True, axis=1)

# This function takes a 2D array and a target array as input.
# The 2D array holds the data of all columns, one column per row:
# it is df.values transposed, so its shape is (72391, 300),
# while the target array's shape is (300,).
def do_stuff(df_values, target_arr):
    # Preallocate the result array;
    # df_values.shape[0] = 72391, equal to the number of columns in df
    result = np.empty(df_values.shape[0])
    # prange lets numba distribute the iterations across threads
    for i in prange(df_values.shape[0]):
        # Correlation of one column with the target column;
        # both arguments are 1D arrays of length 300
        result[i] = np.corrcoef(df_values[i], target_arr)[0, 1]
    return result

# Decorate the function do_stuff
do_stuff_numba = jit(nopython=True, parallel=True)(do_stuff)

# This contains all the correlations;
# df.T.values has shape (72391, 300), one row per original column
result_array = do_stuff_numba(df.T.values, target_array)
Link to colab notebook.
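A note on why the original one-liner is slow: df.corr() computes the full 72391 x 72391 correlation matrix even though only one column of it is needed. If numba is not an option, the same column-vs-target correlations can also be obtained in a single vectorized NumPy pass; a minimal sketch, reusing df and target_array from the code above:

# Pearson correlation of every column with the target, vectorized.
# X has shape (300, 72391); center each column and the target,
# then one matrix-vector product yields all coefficients at once.
X = df.values.astype(np.float64)
t = target_array.astype(np.float64)

Xc = X - X.mean(axis=0)   # center each column
tc = t - t.mean()         # center the target

# r = (Xc^T tc) / (||Xc_col|| * ||tc||), for all columns simultaneously
# (a column with zero variance would produce a division by zero here)
result_array = (Xc.T @ tc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(tc))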
You should take a look at dask. It should be able to do what you want and a lot more.
It parallelizes most of the DataFrame functions.
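A minimal sketch of what that could look like for the question's DataFrame (the npartitions value is an arbitrary assumption; tune it to your core count):

import dask.dataframe as dd

# Wrap the pandas DataFrame; dask splits it into partitions
# that can be processed in parallel
ddf = dd.from_pandas(df, npartitions=8)

# Same operation as in the question, executed in parallel by dask;
# note that this still builds the full pairwise correlation matrix
correlations = ddf.corr().compute().iloc[:-1, -1]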