I have a very large .csv (which originally came from a SAS dataset) that has the following columns:
target_series metric_series month metric_1 metric_2 target_metric
1 1 1 #float #float #float
1 1 2 #float #float #float
...
1 1 60 #float #float #float
1 2 1 #float #float #float
1 2 2 #float #float #float
...
1 80000 60 #float #float #float
2 1 1 #float #float #float
...
50 80000 60 #float #float #float
As you can see, the file has 60 months × 80,000 metric series × 50 target series worth of rows (240 million rows in total), and takes over 10 GB of space when saved as a .csv. What I need to do is compute and record the correlation of metric_1 and metric_2 with target_metric for each (target_series, metric_series) pair.
I have written the following code:
import pandas as pd
from datetime import datetime

data = pd.read_csv("data.csv")  # approximately 10 GB
output = []

for target_num in range(1, 51):  # 50 target series
    for metric_number in range(1, 80001):  # 80,000 metric series
        startTime = datetime.now()  # Begin the timer
        current_df = data[(data['target_series'] == target_num) & (data['metric_series'] == metric_number)]  # Select the current 60-month period that we want to perform the correlation on
        print('The process took: ' + str(datetime.now() - startTime) + ' hours:minutes:seconds to complete.')  # Stop the timer
        results_amount_target = current_df[['metric_1', 'target_metric']].corr()  # Perform metric_1 correlation
        results_count_target = current_df[['metric_2', 'target_metric']].corr()  # Perform metric_2 correlation
        output.append([target_num, metric_number, results_amount_target.iat[0, 1], results_count_target.iat[0, 1]])  # Record the correlations in a Python list that will be converted to a DataFrame later
The reason I have the datetime code in there is to find out why this process was taking so long. The timer is wrapped around the current_df line, which I found to be by far the slowest step after playing around with the placement of the timer.
I found that selecting part of the data with this line of code:
current_df = data[(data['target_series'] == target_num) & (data['metric_series'] == metric_number)]
takes 1.5 seconds each time. At 1.5 seconds for each of the 50 × 80,000 = 4,000,000 selections, that works out to roughly 70 days of runtime, which is far too slow. Clearly something needs to change!
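For reference, each of those boolean masks has to scan all 240 million rows. One thing I have not tried yet would be to set a sorted MultiIndex once up front so each pair becomes an index lookup rather than a full scan; this is only an untested sketch, not code I have run:

# Untested sketch: sort the data by a MultiIndex once, then look up each
# (target_series, metric_series) pair directly instead of re-scanning.
indexed = data.set_index(['target_series', 'metric_series']).sort_index()
current_df = indexed.loc[(target_num, metric_number)]  # the 60 rows for this pair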
I decided to try a different approach. Since I know that I want to iterate through the dataset 60 lines at a time (one chunk for each target_series and metric_series pair), I tried two things:

1. Read in the top 60 lines from data.csv, perform the correlations, and then read in the next 60 lines from data.csv with data = pd.read_csv('data.csv', nrows=60, skiprows=60), and so on (see the sketch after this list). While this is fast for the first part of the dataset, it gets unbearably slow as I have to keep skipping over more and more of the file: it took over 10 minutes to read in the last 60 lines of the dataset on my PC!

2. Read in the top 60 lines from the DataFrame stored in memory with something like data.head(60) and then delete those rows from the DataFrame with data = data.drop(data.head(60).index), but this was even slower!
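For clarity, the loop for the first approach looked roughly like the sketch below; the skiprows=range(1, offset + 1) detail, which keeps the header row, is a reconstruction rather than the exact code I ran:

# Rough sketch of approach 1: read 60 rows at a time, re-skipping everything
# that has already been processed (row 0, the header, is always kept).
n_rows_total = 50 * 80000 * 60
for offset in range(0, n_rows_total, 60):
    chunk = pd.read_csv('data.csv', nrows=60, skiprows=range(1, offset + 1))
    # ... perform the correlations on chunk, as in the loop above ...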
At this point, I'm exploring using HDFStore or h5py to move the dataset from .csv to a .h5 file, but I am unsure how to proceed. The computer I'm doing this analysis on only has 16 GB of memory, and in the future I can expect to be working with data that is even larger than this file.
What's the best way to solve this problem, and how can I prepare to work with even larger data in Python / Pandas?
UPDATE
Thanks to filmor, I have rewritten my code to be the following:
import pandas as pd
from datetime import datetime

data = pd.read_csv("data.csv", chunksize=60)  # data is now an iterable of 60-row chunks
output = []

for chunk in data:
    results_amount_target = chunk[['metric_1', 'target_metric']].corr()  # Perform metric_1 correlation
    results_count_target = chunk[['metric_2', 'target_metric']].corr()  # Perform metric_2 correlation
    output.append([chunk['target_series'].iloc[0], chunk['metric_series'].iloc[0], results_amount_target.iat[0, 1], results_count_target.iat[0, 1]])  # Record the correlations in a Python list that will be converted to a DataFrame later
This is now super fast and memory light! I would still appreciate it if someone could walk me through how to do this with HDFStore or .h5 files.
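For completeness, the conversion of output to a DataFrame mentioned in the code comments would just be something like this (the column names are my own choice):

output_df = pd.DataFrame(output, columns=['target_series', 'metric_series', 'metric_1_corr', 'metric_2_corr'])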