Working with 10+GB dataset in Python Pandas

Published 2020-07-23 05:08

Question:

I have a very large .csv (which originally came from a SAS dataset) that has the following columns:

target_series  metric_series        month   metric_1  metric_2  target_metric
1              1                    1       #float    #float    #float
1              1                    2       #float    #float    #float
...
1              1                    60      #float    #float    #float
1              2                    1       #float    #float    #float
1              2                    2       #float    #float    #float
...
1              80000                60      #float    #float    #float
2              1                    1       #float    #float    #float
...
50             80000                60      #float    #float    #float

As you can see, the file has 60 months × 80,000 independent (metric) series × 50 target series worth of rows (240 million rows in total), and takes over 10 GB of space when saved as a .csv. What I need to do is compute and record the correlation of metric_1 and of metric_2 with target_metric for each target_series / metric_series pair.

I have written the following code:

import pandas as pd
from datetime import datetime

data = pd.read_csv("data.csv")  # approximately 10 GB 
output = []

for target_num in range(1, 51):            # target_series runs from 1 to 50
    for metric_number in range(1, 80001):  # metric_series runs from 1 to 80000
        startTime = datetime.now()  # Begin the timer
        current_df = data[(data['target_series'] == target_num) & (data['metric_series'] == metric_number)]  # Select the current 60-month window that we want to perform the correlations on
        print('The process took: '+str(datetime.now() - startTime)+' hours:minutes:seconds to complete.')  # Stop the timer
        results_amount_target = current_df[['metric_1','target_metric']].corr()  # Perform the metric_1 correlation
        results_count_target = current_df[['metric_2','target_metric']].corr()  # Perform the metric_2 correlation

        output.append([target_num, metric_number, results_amount_target.iat[0,1], results_count_target.iat[0,1]])  # Record the correlations in a Python list that will be converted to a DataFrame later

The reason I have the datetime code in there is to find out why this process takes so long. The timer is wrapped around the current_df line, which is by far the slowest part (I played around with the placement of the timer to figure out where the time was going).

I found that selecting part of the data with this line of code:

current_df = data[(data['target_series'] == target_num) & (data['metric_series'] == metric_number)]

takes about 1.5 seconds each time. That is far too slow: with 50 × 80,000 = 4,000,000 selections to make, the selection step alone would take roughly 70 days. Clearly something needs to change!

I decided to try a different approach. Since I know I want to iterate through the dataset 60 lines at a time (one block per target_series / metric_series pair), I tried one of two things:

  1. Read in the top 60 lines from data.csv, perform the correlations, and then read in the next 60 lines with code like data = pd.read_csv('data.csv', nrows=60, skiprows=60), increasing skiprows on each pass (see the sketch after this list). While this is fast for the first part of the dataset, it gets unbearably slow because every call has to skip over everything read so far: reading the last 60 lines of the dataset took over 10 minutes on my PC!

  2. Read in the top 60 lines from the DataFrame stored in memory with something like data.head(60), and then delete those rows with data = data.drop(data.head(60).index), but this was even slower!
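
Approach 1 boils down to something like the rough sketch below (not my exact code; the column names come from the table above and have to be supplied explicitly because the header row gets skipped). The slowdown comes from the fact that every read_csv call has to re-parse the file from the top just to skip the rows already processed:

import pandas as pd

cols = ['target_series', 'metric_series', 'month',
        'metric_1', 'metric_2', 'target_metric']

offset = 0
while True:
    try:
        block = pd.read_csv('data.csv', names=cols, header=None,
                            skiprows=1 + offset, nrows=60)  # skip the header plus everything already read
    except pd.errors.EmptyDataError:  # nothing left to read
        break
    if block.empty:
        break
    # ... perform the two correlations on `block` here ...
    offset += 60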

At this point, I'm exploring using HDFStore or h5py to move the dataset from .csv to a .h5, but I am unsure how to proceed. The computer I'm doing this analysis on only has 16 GB of memory, and in the future I can expect to be working with data that is even larger than this file.
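
From what I have read so far, one possible way to proceed (I have not verified that this is the right approach) would be to stream the CSV into an HDF5 table once, declaring the two series columns as data_columns so they can be queried later. The file name data.h5, the chunk size, and the compression settings below are just guesses:

import pandas as pd

# Stream the CSV into an HDF5 table once; requires the PyTables package.
with pd.HDFStore('data.h5', mode='w', complevel=9, complib='blosc') as store:
    for chunk in pd.read_csv('data.csv', chunksize=1_000_000):
        store.append('data', chunk, format='table',
                     data_columns=['target_series', 'metric_series'])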

What's the best way to solve this problem, and how can I prepare to work with even larger data in Python / Pandas?

UPDATE

Thanks to filmor, I have rewritten my code to be the following:

import pandas as pd
from datetime import datetime

data = pd.read_csv("data.csv", chunksize=60) # data is now an iterable
output = []

for chunk in data:
    results_amount_target = chunk[['metric_1','target_metric']].corr()  # Perform the metric_1 correlation
    results_count_target = chunk[['metric_2','target_metric']].corr()  # Perform the metric_2 correlation

    output.append([chunk['target_series'].iloc[0], chunk['metric_series'].iloc[0], results_amount_target.iat[0,1], results_count_target.iat[0,1]])  # Record the correlations in a Python list that will be converted to a DataFrame later (.iloc[0] because the index labels keep increasing across chunks)
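
Converting the accumulated list into a DataFrame at the end is then just (the output column names are my own choice):

output_df = pd.DataFrame(
    output,
    columns=['target_series', 'metric_series', 'corr_metric_1', 'corr_metric_2'])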

This is now super fast and memory-light! I would still appreciate it if someone could walk me through how to do this with HDFStore or .h5 files.
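
In case it helps anyone answering the HDFStore part: my current understanding is that, once the data is stored as an HDF5 table with data_columns=['target_series', 'metric_series'] (as in the sketch above), a single 60-row block can be pulled out with a where query instead of a boolean mask over the whole frame. Something like this (the query syntax is my best guess):

import pandas as pd

# Pull one (target_series, metric_series) block out of the HDF5 table;
# the where filter only works because those columns were declared as data_columns.
current_df = pd.read_hdf(
    'data.h5', 'data',
    where='(target_series == 1) & (metric_series == 1)',
    columns=['metric_1', 'metric_2', 'target_metric'])
corr_1 = current_df[['metric_1', 'target_metric']].corr().iat[0, 1]
corr_2 = current_df[['metric_2', 'target_metric']].corr().iat[0, 1]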