Speeding up data-import function (Pandas and appending to DataFrame)

Posted 2020-07-27 23:36

Question:

Our data output folders contain a variable number of .csv files, each associated with an .xml file that holds all the different recording parameters. Each .csv file represents one "sweep" of recording data, so I'm currently trying to figure out how to combine all of these files into one large multiindexed (Sweep# and Time) dataframe for processing (since we usually look at an entire set of sweeps at once and compute average values).
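To illustrate the target structure, here is a toy sketch (the channel names are made up; only the (Sweep, Time) index layout matters): a single sweep can be pulled out with .loc, and the per-time-point average across sweeps comes from a groupby on the Time level.

import numpy as np
import pandas as pd

# Toy stand-in for the combined frame: a (Sweep, Time) MultiIndex with one
# column per recorded channel (channel names here are invented)
idx = pd.MultiIndex.from_product(
    [['Sweep1', 'Sweep2'], [0.000, 0.001, 0.002]],
    names=['Sweep', 'Time'])
data = pd.DataFrame(np.random.randn(6, 2), index=idx,
                    columns=['Primary', 'Secondary'])

sweep1 = data.loc['Sweep1']                     # all samples from one sweep
mean_trace = data.groupby(level='Time').mean()  # average across sweeps at each time point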

I so far have the following two functions. The first one just does some minor modifications to the dataframe to make it more manageable down the road.

import pandas as pd

def import_pcsv(filename):
    df = pd.read_csv(filename, skipinitialspace=True)
    # Rename the first (unnamed) column to 'Time' and use it as the index
    df.rename(columns={df.columns[0]: 'Time'}, inplace=True)
    df.Time = df.Time / 1000  # rescale the time values
    df.set_index('Time', inplace=True)
    return df

This second one is the real workhorse for parsing the folder. It grabs all the xml files in the folder, parses them (with another function I put together in another module), and then imports the associated csv files into one large dataframe.

from glob import glob

def import_pfolder(folder):
    # pxml is a separate custom module that parses the VoltageRecording .xml files
    vr_xmls = glob(folder + r'\*VoltageRecording*.xml')
    data = pd.DataFrame()
    counter = 1

    for file in vr_xmls:
        file_vals = pxml.parse_vr(file)
        df = import_pcsv(folder + '\\' + file_vals['voltage recording'] + '.csv')
        df['Sweep'] = 'Sweep' + str(counter)
        df.set_index('Sweep', append=True, inplace=True)
        data = data.append(df.reorder_levels(['Sweep', 'Time']))
        counter += 1

    return data

The problem is that this gets really slow when there are a large number of files in the folder. The first function is essentially as fast as the normal pandas read_csv function (it's a few ms slower, but that's fine).

I ran some timing tests for different numbers of csv/xml file pairs in the folder. The %time for each is:

1 file = 339 ms
5 files = 2.61 sec
10 files = 7.53 sec
20 files = 24.7 sec
40 files = 87 sec

That last one is a real killer.

In trying to figure this out I also got time stats for each line of the for loop in import_pfolder() (the times in parentheses are the best times from %timeit):

1st line (pxml.parse_vr) = 2 ms (614 µs)
2nd line (import_pcsv) = 98 ms (82.2 ms)
3rd line (adding the Sweep column) = 21 ms (10.8 ms)
4th line (set_index) = 49 ms
5th line (data.append) = 225 ms

I'm guessing the slowdown comes from having to copy the entire dataframe on the last line of the loop every time through. I'm having trouble figuring out how to avoid that, though. The only column I know for sure is in the .csv files is the first one (Time); beyond that, the files can have a variable number of columns and rows. Is there a way to preallocate a dataframe beforehand that takes that variability into account? Would that even help?

Any suggestions would be appreciated.

Thanks

Answer 1:

Don't append DataFrames like that (and don't start with an empty one): each append copies all of the data accumulated so far, so the total work grows quadratically with the number of files. Instead, collect the pieces in a list and concatenate them once at the end. That results in a single copy and constant per-file appending cost. See the pandas documentation on concat for details.

Instead:

frames = []

for f in files:
    frames.append(process_your_file(f))

result = pd.concat(frames)
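Applied to import_pfolder from the question, the same pattern could look roughly like the sketch below (untested against the real data; it assumes pxml.parse_vr and import_pcsv behave exactly as in the question, and it uses pd.concat's keys argument to build the Sweep level of the MultiIndex instead of adding a column and reordering levels for every file):

from glob import glob

import pandas as pd

def import_pfolder(folder):
    vr_xmls = glob(folder + r'\*VoltageRecording*.xml')

    frames = []   # one DataFrame per sweep, each indexed by Time
    keys = []     # matching 'Sweep1', 'Sweep2', ... labels
    for counter, file in enumerate(vr_xmls, start=1):
        file_vals = pxml.parse_vr(file)   # pxml: the custom .xml parser from the question
        frames.append(import_pcsv(folder + '\\' + file_vals['voltage recording'] + '.csv'))
        keys.append('Sweep' + str(counter))

    # A single concat at the end builds the (Sweep, Time) MultiIndex in one copy
    return pd.concat(frames, keys=keys, names=['Sweep'])

Building the Sweep level through keys also drops the per-file set_index and reorder_levels calls, so the loop does nothing but read and parse.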