Our data output folders contain a variable number of .csv files, each associated with an .xml file that holds all the recording parameters. Each .csv file represents one "sweep" of recording data, so I'm trying to figure out how to combine all of these files into one large multi-indexed (Sweep# and Time) dataframe for processing (we usually look at an entire set of sweeps at once and compute average values).
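For concreteness, the combined frame I'm after looks something like this toy example (the channel name and values here are made up - the real files have a variable number of channels):

import pandas as pd

# toy version of the target frame: (Sweep, Time) MultiIndex, one column per recorded channel
idx = pd.MultiIndex.from_product([['Sweep1', 'Sweep2'], [0.0001, 0.0002]],
                                 names=['Sweep', 'Time'])
combined = pd.DataFrame({'Ch1': [-65.2, -65.1, -64.9, -64.8]}, index=idx)
print(combined)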
So far I have the following two functions. The first one just makes some minor modifications to the dataframe to make it more manageable down the road.
import pandas as pd

def import_pcsv(filename):
    df = pd.read_csv(filename, skipinitialspace=True)
    # the first column is the time axis; rename it, rescale it, and use it as the index
    df.rename(columns={df.columns[0]: 'Time'}, inplace=True)
    df.Time = df.Time / 1000
    df.set_index('Time', inplace=True)
    return df
This second one is the real workhorse for parsing the folder. It grabs all the xml files in the folder, parses them (with another function I put together in another module), and then imports the associated csv files into one large dataframe.
from glob import glob
import pandas as pd
import pxml  # my own module with the xml parsing function

def import_pfolder(folder):
    # grab every voltage-recording xml in the folder
    vr_xmls = glob(folder + r'\*VoltageRecording*.xml')
    data = pd.DataFrame()
    counter = 1
    # parse each xml, import its csv, and tack the sweep onto the big dataframe
    for file in vr_xmls:
        file_vals = pxml.parse_vr(file)
        df = import_pcsv(folder + '\\' + file_vals['voltage recording'] + '.csv')
        df['Sweep'] = 'Sweep' + str(counter)
        df.set_index('Sweep', append=True, inplace=True)
        data = data.append(df.reorder_levels(['Sweep', 'Time']))
        counter += 1
    return data
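(For context, the whole point of the (Sweep, Time) MultiIndex is that averaging across sweeps then becomes a one-liner, something like the snippet below - avg is just a placeholder name.)

# average every time point across all sweeps
avg = data.groupby(level='Time').mean()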
The problem is that this gets really slow when there are a large number of files in the folder. The first function is essentially as fast as the normal pandas read_csv function (it's a few ms slower, but that's fine).
I ran some timing tests for different numbers of csv/xml file pairs in the folder. The %time for each is:
1 file = 339 ms
5 files = 2.61 sec
10 files = 7.53 sec
20 files = 24.7 sec
40 files = 87 sec
That last one is a real killer.
In trying to figure this out I also got some time stats on each line of the for loop in import_pfolder() - the time in parentheses is the best time from %timeit:
1st line (pxml.parse_vr) = 2 ms (614 µs)
2nd line (import_pcsv) = 98 ms (82.2 ms)
3rd line (df['Sweep'] = ...) = 21 ms (10.8 ms)
4th line (set_index) = 49 ms
5th line (data.append) = 225 ms
My guess is that the slowdown comes from having to copy the entire accumulated dataframe in that last append line on every pass through the loop - the total time grows much faster than linearly with the number of files, which fits that picture. I'm having trouble figuring out how to avoid it, though. The only column I know for sure will be in the .csv files is the first one (Time); beyond that, the files can have a variable number of columns and rows. Is there a way to preallocate a dataframe beforehand that takes that variability into account? Would that even help?
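One alternative I've been considering (just a rough sketch, reusing import_pcsv and pxml.parse_vr from above, and I haven't checked that it really produces identical output) is to collect the individual sweeps in a plain list and build the big dataframe once at the end with pd.concat, instead of appending inside the loop:

def import_pfolder_concat(folder):
    vr_xmls = glob(folder + r'\*VoltageRecording*.xml')
    frames = []
    for counter, file in enumerate(vr_xmls, start=1):
        file_vals = pxml.parse_vr(file)
        df = import_pcsv(folder + '\\' + file_vals['voltage recording'] + '.csv')
        df['Sweep'] = 'Sweep' + str(counter)
        df.set_index('Sweep', append=True, inplace=True)
        frames.append(df.reorder_levels(['Sweep', 'Time']))
    # one concat at the end instead of copying the whole frame every iteration
    return pd.concat(frames)

I've also noticed that pd.concat takes a keys argument that could build the outer 'Sweep' level directly (which might let me drop the set_index/reorder_levels steps), but I haven't tried that yet.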
Any suggestions would be appreciated.
Thanks