I have many dataframes (csv files) located in various folders within my documents on my computer. All csv files have the same columns, with the same names in the same order, but the number of rows of data varies across the files. All files have "Datetime" as their first column, with "YYYY-MM-DD HH:MM:SS" values recorded down the rows.
I want to resample all dataframes to create hourly, daily and monthly data from the original files. Some columns within the csv files require the data to be summed over the hour, day and month, whereas other columns require averaging the data over these time frames (consistent across all csv files). I wish for each original csv file to have an hourly, daily and monthly version of itself.
All csv files exist under the following inputpath, but are actually located in subfolders within this inputpath:
inputpath = 'C:/Users/cp_vm/Documents/Data/Input/Clean/'
For instance, within the folder "Clean" there are two more folders, and within those folders are others that contain all the csv files. I want to read in every csv file below the "Clean" folder, without having to write multiple 'inputpaths', and resample the files as described above.
I then want to export these resampled dataframes to the folders that they came from and add either "hour", "day", or "month" to their names.
A snippet example of the csv files is available below.
For example, I want to resample the 15-minute interval data to hourly, daily and monthly data. The first two columns need to be summed when resampled, whereas the third column needs to be averaged over the resampled time frame. I understand that df.resample('h').sum() will sum the data over the hour and *.mean() will average it, but it gets tricky when different columns require different aggregations, the aggregated columns then need to be combined again, and the hourly, daily or monthly file has to be exported to its original location with the file name automatically changed to show that it is now hourly, daily or monthly.
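To show what I mean, for a single file I believe the mixed aggregation itself can be expressed with agg - a sketch, using the column names from my data, with 'example.csv' standing in for one of my files:

import pandas as pd

# read one file, using the Datetime column as the index
df = pd.read_csv('example.csv', parse_dates=['Datetime'], index_col='Datetime')

# the first two data columns are summed, the third is averaged
agg_map = {'Egen1_NotCum_kWh': 'sum',
           'Egen2_NotCum_kWh': 'sum',
           'Egen3_NotCum_kWh': 'mean'}

hourly = df.resample('h').agg(agg_map)
daily = df.resample('D').agg(agg_map)
monthly = df.resample('M').agg(agg_map)  # calendar month ('ME' in newer pandas)

But I can't work out how to do this across every file in every subfolder and export each result with the right name.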
I realise this is quite descriptive, but it's causing me all sorts of delays and relocating the csv files will affect another python script that I've already written. Any suggestions would be greatly appreciated. Thank you in advance for any help you can provide.
It's hard to tell where your problem really is :)
But Python has os.walk for walking a directory tree; let me give you an example:
import os

root_directory = '/home/xyz/some_root_dir/'

def is_csv(fname):
    return fname.endswith('.csv')

csv_files = []

for directory, subdirectories, file_names in os.walk(root_directory):
    for fname in file_names:
        if is_csv(fname):
            csv_files.append({
                'directory': directory,
                'fname': fname,
            })

print(csv_files)
And this is the result in my test case:
[
    {'directory': '/home/xyz/some_root_dir', 'fname': 'my.csv'},
    {'directory': '/home/xyz/some_root_dir/test2/test31', 'fname': 'myohter3.csv'},
    {'directory': '/home/xyz/some_root_dir/test2/test31', 'fname': 'myohter.csv'}
]
This should take care of collecting all the csv files - you can adapt the is_csv function to your needs. I am not able to help you with aggregating the data :)
But once you have read all the data in, it shouldn't be much of a problem.
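As a side note, writing each resampled file back to the folder it came from with "hour", "day" or "month" in its name is just string work on the entries collected above - a small sketch (the exact naming scheme, my_hour.csv etc., is my assumption):

import os

def output_path(directory, fname, suffix):
    # suffix is 'hour', 'day' or 'month';
    # e.g. ('/some/dir', 'my.csv', 'hour') -> '/some/dir/my_hour.csv'
    base, ext = os.path.splitext(fname)
    return os.path.join(directory, '{}_{}{}'.format(base, suffix, ext))

Since every entry in csv_files carries its source directory, the resampled output lands right next to the original file.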
OK, now the fun begins. I did this quickly - it can probably be written better, but it is a good starting point. We have the file list from the previous step, so let's take the next steps:
import csv
import os
from datetime import datetime

data = {}

# gather the data;
for fdata in csv_files:
    with open(os.path.join(fdata['directory'], fdata['fname']), 'r') as f:
        reader = csv.reader(f, delimiter='|', quotechar='"')
        rows = list(reader)
    # keep a plain list of rows per file - keying by datetime here could
    # lose rows that share a timestamp
    data[fdata['fname']] = rows
# ok, we now have the data in this format:
# {
#     'other3.csv': [
#         ['Datetime', 'Egen1_NotCum_kWh', 'Egen2_NotCum_kWh', 'Egen3_NotCum_kWh'],
#         ['2016-09-04 13:45:00', '643.23', '649', '654'],
#         ['2016-09-04 14:00:00', '612.21', '672', '666'],
#         ['2016-09-04 14:15:00', '721.3', '719', '719'],
#         ['2016-09-04 14:30:00', '730', '721', '725'],
#         ['2016-09-04 14:45:00', '745', '725', '731']],
#     'my.csv': ...
# }
# convert the string timestamps to python datetime objects
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"

for fname, inner_data in data.items():
    for row in inner_data[1:]:  # skip the header row
        row[0] = datetime.strptime(row[0], DATETIME_FORMAT)
# now the aggregates;
def get_all_rows_in_dates(start_date, end_date, data):
    # collect every data row (from all files) whose timestamp falls
    # inside [start_date, end_date)
    headers = data[next(iter(data))][0]
    data_rows = []
    for fname, inner_data in data.items():
        for row in inner_data[1:]:  # skip the header
            if start_date <= row[0] < end_date:
                data_rows.append(row)
    return headers, data_rows
def aggregate_col_12(values):
    # columns 1 and 2 are summed over the period
    return sum(float(v) for v in values)

def aggregate_col_3(values):
    # column 3 is averaged over the period
    values = [float(v) for v in values]
    return sum(values) / len(values)
def count_aggregates(rows_in_dates, start_date):
    col1 = []
    col2 = []
    col3 = []
    # the header has already been stripped off by get_all_rows_in_dates,
    # so every row here is a data row
    for row in rows_in_dates:
        col1.append(row[1])
        col2.append(row[2])
        col3.append(row[3])
    return [start_date.strftime(DATETIME_FORMAT),
            aggregate_col_12(col1), aggregate_col_12(col2), aggregate_col_3(col3)]
def write_results(headers, aggregate, fname):
    # newline='' stops the csv module from inserting blank lines on Windows
    with open(fname, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='|', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writerows([headers, aggregate])
start_date = datetime(2016, 9, 4, 13, 0, 0)
end_date = datetime(2016, 9, 4, 14, 0, 0)

headers, to_aggregate = get_all_rows_in_dates(start_date, end_date, data)
aggregates = count_aggregates(to_aggregate, start_date)

# note: the ':' characters produced by %H:%M:%S are not allowed in Windows
# file names, so adjust the output name format if you are on Windows
write_results(headers, aggregates, 'from_{}_to_{}.csv'.format(
    start_date.strftime(DATETIME_FORMAT),
    end_date.strftime(DATETIME_FORMAT),
))
Take care to use an appropriate delimiter and quotechar in your code.
And this is only the beginning - you can use it as a starting point. The daily aggregate should be achievable with this code, but if you want, for example, a csv with one aggregated row per hour over a whole day, you need to wrap it a little.
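For instance, a sketch of that wrapping, reusing the functions above (the date range and output file name are just placeholders):

from datetime import timedelta

# step through one day, an hour at a time, and aggregate each window
window_start = datetime(2016, 9, 4, 0, 0, 0)
day_end = window_start + timedelta(days=1)

hourly_rows = []
while window_start < day_end:
    window_end = window_start + timedelta(hours=1)
    headers, rows = get_all_rows_in_dates(window_start, window_end, data)
    if rows:  # only aggregate windows that actually contain data
        hourly_rows.append(count_aggregates(rows, window_start))
    window_start = window_end

# write all the hourly rows into a single csv
with open('hourly.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='|', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(headers)
    writer.writerows(hourly_rows)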
If you have any questions, please ask.