I received this code for grouping data into histogram-style bins. I have been attempting to understand the code in this pandas script so that I can edit, manipulate and duplicate it. I have added comments for the sections I understand.
Code
import numpy as np
import pandas as pd
column_names = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6',
'col7', 'col8', 'col9', 'col10', 'col11'] #names to be used as column labels. If no names are specified then columns can be referred to by number e.g. df[0], df[1] etc.
df = pd.read_csv('data.csv', header=None, names=column_names) #header= None means there are no column headings in the csv file
df.loc[df.col11 == 'x', 'col11'] = -0.08 #trick so that 'x' rows will be grouped into a category >-0.1 and <=-0.05. This allows all of col11 to be treated as numbers
bins = np.arange(-0.1, 1.0, 0.05) #bins to put col11 values in. >-0.1 and <=-0.05 will be our special 'x' rows, >-0.05 and <=0 will capture all the '0' values.
labels = np.array(['%s:%s' % (x, y) for x, y in zip(bins[:-1], bins[1:])]) #create labels for the bins
labels[0] = 'x' #change first bin label to 'x'
labels[1] = '0' #change second bin label to '0'
df['col11'] = df['col11'].astype(float) #convert col11 to numbers so we can do math on them
df['bin'] = pd.cut(df['col11'], bins=bins, labels=False) #make another column 'bin' and put in an integer representing which bin the number falls into. Later we'll map the integer to the bin label
df.set_index('bin', inplace=True, drop=False, append=False) #groupby is meant to run faster with an index
def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x == 1)
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
dfg.index = labels[dfg.index]
dfg.loc['x', ('col11', 'mean')] = 'N/A'
print(dfg)
dfg.to_csv('new.csv')
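To check my understanding of what the bins/labels/pd.cut lines produce, here is a tiny example with made-up values (not my real data.csv):
import numpy as np
import pandas as pd

bins = np.arange(-0.1, 1.0, 0.05)             # same bin edges as in the script
vals = pd.Series([-0.08, -0.01, 0.02, 0.07])  # made-up stand-ins for col11 values
print(pd.cut(vals, bins=bins, labels=False))  # prints the bin numbers 0, 1, 2, 3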
The section I really struggle to understand is this one:
def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x == 1)
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
dfg.index = labels[dfg.index]
dfg.loc['x', ('col11', 'mean')] = 'N/A'
print(dfg)
dfg.to_csv('new.csv')
If anyone is able to comment this script I would be greatly appreciative. Also feel free to correct or add to my comments (these are what I assume so far, so they may not be correct). I'm hoping this isn't too off topic for SOF. I will gladly give a 50 point bounty to any user who can help me with this.
I'll try and explain my code, as it uses a few tricks. I use df as a shorthand name for a pandas DataFrame, and dfg to mean 'group my df'. Let me build up the expression
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
dfg = df[['bin','col7','col11']]
is saying: take the columns named 'bin', 'col7' and 'col11' from my DataFrame df.
dfg = df[['bin','col7','col11']].groupby('bin')
is saying: group those rows by the value in the 'bin' column. I now have groups of data, i.e. all records that are in bin #1, all records in bin #2, etc.
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean]})
is saying: for each group, apply np.mean to the 'col11' values. The number of records in each group is also easy; python's built-in len function will give us the number of items in each group. So I now have
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [len]})
Now I can't think of an existing function that counts the number of ones in a numpy array (it has to work on a numpy array). I can define my own functions that work on a numpy array, hence my function count_ones.
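If it helps, here is the same build-up applied to a tiny made-up DataFrame (the values are invented purely for illustration):
import numpy as np
import pandas as pd

toy = pd.DataFrame({'bin':   [0, 0, 1, 1, 1],
                    'col7':  [1, 0, 1, 1, 0],
                    'col11': [-0.08, -0.08, 0.02, 0.03, 0.04]})
dfg = toy[['bin', 'col7', 'col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [len]})
print(dfg)   # one row per bin: the mean of col11 and the number of records in that bin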
Now I'll deconstruct the count_ones function. The variable x passed to the function is always going to be a 1d numpy array; in our specific case it will be all the 'col7' values that fall in bin #1, all the 'col7' values that fall in bin #2, etc. The code x == 1 will create a boolean (True/False) array the same size as x. The entries in the boolean array will be True if the corresponding values in x are equal to 1 and False otherwise. Because python treats True as 1, if I sum the values of my boolean array I get a count of the values that equal 1. Now that I have my count_ones function I apply it to 'col7' by:
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
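A tiny standalone illustration of what count_ones sees for one group (made-up values):
import numpy as np

x = np.array([1, 0, 1, 2, 1])    # e.g. the 'col7' values that landed in one bin
print(x == 1)                    # [ True False  True False  True]
print(np.sum(x == 1))            # 3 -- True counts as 1, so the sum counts the ones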
You can see that the syntax of .agg is .agg({'column_name_to_apply_to': [list_of_functions_to_apply]}).
With the boolean arrays you can do all sorts of weird condition combinations: (x==6) | (x==3) would be 'x equal to 6 or x equal to 3'. The 'and' operator is &. Always put () around each condition.
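For example, with made-up values:
import numpy as np

x = np.array([3, 6, 1, 6, 2])
mask = (x == 6) | (x == 3)   # element-wise 'or' of the two conditions
print(mask)                  # [ True  True False  True False]
print(np.sum(mask))          # 3 values are either 6 or 3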
Now to dfg.index = labels[dfg.index]. In dfg, because I grouped by 'bin', the index (or row label) of each row of grouped data (i.e. my dfg.index) will be my bin numbers: 1, 2, 3, ... labels[dfg.index] is using fancy indexing of a numpy array: labels[0] would give me the first label, labels[3] would give me the 4th label. With normal python lists you can use slices, so labels[0:3] would give me labels 0, 1 and 2. With numpy arrays we can go a step further and index with a list of values or another array, so labels[np.array([0,2,4])] would give me labels 0, 2 and 4. By using labels[dfg.index] I'm requesting the labels corresponding to the bin numbers; basically I'm changing each bin number to its bin label. I could have done that to my original data, but that would be thousands of rows; by doing it after the group by I'm doing it to 21 rows or so. Note that I cannot just do dfg.index = labels, as some of my bins might be empty and therefore not present in the grouped data.
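A quick made-up illustration of that fancy indexing:
import numpy as np

labels = np.array(['x', '0', '0.0:0.05', '0.05:0.1', '0.1:0.15'])
present_bins = np.array([0, 2, 4])   # pretend only these bin numbers appear in dfg.index
print(labels[present_bins])          # ['x' '0.0:0.05' '0.1:0.15'] -- one label per bin number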
Now the dfg.loc['x', ('col11', 'mean')] = 'N/A' part. Remember way back when I did df.loc[df.col11 == 'x', 'col11'] = -0.08; that was so all my invalid data would be treated as a number and would be placed into the 1st bin. After applying the group by and aggregate functions, the mean of the 'col11' values in my first bin will be -0.08 (because all such values are -0.08). Now I know this is not correct: every value of -0.08 actually indicates that the original value was 'x', and you can't take the mean of 'x'. So I manually set it to 'N/A', i.e. dfg.loc['x', ('col11', 'mean')] = 'N/A' means: in dfg, where the index (or row) is 'x' and the column is ('col11', 'mean'), set the value to 'N/A'. The ('col11', 'mean') I believe is how pandas comes up with the aggregate column names, i.e. when I did .agg({'col11': [np.mean]}), to refer to the resulting aggregate column I need ('column_name', 'aggregate_function_name').
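Continuing the tiny made-up DataFrame from earlier, you can see those column names and address one of them like this:
import numpy as np
import pandas as pd

toy = pd.DataFrame({'bin':   [0, 0, 1],
                    'col7':  [1, 0, 1],
                    'col11': [-0.08, -0.08, 0.02]})
dfg = toy.groupby('bin').agg({'col11': [np.mean], 'col7': [len]})
print(dfg.columns.tolist())           # [('col11', 'mean'), ('col7', 'len')] -- MultiIndex columns
print(dfg.loc[0, ('col11', 'mean')])  # -0.08, the 'col11 mean' value for the row labelled 0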
The motivation for all this was: convert all data to numbers so I can use the power of Pandas, then after processing, manually change any values that I know are garbage. Let me know if you need any more explanation.