Understanding this Pandas script

2019-04-02 06:22发布

I received this code to group data into a histogram type data. I have been Attempting to understand the code in this pandas script in order to edit, manipulate and duplicate it. I have comments for the sections I understand.

Code

import numpy as np
import pandas as pd


column_names = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 
              'col7', 'col8', 'col9', 'col10', 'col11'] #names to be used as column labels.  If no names are specified then columns can be refereed to by number eg. df[0], df[1] etc.

df = pd.read_csv('data.csv', header=None, names=column_names) #header= None means there are no column headings in the  csv file

df.ix[df.col11 == 'x', 'col11']=-0.08 #trick so that 'x' rows will be grouped into a category >-0.1 and <= -0.05.  This will allow all of col11 to be treated as a numbers

bins = np.arange(-0.1, 1.0, 0.05) #bins to put col11 values in.  >-0.1 and <=-0.05 will be our special 'x' rows, >-0.05 and <=0 will capture all the '0' values.
labels = np.array(['%s:%s' % (x, y) for x, y in zip(bins[:-1], bins[1:])]) #create labels for the bins
labels[0] = 'x' #change first bin label to 'x'
labels[1] = '0' #change second bin label to '0'

df['col11'] = df['col11'].astype(float) #convert col11 to numbers so we can do math on them


df['bin'] = pd.cut(df['col11'], bins=bins, labels=False) # make another column 'bins' and put in an integer representing what bin the number falls into.Later we'll map the integer to the bin label


df.set_index('bin', inplace=True, drop=False, append=False) #groupby is meant to run faster with an index

def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x==1)

dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
dfg.index = labels[dfg.index]

dfg.ix['x',('col11', 'mean')]='N/A'
print(dfg)
dfg.to_csv('new.csv')

The section I really struggle to understand is in this section:

def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x==1)

dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
dfg.index = labels[dfg.index]

dfg.ix['x',('col11', 'mean')]='N/A'
print(dfg)
dfg.to_csv('new.csv')

If any one is able to comment this script I would be greatly appreciative. Also feel free to correct or add to my comments (these are what I assume so far they may not be correct). Im hoping this isnt too off topic for SOF. I will gladly give a 50 point bounty to any user who can help me with this.

1条回答
forever°为你锁心
2楼-- · 2019-04-02 07:14

I'll try and explain my code. As it uses a few tricks.

  • I've called it df to give a shorthand name for a pandas DataFrame
  • I've called it dfg to mean group my df.
  • Let me build up the expression dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})

    • the code dfg = df[['bin','col7','col11']] is saying take the columns named 'bin' 'col7' and 'col11' from my DataFrame df.
    • Now that I have the 3 columns I am interested in, I want to group by the values in the 'bin' column. This is done by dfg = df[['bin','col7','col11']].groupby('bin'). I now have groups of data i.e. all records that are in bin #1, all records in bin#2, etc.
    • I now want to apply some aggregate functions to the records in each of my bin groups( An aggregate funcitn is something like sum, mean or count).
    • Now I want to apply three aggregate functions to the records in each of my bins: the mean of 'col11', the number of records in each bin, and the number of records in each bin that have 'col7' equal to one. The mean is easy; numpy already has a function to calculate the mean. If I was just doing the mean of 'col11' I would write: dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean]}). The number of records is also easy; python's len function (It's not really a function but a property of lists etc.) will give us the number of items in list. So I now have dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [len]}). Now I can't think of an existing function that counts the number of ones in a numpy array (it has to work on a numpy array). I can define my own functions that work on a numpy array, hence my function count_ones.
    • Now I'll deconstruct the count_ones function. the varibale x passed to the function is always going to be a 1d numpy array. In our specific case it will be all the 'col7' values that fall in bin#1, all the 'col7' values that fall in bin#2 etc.. The code x==1 will create a boolean (TRUE/FALSE) array the same size as x. The entries in the boolean array will be True if the corresponding values in x are equal to 1 and false otherwise. Because python treats True as 1 if I sum the values of my boolean array I'll get a count of the values that ==1. Now that I have my count_ones function I apply it to 'col7' by: dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})

    • You can see that the syntax of the .agg is .agg({'column_name_to_apply_to': [list_of_function names_to_apply]}

    • With the boolean arrays you can do all sorts of wierd condition combinations (x==6) | (x==3) would be 'x equal to 6 or x equal to 3'. The 'and' operator is &. Always put () around each condition

  • Now to dfg.index = labels[dfg.index]. In dfg, because I grouped by 'bin', the index (or row label) of each row of grouped data (i.e. my dfg.index) will be my bin numbers:1,2,3, labels[dfg.index] is using fancy indexing of a numpy array. labels[0] would give me the first label, labels[3] would give me the 4th label. With normal python lists you can use slices to do labels[0:3] which would give me labels 0,1, and 2. With numpy arrays we can go a step further and just index with a list of values or another array so labels[np.array([0,2,4]) would give me labels 0,2,4. By using labels[dfg.index] I'm requesting the labels corresponding to the bin#. Basically I'm changng my bin number to bin label. I could have done that to my original data but that would be thousands of rows; by doing it after the group by I'm doing it to 21 rows or so. Note that I cannot just do dfg.index = labels as some of my bins might be empty and therefore not present in the group by data.

  • Now the dfg.ix['x',('col11', 'mean')]='N/A' part. Remember way back when I did df.ix[df.col11 == 'x', 'col11']=-0.08 that was so all my invalid data was treated as a number and would be placed into the 1st bin. after applying group by and aggregate functions the mean of 'col11' values in my first bin will be -0.08 (because all such values are -0.08). Now I know this not correct, all values of -0.08 actually indicate that the original value wsa x. You can't do a mean of x. So I manually put it to N/A. ie. dfg.ix['x',('col11', 'mean')]='N/A' means in dfg where index (or row) is 'x' and column is 'col11 mean') set the value to 'N/A'. the ('col11', 'mean') I believe is how pandas comes up with the aggreagate column names i.e. when I did .agg({'col11': [np.mean]}), to refer to the resulting aggregate column i need ('column_name', 'aggregate_function_name')

The motivation for all this was: convert all data to numbers so I can use the power of Pandas, then after processing, manually change any values that I know are garbage. Let me know if you need any more explanation.

查看更多
登录 后发表回答