I have a dataframe containing a single column of IDs and all other columns are numerical values for which I want to compute z-scores. Here's a subsection of it:
ID Age BMI Risk Factor
PT 6 48 19.3 4
PT 8 43 20.9 NaN
PT 2 39 18.1 3
PT 9 41 19.5 NaN
Some of my columns contain NaN values which I do not want to include into the z-score calculations so I intend to use a solution offered to this question: how to zscore normalize pandas column with nans?
df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)
I'm interested in applying this solution to all of my columns except the ID column to produce a new dataframe which I can save as an Excel file using
df2.to_excel("Z-Scores.xlsx")
So basically; how can I compute z-scores for each column (ignoring NaN values) and push everything into a new dataframe?
SIDENOTE: there is a concept in pandas called "indexing" which intimidates me because I do not understand it well. If indexing is a crucial part of solving this problem, please dumb down your explanation of indexing.
Build a list from the columns and remove the column you don't want to calculate the Z score for:
In [66]:
cols = list(df.columns)
cols.remove('ID')
df[cols]
Out[66]:
Age BMI Risk Factor
0 6 48 19.3 4
1 8 43 20.9 NaN
2 2 39 18.1 3
3 9 41 19.5 NaN
In [68]:
# now iterate over the remaining columns and create a new zscore column
for col in cols:
col_zscore = col + '_zscore'
df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0)
df
Out[68]:
ID Age BMI Risk Factor Age_zscore BMI_zscore Risk_zscore \
0 PT 6 48 19.3 4 -0.093250 1.569614 -0.150946
1 PT 8 43 20.9 NaN 0.652753 0.074744 1.459148
2 PT 2 39 18.1 3 -1.585258 -1.121153 -1.358517
3 PT 9 41 19.5 NaN 1.025755 -0.523205 0.050315
Factor_zscore
0 1
1 NaN
2 -1
3 NaN
Using Scipy's zscore function:
df = pd.DataFrame(np.random.randint(100, 200, size=(5, 3)), columns=['A', 'B', 'C'])
df
| | A | B | C |
|---:|----:|----:|----:|
| 0 | 163 | 163 | 159 |
| 1 | 120 | 153 | 181 |
| 2 | 130 | 199 | 108 |
| 3 | 108 | 188 | 157 |
| 4 | 109 | 171 | 119 |
from scipy.stats import zscore
df.apply(zscore)
| | A | B | C |
|---:|----------:|----------:|----------:|
| 0 | 1.83447 | -0.708023 | 0.523362 |
| 1 | -0.297482 | -1.30804 | 1.3342 |
| 2 | 0.198321 | 1.45205 | -1.35632 |
| 3 | -0.892446 | 0.792025 | 0.449649 |
| 4 | -0.842866 | -0.228007 | -0.950897 |
If not all the columns of your data frame are numeric, then you can apply the Z-score function only to the numeric columns using the select_dtypes
function:
# Note that `select_dtypes` returns a data frame. We are selecting only the columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols].apply(zscore)
| | A | B | C |
|---:|----------:|----------:|----------:|
| 0 | 1.83447 | -0.708023 | 0.523362 |
| 1 | -0.297482 | -1.30804 | 1.3342 |
| 2 | 0.198321 | 1.45205 | -1.35632 |
| 3 | -0.892446 | 0.792025 | 0.449649 |
| 4 | -0.842866 | -0.228007 | -0.950897 |
If you want to calculate the zscore for all of the columns, you can just use the following:
df_zscore = (df - df.mean())/df.std()
The almost one-liner solution:
df2 = (df.ix[:,1:] - df.ix[:,1:].mean()) / df.ix[:,1:].std()
df2['ID'] = df['ID']
for Z score, we can stick to documentation instead of using 'apply' function
df_zscore = zscore(cols, axis=1)
When we are dealing with time-series, calculating z-scores (or anomalies - not the same thing, but you can adapt this code easily) is a bit more complicated. For example, you have 10 years of temperature data measured weekly. To calculate z-scores for the whole time-series, you have to know the means and standard deviations for each day of the year. So, let's get started:
Assume you have a pandas DataFrame. First of all, you need a DateTime index. If you don't have it yet, but luckily you do have a column with dates, just make it as your index. Pandas will try to guess the date format. The goal here is to have DateTimeIndex. You can check it out by trying:
type(df.index)
If you don't have one, let's make it.
df.index = pd.DatetimeIndex(df[datecolumn])
df = df.drop(datecolumn,axis=1)
Next step is to calculate mean and standard deviation for each group of days. For this, we use the groupby method.
mean = pd.groupby(df,by=[df.index.dayofyear]).aggregate(np.nanmean)
std = pd.groupby(df,by=[df.index.dayofyear]).aggregate(np.nanstd)
Finally, we loop through all the dates, performing the calculation (value - mean)/stddev; however, as mentioned, for time-series this is not so straightforward.
df2 = df.copy() #keep a copy for future comparisons
for y in np.unique(df.index.year):
for d in np.unique(df.index.dayofyear):
df2[(df.index.year==y) & (df.index.dayofyear==d)] = (df[(df.index.year==y) & (df.index.dayofyear==d)]- mean.ix[d])/std.ix[d]
df2.index.name = 'date' #this is just to look nicer
df2 #this is your z-score dataset.
The logic inside the for loops is: for a given year we have to match each dayofyear to its mean and stdev. We run this for all the years in your time-series.
Here's other way of getting Zscore using custom function:
In [6]: import pandas as pd; import numpy as np
In [7]: np.random.seed(0) # Fixes the random seed
In [8]: df = pd.DataFrame(np.random.randn(5,3), columns=["randomA", "randomB","randomC"])
In [9]: df # watch output of dataframe
Out[9]:
randomA randomB randomC
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
2 0.950088 -0.151357 -0.103219
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.443863
## Create custom function to compute Zscore
In [10]: def z_score(df):
....: df.columns = [x + "_zscore" for x in df.columns.tolist()]
....: return ((df - df.mean())/df.std(ddof=0))
....:
## make sure you filter or select columns of interest before passing dataframe to function
In [11]: z_score(df) # compute Zscore
Out[11]:
randomA_zscore randomB_zscore randomC_zscore
0 0.798350 -0.106335 0.731041
1 1.505002 1.939828 -1.577295
2 -0.407899 -0.875374 -0.545799
3 -1.207392 -0.463464 1.292230
4 -0.688061 -0.494655 0.099824
Result reproduced using scipy.stats zscore
In [12]: from scipy.stats import zscore
In [13]: df.apply(zscore) # (Credit: Manuel)
Out[13]:
randomA randomB randomC
0 0.798350 -0.106335 0.731041
1 1.505002 1.939828 -1.577295
2 -0.407899 -0.875374 -0.545799
3 -1.207392 -0.463464 1.292230
4 -0.688061 -0.494655 0.099824