I have been playing with pandas lately, and now I am trying to replace the NaN values inside a DataFrame with different random values drawn from a normal distribution.
Assume I have this CSV file without a header:
0
0 343
1 483
2 101
3 NaN
4 NaN
5 NaN
My expected result should be something like this
0
0 343
1 483
2 101
3 randomnumber1
4 randomnumber2
5 randomnumber3
But instead I got the following :
0
0 343
1 483
2 101
3 randomnumber1
4 randomnumber1
5 randomnumber1 # all NaN filled with same number
My code so far
import numpy as np
import pandas as pd
df = pd.read_csv("testfile.csv", header=None)
mu, sigma = df.mean(), df.std()
norm_dist = np.random.normal(mu, sigma, 1)
for i in norm_dist:
    print df.fillna(i)
I am thinking of counting the NaN rows in the DataFrame and replacing the 1 in np.random.normal(mu, sigma, 1)
with that count, so that each NaN gets a different value.
But I want to ask whether there is a simpler way to do this.
Thank you for your help and suggestion.
Here's one way working with underlying array data -
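def fillNaN_with_unifrand(df):
    a = df.values                      # underlying NumPy array
    m = np.isnan(a)                    # mask of NaNs
    mu, sigma = df.mean(), df.std()    # statistics ignore NaNs by default
    # one normal draw per NaN, written back through the mask
    a[m] = np.random.normal(mu, sigma, size=m.sum())
    return df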
In essence, we are generating all random numbers in one go with the count of NaNs using the size param with np.random.normal
and assigning them in one go with the mask of the NaNs again.
Sample run -
In [435]: df
Out[435]:
0
0 343.0
1 483.0
2 101.0
3 NaN
4 NaN
5 NaN
In [436]: fillNaN_with_unifrand(df)
Out[436]:
0
0 343.000000
1 483.000000
2 101.000000
3 138.586483
4 223.454469
5 204.464514
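The function above writes through df.values, which works for a single-dtype frame like this one but may not work for mixed dtypes or when pandas' copy-on-write mode is enabled. Here is a minimal sketch of a column-by-column variant that avoids that assumption and uses each column's own mean and standard deviation. The name fill_nan_normal_per_column, the seed argument and the two-column sample frame are made up for illustration:

```python
import numpy as np
import pandas as pd

def fill_nan_normal_per_column(df, seed=None):
    """Return a copy of df with each NaN replaced by a draw from
    N(column mean, column std). Hypothetical helper, numeric columns only."""
    rng = np.random.default_rng(seed)
    out = df.copy()                      # leave the caller's frame untouched
    for col in out.columns:
        mask = out[col].isna()           # True where the value is missing
        n_missing = int(mask.sum())
        if n_missing == 0:
            continue
        mu, sigma = out[col].mean(), out[col].std()
        # one independent draw per missing entry in this column
        out.loc[mask, col] = rng.normal(mu, sigma, size=n_missing)
    return out

df = pd.DataFrame({0: [343.0, 483.0, 101.0, np.nan, np.nan, np.nan],
                   1: [1.0, np.nan, 3.0, 4.0, np.nan, 6.0]})
filled = fill_nan_normal_per_column(df, seed=0)
```

Passing a seed only makes the draws repeatable; leave it as None for different values on every run.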
It is simple to impute random values in place of missing values in a pandas DataFrame column.
mean = df['column'].mean()
std = df['column'].std()
def fill_missing_from_Gaussian(column_val):
    # Draw a single scalar for a missing entry; keep existing values unchanged.
    if np.isnan(column_val):
        column_val = np.random.normal(mean, std)
    return column_val
Now just apply the above method to a column with missing values.
df['column'] = df['column'].apply(fill_missing_from_Gaussian)
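The apply approach above reads mean and std from module-level variables. One possible way to keep them together with the fill function is a small factory, sketched here under the assumption of a single numeric column. make_gaussian_filler and its seed argument are hypothetical names, not part of pandas:

```python
import numpy as np
import pandas as pd

def make_gaussian_filler(series, seed=None):
    """Hypothetical factory: returns a function suitable for Series.apply."""
    rng = np.random.default_rng(seed)
    mean, std = series.mean(), series.std()   # computed once, NaNs skipped

    def fill(value):
        # draw a fresh scalar for every missing entry, pass others through
        return rng.normal(mean, std) if pd.isna(value) else value

    return fill

df = pd.DataFrame({"column": [343.0, 483.0, 101.0, np.nan, np.nan, np.nan]})
df["column"] = df["column"].apply(make_gaussian_filler(df["column"], seed=1))
```

Because apply calls the function once per element, this is likely slower than a mask-based assignment on long columns, but it is easy to read.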
I think you need:
mu, sigma = df.mean(), df.std()
#get mask of NaNs
a = df[0].isnull()
#draw one random value per NaN; sum() counts each True as 1
norm_dist = np.random.normal(mu, sigma, a.sum())
print (norm_dist)
[ 184.90581318 364.89367364 181.46335348]
#assign values by mask
df.loc[a, 0] = norm_dist
print (df)
0
0 343.000000
1 483.000000
2 101.000000
3 184.905813
4 364.893674
5 181.463353
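As a variation on the mask-based assignment, fillna also accepts a Series and aligns it on the index, so each NaN can pick up its own value. This is a rough sketch, assuming the same single-column data. The name candidates is made up, and the approach draws one value per row even though only the NaN positions are used:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({0: [343.0, 483.0, 101.0, np.nan, np.nan, np.nan]})

mu, sigma = df[0].mean(), df[0].std()

# one candidate per row, indexed like the frame so fillna can align them
candidates = pd.Series(rng.normal(mu, sigma, size=len(df)), index=df.index)

# only the NaN positions take a value from candidates
df[0] = df[0].fillna(candidates)
```

This wastes a few draws compared with counting the NaNs first, which is negligible for small frames.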