I have a huge dataset with 679 rows and 16 columns, with 30% of the values missing. So I decided to impute these missing values with the function impute.knn from the package impute, and I got a dataset with 679 rows and 16 columns but without the missing values.
But now I want to check the accuracy using the RMSE, and I tried 2 options:
- load the package hydroGOF and apply the rmse function
- sqrt(mean((obs-sim)^2), na.rm=TRUE)
In both cases I get the error: Error in sim - obs: non-numeric argument to binary operator.
This is happening because the original data set contains NA values (some values are missing).
How can I calculate the RMSE if I remove the missing values? Then obs and sim will have different sizes.
How about simply...
sqrt( sum( (df$model - df$measure)^2 , na.rm = TRUE ) / nrow(df) )
Obviously assuming your dataframe is called df, and you have to decide on your N (i.e. nrow(df) includes the two rows with missing data; do you want to exclude these from N observations? I'd guess yes, so instead of nrow(df) you probably want to use sum( !is.na(df$measure) )), or, following @Joshua, just
sqrt( mean( (df$model-df$measure)^2 , na.rm = TRUE ) )
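To see how much the choice of N matters, here is a minimal sketch on a made-up data frame. The values are invented for illustration, and the column names model and measure simply follow the ones assumed above.

```r
# Toy data frame; the values are invented for illustration only.
df <- data.frame(
  model   = c(1.0, 2.0, 3.0, 4.0, 5.0),
  measure = c(1.5, NA,  2.0, NA,  5.0)
)

# Squared error per row; NA wherever measure is NA
sq.err <- (df$model - df$measure)^2

# N = all rows, including the two rows with missing data
rmse.all.rows <- sqrt(sum(sq.err, na.rm = TRUE) / nrow(df))

# N = only the rows where the squared error could be computed
rmse.complete <- sqrt(sum(sq.err, na.rm = TRUE) / sum(!is.na(sq.err)))

# mean(..., na.rm = TRUE) divides by the number of non-NA values,
# so it gives the same result as rmse.complete
rmse.mean <- sqrt(mean(sq.err, na.rm = TRUE))
```

In this toy case dividing by nrow(df) gives a smaller value than the other two, because the rows with missing data contribute nothing to the sum but still count towards N.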
The rmse() function in R package hydroGOF has an NA-remove parameter:
# require(hydroGOF)
rmse(sim, obs, na.rm=TRUE, ...)
which, according to the documentation, does the expected when na.rm is TRUE:
"When an ’NA’ value is found at the i-th position in obs OR sim, the i-th value
of obs AND sim are removed before the computation."
Without a minimal reproducible example, it's hard to say why that didn't work for you.
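For what it's worth, the pairwise removal described in that quote can be sketched in base R roughly like this. The vectors are made up, and the hydroGOF call is only run if the package happens to be installed.

```r
# Made-up vectors for illustration; obs has gaps, sim is complete.
obs <- c(2.0, NA, 4.0, 6.0, NA)
sim <- c(2.5, 3.0, 4.0, 5.0, 7.0)

# Keep only the positions where both obs and sim are present,
# mirroring the pairwise removal described in the quoted documentation.
keep <- !is.na(obs) & !is.na(sim)
rmse.base <- sqrt(mean((sim[keep] - obs[keep])^2))

# If hydroGOF is installed, its result should agree with the base R value.
if (requireNamespace("hydroGOF", quietly = TRUE)) {
  rmse.pkg <- hydroGOF::rmse(sim, obs, na.rm = TRUE)
}
```

One thing that may be worth checking: a "non-numeric argument to binary operator" error usually points to a non-numeric input (for example a character vector or a data frame with a character column) rather than to NA values, and is.numeric() or str() can confirm that.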
If you want to eliminate the missing values before passing the data to the hydroGOF::rmse() function, you could do:
my.rmse <- rmse(df.sim[rownames(df.obs[!is.na(df.obs$col_with_missing_data), ]), ],
                df.obs[!is.na(df.obs$col_with_missing_data), ])
assuming you have the "simulated" (imputed) and "observed" (original) data sets in different data frames named df.sim and df.obs, respectively, that were created from the same original data frame and so have the same dimensions and row names.
Here is a canonical way to do the same thing if you have more than one column with missing data:
rows.wout.missing.values <- with(df.obs, rownames(df.obs[!is.na(col_with_missing_data1) & !is.na(col_with_missing_data2) & !is.na(col_with_missing_data3),]))
my.rmse <- rmse(df.sim[rows.wout.missing.values,], df.obs[rows.wout.missing.values,])
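If every column should be checked for missing values, base R's complete.cases() is a shorter way to build the same row filter. The sketch below uses made-up data frames with the same names as above and computes the RMSE per column in base R; the column names a and b are purely illustrative.

```r
# Made-up "observed" data with gaps and a matching "simulated" (imputed) version.
df.obs <- data.frame(a = c(1, NA, 3, 4), b = c(10, 20, NA, 40))
df.sim <- data.frame(a = c(1.5, 2, 3, 4), b = c(10, 20, 30, 38))

# TRUE for the rows of df.obs that have no NA in any column
rows.complete <- complete.cases(df.obs)

# Per-column RMSE computed over the complete rows only
rmse.by.col <- sapply(names(df.obs), function(col) {
  sqrt(mean((df.sim[rows.complete, col] - df.obs[rows.complete, col])^2))
})
```

Here rows.complete is a logical vector, so it can be used directly to index both data frames as long as their rows are in the same order.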