How to calculate RMSE with missing values?

Posted 2020-08-14 07:06

Question:

I have a dataset with 679 rows and 16 columns in which 30% of the values are missing. I decided to impute these missing values with the function impute.knn from the package impute, and I got a dataset with 679 rows and 16 columns but without the missing values.

Now I want to check the accuracy using the RMSE, and I tried two options:

  1. load the package hydroGOF and apply the rmse function
  2. sqrt(mean((obs - sim)^2, na.rm = TRUE))

In both cases I get the error: Error in sim - obs : non-numeric argument to binary operator.

This is happening because the original data set contains NA values (some values are missing).

How can I calculate the RMSE if I remove the missing values? Then obs and sim will have different sizes.

Answer 1:

How about simply...

sqrt( sum( (df$model - df$measure)^2 , na.rm = TRUE ) / nrow(df) )

This obviously assumes your data frame is called df. You also have to decide on your N: nrow(df) includes the rows with missing data. Do you want to exclude these from the N observations? I'd guess yes, so instead of nrow(df) you probably want to use sum(!is.na(df$measure)) or, following @Joshua, just

sqrt( mean( (df$model-df$measure)^2 , na.rm = TRUE ) )
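
To make the choice of N concrete, here is a small sketch with made-up numbers. The data frame df and its columns model and measure follow the naming assumed in this answer; the values themselves are purely illustrative.

```r
# Toy data, invented for illustration: two of five measurements are missing.
df <- data.frame(
  model   = c(1.0, 2.0, 3.0, 4.0, 5.0),
  measure = c(1.5, NA, 2.0, NA, 5.5)
)

sq.err <- (df$model - df$measure)^2   # NA wherever measure is NA

# Denominator = all rows, including those with missing data
rmse.all.rows <- sqrt(sum(sq.err, na.rm = TRUE) / nrow(df))

# Denominator = only rows where a squared error is available
rmse.complete <- sqrt(sum(sq.err, na.rm = TRUE) / sum(!is.na(sq.err)))

# mean(..., na.rm = TRUE) drops the NAs first, so it matches rmse.complete
rmse.mean <- sqrt(mean(sq.err, na.rm = TRUE))
```

Dividing by nrow(df) gives the smaller value here, because the rows with missing data add nothing to the sum but still count in the denominator.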


Answer 2:

The rmse() function in R package hydroGOF has an NA-remove parameter:

# require(hydroGOF)
rmse(sim, obs, na.rm=TRUE, ...)

which, according to the documentation, does what you would expect when na.rm is TRUE:

"When an ’NA’ value is found at the i-th position in obs OR sim, the i-th value of obs AND sim are removed before the computation."

Without a minimal reproducible example, it's hard to say why that didn't work for you.
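
A reproducible example along these lines might look like the sketch below. It uses base R only: rmse.pairwise is a hypothetical helper that mimics the documented pairwise removal (it is not hydroGOF's code), and the vectors are invented. It also illustrates one possible cause of the reported error, namely a non-numeric input; whether that applies to the original data is an assumption.

```r
# Hypothetical helper mirroring the documented behaviour (not hydroGOF's code):
# positions where obs OR sim is NA are dropped from both before computing RMSE.
rmse.pairwise <- function(sim, obs) {
  keep <- !is.na(sim) & !is.na(obs)
  sqrt(mean((sim[keep] - obs[keep])^2))
}

sim <- c(2.0, 4.0, 5.5, 8.0)   # invented "imputed" values
obs <- c(1.0, NA, 5.0, 6.0)    # invented "original" values with one NA

# NA values alone do not raise an error.
result <- rmse.pairwise(sim, obs)

# A non-numeric vector does: the subtraction itself fails.
obs.chr <- as.character(obs)
err.msg <- tryCatch(rmse.pairwise(sim, obs.chr),
                    error = function(e) conditionMessage(e))

# Converting back to numeric makes the computation work again.
result.fixed <- rmse.pairwise(sim, as.numeric(obs.chr))
```

In an English locale err.msg reads 'non-numeric argument to binary operator', the same message reported in the question, so checking the column types with str(df.obs) or sapply(df.obs, class) may be a useful first step.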

If you want to eliminate the missing values before you input to the hydroGOF::rmse() function, you could do:

my.rmse <- rmse(df.sim[rownames(df.obs[!is.na(df.obs$col_with_missing_data), ]), ],
                df.obs[!is.na(df.obs$col_with_missing_data), ])

This assumes you have the "simulated" (imputed) and "observed" (original) data sets in different data frames named df.sim and df.obs, respectively, and that both were created from the same original data frame, so they have the same dimensions and row names.

Here is a canonical way to do the same thing if you have more than one column with missing data:

rows.wout.missing.values <- with(df.obs, rownames(df.obs[!is.na(col_with_missing_data1) & !is.na(col_with_missing_data2) & !is.na(col_with_missing_data3),]))
my.rmse <- rmse(df.sim[rows.wout.missing.values,], df.obs[rows.wout.missing.values,])
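
When many columns are involved, base R's complete.cases() is a shorter way to build the same row selection. The sketch below uses invented data frames df.obs and df.sim with matching row names, and it computes a per-column RMSE directly in base R rather than through hydroGOF, so it does not rely on how rmse() treats data frames.

```r
# Invented example data: df.obs has missing values, df.sim is fully filled in.
df.obs <- data.frame(a = c(1, NA, 3, 4), b = c(2, 5, NA, 8))
df.sim <- data.frame(a = c(1.5, 2, 3, 3), b = c(2, 5, 6, 9))

# Names of the rows of df.obs that have no NA in any column
keep <- complete.cases(df.obs)
rows.wout.missing.values <- rownames(df.obs)[keep]

# Per-column RMSE over the retained rows only
err <- df.sim[rows.wout.missing.values, ] - df.obs[rows.wout.missing.values, ]
rmse.by.col <- sqrt(colMeans(err^2))
```

Here only rows 1 and 4 are retained, and rmse.by.col holds one RMSE value per column.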


Tags: r hydrogof