How to perform RMSE with missing values?

I have a dataset with 679 rows and 16 columns, with about 30% missing values. I imputed these missing values with the function impute.knn from the package impute and got a dataset with 679 rows and 16 columns and no missing values.

But now I want to check the accuracy using the RMSE, and I tried 2 options:

1. load the package hydroGOF and apply the rmse function
2. sqrt(mean((obs - sim)^2, na.rm = TRUE))

In both cases I get the error: Error in sim - obs : non-numeric argument to binary operator.

This is happening because the original data set contains an NA value (some values are missing).

How can I calculate the RMSE if I remove the missing values? Then obs and sim will have different sizes.

Tags: r hydrogof

2 Answers

smile是对你的礼貌

Reply #2 · 2020-08-14 08:03

How about simply...

sqrt( sum( (df$model - df$measure)^2 , na.rm = TRUE ) / nrow(df) )

This assumes your data frame is called df. You also have to decide on your N: nrow(df) includes the rows with missing data. Do you want to exclude these from the N observations? I'd guess yes, so instead of nrow(df) you probably want to use sum( !is.na(df$measure) ), or, following @Joshua, just

sqrt( mean( (df$model-df$measure)^2 , na.rm = TRUE ) )
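
A small base-R sketch of how those choices of N differ. The data frame df and its values are made up for illustration only.

```r
# Hypothetical data: 'measure' has two missing values
df <- data.frame(
  model   = c(1, 2, 3, 4, 5),
  measure = c(1, NA, 5, NA, 5)
)

# Squared error per row; NA wherever either value is NA
sq.err <- (df$model - df$measure)^2

# N counts every row, including those with missing data
rmse.all.rows <- sqrt(sum(sq.err, na.rm = TRUE) / nrow(df))

# N counts only the rows where a squared error is available
rmse.complete <- sqrt(sum(sq.err, na.rm = TRUE) / sum(!is.na(sq.err)))

# mean() with na.rm = TRUE divides by the same N as rmse.complete
rmse.mean <- sqrt(mean(sq.err, na.rm = TRUE))
```

Dividing by nrow(df) gives a smaller number here because the rows with missing data count toward N without contributing any error.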


傲

Reply #3 · 2020-08-14 08:06

The rmse() function in R package hydroGOF has an NA-remove parameter:

# require(hydroGOF)
rmse(sim, obs, na.rm=TRUE, ...)

which, according to the documentation, does what you would expect when na.rm is TRUE:

"When an ’NA’ value is found at the i-th position in obs OR sim, the i-th value of obs AND sim are removed before the computation."

Without a minimal reproducible example, it's hard to say why that didn't work for you.
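
One possible cause, offered only as a guess since the data is not shown: that error message normally means one of the operands is not numeric (for example, a column stored as character), not that it contains NA. A base-R sketch with made-up vectors:

```r
# Hypothetical vectors, for illustration only
obs <- c(1.0, NA, 3.0, 4.0)     # numeric, with one missing value
sim <- c(1.5, 2.0, 2.5, 4.0)

# NA alone does not raise an error; the result is NA at that position
diff.numeric <- sim - obs

# A character vector (e.g. a column read in as text) does raise an error
obs.chr <- c("1.0", NA, "3.0", "4.0")
result <- tryCatch(sim - obs.chr, error = function(e) e)

# Converting to numeric first lets the computation go through
obs.fixed <- as.numeric(obs.chr)
rmse.fixed <- sqrt(mean((sim - obs.fixed)^2, na.rm = TRUE))
```

If this is the problem, str(df) or sapply(df, class) should show which columns are not numeric.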

If you want to eliminate the missing values before passing the data to hydroGOF::rmse(), you could do:

my.rmse <- rmse(df.sim[rownames(df.obs[!is.na(df.obs$col_with_missing_data), ]), ],
                df.obs[!is.na(df.obs$col_with_missing_data), ])

This assumes you have the "simulated" (imputed) and "observed" (original) data sets in separate data frames named df.sim and df.obs, respectively, both created from the same original data frame, so that they have the same dimensions and row names.

Here is a canonical way to do the same thing if you have more than one column with missing data:

rows.wout.missing.values <- with(df.obs, rownames(df.obs[
    !is.na(col_with_missing_data1) &
    !is.na(col_with_missing_data2) &
    !is.na(col_with_missing_data3), ]))
my.rmse <- rmse(df.sim[rows.wout.missing.values,], df.obs[rows.wout.missing.values,])
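
If every column of df.obs should be checked, base R's complete.cases() can build the row filter without naming each column. This is a sketch with made-up data frames. The RMSE is computed by hand so the example does not depend on hydroGOF, and it pools all columns into a single number, which is a choice made for this sketch.

```r
# Hypothetical observed data (with NAs) and simulated data of the same shape
df.obs <- data.frame(a = c(1, NA, 3, 4), b = c(2, 2, NA, 4))
df.sim <- data.frame(a = c(1, 2, 3, 5),  b = c(2, 3, 3, 4))

# TRUE for rows of df.obs that have no NA in any column
keep <- complete.cases(df.obs)

# Squared errors over the retained rows, all columns pooled together
sq.err <- (as.matrix(df.sim[keep, ]) - as.matrix(df.obs[keep, ]))^2
my.rmse <- sqrt(mean(sq.err))
```

The logical vector keep can be used the same way to subset both data frames before calling rmse().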
