I am trying to replace the NAs in each column of a matrix with the median of that column. However, when I try to use lapply or sapply I get an error. The code works when I use a for-loop and when I change one column at a time. What am I doing wrong?
Example:
set.seed(1928)
mat <- matrix(rnorm(100*110), ncol = 110)
mat[sample(1:length(mat), 700, replace = FALSE)] <- NA
mat1 <- mat2 <- mat
mat1 <- lapply(mat1,
  function(n) {
    mat1[is.na(mat1[, n]), n] <- median(mat1[, n], na.rm = TRUE)
  }
)
for (n in 1:ncol(mat2)) {
  mat2[is.na(mat2[, n]), n] <- median(mat2[, n], na.rm = TRUE)
}
I would suggest vectorizing this using the matrixStats package instead of calculating a median per column with either of the loops (sapply is also a loop, in the sense that it evaluates a function in each iteration).
First, we will create an index of the NAs:
indx <- which(is.na(mat), arr.ind = TRUE)
Then, replace the NAs with the precalculated column medians, picking each median by the column part of the index:
mat[indx] <- matrixStats::colMedians(mat, na.rm = TRUE)[indx[, 2]]
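As a quick sanity check, the sketch below rebuilds a smaller version of the example data and confirms that the index-based replacement gives the same result as the for-loop from the question. It uses base R's apply() for the medians so that it runs without matrixStats installed; that substitution is an assumption made for portability, not part of the approach above, and the names col_med, filled and looped are only illustrative.

```r
set.seed(1928)
mat <- matrix(rnorm(10 * 5), ncol = 5)
mat[sample(length(mat), 8)] <- NA

# Row/column positions of every NA (two-column matrix: "row", "col")
indx <- which(is.na(mat), arr.ind = TRUE)

# Column medians; base R stand-in for matrixStats::colMedians()
col_med <- apply(mat, 2, median, na.rm = TRUE)

# Vectorized replacement: look up each NA's median by its column index
filled <- mat
filled[indx] <- col_med[indx[, "col"]]

# Reference result from the for-loop in the question
looped <- mat
for (n in seq_len(ncol(looped))) {
  looped[is.na(looped[, n]), n] <- median(looped[, n], na.rm = TRUE)
}

stopifnot(!anyNA(filled), identical(filled, looped))
```

If matrixStats is available, swapping the apply() line for matrixStats::colMedians(mat, na.rm = TRUE) should leave the rest unchanged.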
You can use sweep:
sweep(mat, MARGIN = 2,
STATS = apply(mat, 2, median, na.rm=TRUE),
FUN = function(x,s) ifelse(is.na(x), s, x)
)
EDIT: You can also drop in STATS = matrixStats::colMedians(mat, na.rm = TRUE) for a little more performance.
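Note that sweep() returns a new matrix and leaves mat itself unchanged, so the result has to be assigned to something. A minimal sketch on a small made-up matrix (the names m and m_filled are only illustrative):

```r
m <- matrix(c(1, NA, 3, 4,
              NA, 6, 8, 10), ncol = 2)

# Replace each NA with the median of its column; the result is a new matrix
m_filled <- sweep(m, MARGIN = 2,
                  STATS = apply(m, 2, median, na.rm = TRUE),
                  FUN = function(x, s) ifelse(is.na(x), s, x))

anyNA(m)         # TRUE: the original still contains NAs
anyNA(m_filled)  # FALSE: the assigned result does not
```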
lapply loops over the elements of a list or vector, so passing it a matrix visits every cell rather than every column. Did you mean to loop over the columns?
matx <- sapply(seq_len(ncol(mat1)), function(n) {
  col <- mat1[, n]
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col  # return the filled column so sapply can bind the columns into a matrix
})
though that's essentially just doing what your loop example does (and is not necessarily any faster, since sapply is itself a loop).
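To see why the original lapply call misbehaves: a matrix is an atomic vector with a dim attribute, so lapply() calls the function once per cell, and n is then a cell value rather than a column index. In addition, an assignment to the matrix inside the function only changes a local copy; what lapply or sapply collects is the function's return value. A small illustration with a made-up matrix m (asplit() needs R 3.6.0 or later):

```r
m <- matrix(1:6, ncol = 2)

n_cells   <- length(lapply(m, identity))                   # 6: one call per cell
n_indices <- length(lapply(seq_len(ncol(m)), identity))    # 2: one call per column index
n_columns <- length(lapply(asplit(m, 2), identity))        # 2: one call per column vector

# Assigning inside the function modifies a local copy only
invisible(lapply(seq_len(ncol(m)), function(n) m[1, n] <- 0L))
m[1, ]  # still 1 4
```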
You could possibly get there more easily by converting to a data.frame and getting a matrix back as the result, using vapply:
vapply(as.data.frame(mat1), function(x)
replace(x, is.na(x), median(x,na.rm=TRUE)), FUN.VALUE=numeric(nrow(mat1))
)
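One thing to be aware of with this route: as.data.frame() gives an unnamed matrix default column names (V1, V2, ...), and vapply() carries them over as the column names of the result. A small sketch with made-up data; unname() is only needed if the result has to match an unnamed original exactly:

```r
m <- matrix(c(1, NA, 3, 4,
              NA, 6, 8, 10), ncol = 2)

# Each data.frame column is filled and returned as a numeric vector of length nrow(m)
res <- vapply(as.data.frame(m), function(x)
  replace(x, is.na(x), median(x, na.rm = TRUE)),
  FUN.VALUE = numeric(nrow(m)))

colnames(res)        # "V1" "V2"
res_plain <- unname(res)  # drop the dimnames to get a bare 4 x 2 matrix
```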