How do I add random `NA`s into a data frame

I created a data frame with random values

n <- 50
df <- data.frame(id = seq_len(n),
                 age = sample(20:90, n, replace = TRUE),
                 sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45))
)

and would like to introduce a few NA values to simulate real-world data. I am trying to use apply but cannot get there. The line

apply(subset(df,select=-id), 2, function(x) {x[sample(c(1:n),floor(n/10))]})

will retrieve random values alright, but

apply(subset(df,select=-id), 2, function(x) {x[sample(c(1:n),floor(n/10))]<-NA})

will not set them to NA. I have tried with and within, too.

Brute force works:

for (i in seq_len(floor(n/10))) {
  df[sample(seq_len(n), 1), sample(2:ncol(df), 1)] <- NA
}

But I'd prefer to use the apply family.

Tags: r dataframe apply

5 answers

对你真心纯属浪费

#2 · 2019-02-16 19:24

Simply pass your data frame into the following function. The only arguments are the frame you want to add NAs to and the number of features (columns) that should receive NAs.

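add_random_nas_to_frame <- function(frame, num_features) {
   col_order <- names(frame)
   # Pick the columns that will receive NAs
   rand_cols <- sample(ncol(frame), num_features)
   left_overs <- setdiff(seq_len(ncol(frame)), rand_cols)
   # drop = FALSE keeps a data frame even when only one column is selected
   other_frame <- frame[, left_overs, drop = FALSE]
   # Indexing with NA yields NA at that position, so each column keeps its length
   nas_added <- data.frame(lapply(frame[, rand_cols, drop = FALSE], function(x) x[sample(c(TRUE, NA), prob = c(sample(100, 1)/100, 0.15), size = length(x), replace = TRUE)]))
   final_frame <- cbind(other_frame, nas_added)
   # Restore the original column order
   final_frame <- final_frame[, col_order, drop = FALSE]
   return(final_frame)
}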

For example, using the full Bank Marketing dataset from UCI:

https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

bank <- read.table(file = 'path_to_data', sep = ";", stringsAsFactors = FALSE, header = TRUE)

Checking the original frame for missing data, we can see that there is none.

Now applying our function:

bank_nas <- add_random_nas_to_frame(bank, 5)
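The missing-data check itself is not shown above. As a minimal sketch, assuming a plain per-column count is what is wanted, colSums(is.na(...)) would do it. The toy frame below is made up for illustration and is not the bank data:

```r
# Hypothetical toy frame standing in for a frame such as bank_nas
toy <- data.frame(a = c(1, NA, 3),
                  b = c("x", "y", NA),
                  c = 1:3)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

The same call on the frame before and after the function is applied should show which columns received NAs.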


SAY GOODBYE

#3 · 2019-02-16 19:26

I think you need to return the x value from the function:

apply(subset(df, select = -id), 2, function(x) {
  x[sample(1:n, floor(n/10))] <- NA
  x
})

but you also need to assign this back to the relevant columns of the data frame (and subset(...) <- ... doesn't work):

idCol <- names(df) == "id"
df[, !idCol] <- apply(df[, !idCol], 2, function(x) {
  x[sample(1:n, floor(n/10))] <- NA
  x
})

(If you have only a single non-ID column, you'll need df[, !idCol, drop = FALSE].)
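To see why returning x matters, here is a small sketch with two made-up helper functions (the names are hypothetical). A function returns the value of its last expression, and the value of an assignment like x[2] <- NA is just the right-hand side. The function also works on a copy, so the caller's vector is untouched either way:

```r
v <- c(10, 20, 30)

# Hypothetical helpers, only for illustration
set_na_only   <- function(x) { x[2] <- NA }      # last expression is the assignment
set_na_return <- function(x) { x[2] <- NA; x }   # last expression is the modified copy

res_only   <- set_na_only(v)     # a single NA, not the vector
res_return <- set_na_return(v)   # the vector with its second element set to NA

print(res_only)
print(res_return)
print(v)                         # unchanged: the functions modified a local copy
```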


ゆ、 Hurt°

#4 · 2019-02-16 19:27

Return x within your function:

> df <- apply (df, 2, function(x) {x[sample( c(1:n), floor(n/10))] <- NA; x} )
> tail(df)
      id   age  sex
[45,] "45" "41" NA 
[46,] "46" NA   "f"
[47,] "47" "38" "f"
[48,] "48" "32" "f"
[49,] "49" "53" NA 
[50,] "50" "74" "f"


倾城　Initia

#5 · 2019-02-16 19:43

apply returns an array (here a character matrix), thereby converting all columns to the same type. You could use this instead:

df[,-1] <- do.call(cbind.data.frame, 
                   lapply(df[,-1], function(x) {
                     x[sample(c(1:n),floor(n/10))]<-NA
                     x
                   })
                   )

Or use a for loop:

for (i in seq_along(df[,-1])+1) {
  is.na(df[sample(seq_len(n), floor(n/10)),i]) <- TRUE
}
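A quick way to check the type issue is sketched below. It rebuilds a frame like the one in the question (the seed and stringsAsFactors = FALSE are my additions) and uses a slightly shorter lapply form than the one above, so treat it as an illustration rather than the exact code from this answer:

```r
set.seed(1)  # assumption: fixed seed, only so the run is repeatable
n <- 50
df <- data.frame(id  = seq_len(n),
                 age = sample(20:90, n, replace = TRUE),
                 sex = sample(c("m", "f"), n, replace = TRUE),
                 stringsAsFactors = FALSE)

# apply() goes through as.matrix(), so mixed columns all become character
m <- apply(df, 2, identity)
print(is.character(m))

# lapply() works column by column, so each column keeps its own type
df[-1] <- lapply(df[-1], function(x) {
  x[sample(n, floor(n / 10))] <- NA
  x
})

print(sapply(df, class))
print(colSums(is.na(df)))
```

Each non-id column ends up with floor(n/10) NAs because sample() draws the row indices without replacement.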


走好不送

#6 · 2019-02-16 19:46

Here is another simple way to go about it.

Your data frame:

df <- mtcars

Number of missing values required:

nbr_missing <- 20

Sample row and column indices:

y <- data.frame(row = sample(nrow(df), size = nbr_missing, replace = TRUE),
                col = sample(ncol(df), size = nbr_missing, replace = TRUE))

Remove duplicates:

y <- y[!duplicated(y), ]

Use matrix indexing:

df[as.matrix(y)] <- NA
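Note that dropping duplicated row/column pairs can leave fewer NAs than requested. If an exact count matters, one possible variation (my own sketch, not part of the answer above) is to sample distinct cell numbers and convert them to row/column pairs with arrayInd:

```r
set.seed(42)  # assumption: fixed seed, only so the run is repeatable
df <- mtcars
nbr_missing <- 20

# Sample distinct cell positions (no replacement), numbered column by column
cells <- sample(nrow(df) * ncol(df), nbr_missing)

# Convert the cell numbers into a two-column (row, col) index matrix
idx <- arrayInd(cells, dim(df))

# Matrix indexing sets exactly those cells to NA
df[idx] <- NA

print(sum(is.na(df)))
```

Because the cells are drawn without replacement, no de-duplication step is needed.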
