Remove columns from dataframe where ALL values are

2019-01-07 03:07发布

I'm having trouble with a data frame and couldn't really resolve that issue myself:
The dataframe has arbitrary properties as columns and each row represents one data set.

The question is:
How to get rid of columns where for ALL rows the value is NA?

7条回答
Luminary・发光体
2楼-- · 2019-01-07 03:30

Try this:

df <- df[,colSums(is.na(df))<nrow(df)]
查看更多
太酷不给撩
3楼-- · 2019-01-07 03:41

I hope this may also help. It could be made into a single command, but I found it easier for me to read by dividing it in two commands. I made a function with the following instruction and worked lightning fast.

naColsRemoval = function (DataTable) { na.cols = DataTable [ , .( which ( apply ( is.na ( .SD ) , 2 , all ) ) )] DataTable [ , unlist (na.cols) := NULL , with = F] }

.SD will allow to limit the verification to part of the table, if you wish, but it will take the whole table as

查看更多
【Aperson】
4楼-- · 2019-01-07 03:42

The accepted answer does not work with non-numeric columns. From this answer, the following works with columns containing different data types

Filter(function(x) !all(is.na(x)), df)
查看更多
Fickle 薄情
5楼-- · 2019-01-07 03:46
df[sapply(df, function(x) all(is.na(x)))] <- NULL
查看更多
SAY GOODBYE
6楼-- · 2019-01-07 03:52

The two approaches offered thus far fail with large data sets as (amongst other memory issues) they create is.na(df), which will be an object the same size as df.

Here are two approaches that are more memory and time efficient

An approach using Filter

Filter(function(x)!all(is.na(x)), df)

and an approach using data.table (for general time and memory efficiency)

library(data.table)
DT <- as.data.table(df)
DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]

examples using large data (30 columns, 1e6 rows)

big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8,NA),1e6,T), sample(250,1e6,T)),simplify=F)
bd <- do.call(data.frame,big_data)
names(bd) <- paste0('X',seq_len(30))
DT <- as.data.table(bd)

system.time({df1 <- bd[,colSums(is.na(bd) < nrow(bd))]})
# error -- can't allocate vector of size ...
system.time({df2 <- bd[, !apply(is.na(bd), 2, all)]})
# error -- can't allocate vector of size ...
system.time({df3 <- Filter(function(x)!all(is.na(x)), bd)})
## user  system elapsed 
## 0.26    0.03    0.29 
system.time({DT1 <- DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]})
## user  system elapsed 
## 0.14    0.03    0.18 
查看更多
神经病院院长
7楼-- · 2019-01-07 03:53

Another way would be to use the apply() function.

If you have the data.frame

df <- data.frame (var1 = c(1:7,NA),
                  var2 = c(1,2,1,3,4,NA,NA,9),
                  var3 = c(NA)
                  )

then you can use apply() to see which columns fulfill your condition and so you can simply do the same subsetting as in the answer by Musa, only with an apply approach.

> !apply (is.na(df), 2, all)
 var1  var2  var3 
 TRUE  TRUE FALSE 

> df[, !apply(is.na(df), 2, all)]
  var1 var2
1    1    1
2    2    2
3    3    1
4    4    3
5    5    4
6    6   NA
7    7   NA
8   NA    9
查看更多
登录 后发表回答