Remove rows with all or some NAs (missing values)

2018-12-31 01:24发布

I'd like to remove the lines in this data frame that:

a) contain NAs across all columns. Below is my example data frame.

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   NA
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   NA   NA
4 ENSG00000207604    0   NA   NA   1    2
5 ENSG00000207431    0   NA   NA   NA   NA
6 ENSG00000221312    0   1    2    3    2

Basically, I'd like to get a data frame such as the following.

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
6 ENSG00000221312    0   1    2    3    2

b) contain NAs in only some columns, so I can also get this result:

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
4 ENSG00000207604    0   NA   NA   1    2
6 ENSG00000221312    0   1    2    3    2

15条回答
梦该遗忘
2楼-- · 2018-12-31 02:04

If you want control over how many NAs are valid for each row, try this function. For many survey data sets, too many blank question responses can ruin the results. So they are deleted after a certain threshold. This function will allow you to choose how many NAs the row can have before it's deleted:

delete.na <- function(DF, n=0) {
  DF[rowSums(is.na(DF)) <= n,]
}

By default, it will eliminate all NAs:

delete.na(final)
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

Or specify the maximum number of NAs allowed:

delete.na(final, 2)
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2
查看更多
忆尘夕之涩
3楼-- · 2018-12-31 02:05

Try na.omit(your.data.frame). As for the second question, try posting it as another question (for clarity).

查看更多
零度萤火
4楼-- · 2018-12-31 02:08

Another option if you want greater control over how rows are deemed to be invalid is

final <- final[!(is.na(final$rnor)) | !(is.na(rawdata$cfam)),]

Using the above, this:

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   2
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   2   NA
4 ENSG00000207604    0   NA   NA   1    2
5 ENSG00000207431    0   NA   NA   NA   NA
6 ENSG00000221312    0   1    2    3    2

Becomes:

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   2
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   2   NA
4 ENSG00000207604    0   NA   NA   1    2
6 ENSG00000221312    0   1    2    3    2

...where only row 5 is removed since it is the only row containing NAs for both rnor AND cfam. The boolean logic can then be changed to fit specific requirements.

查看更多
骚的不知所云
5楼-- · 2018-12-31 02:08

Using dplyr package we can filter NA as follows:

dplyr::filter(df,  !is.na(columnname))
查看更多
无与为乐者.
6楼-- · 2018-12-31 02:09
delete.dirt <- function(DF, dart=c('NA')) {
  dirty_rows <- apply(DF, 1, function(r) !any(r %in% dart))
  DF <- DF[dirty_rows, ]
}

mydata <- delete.dirt(mydata)

Above function deletes all the rows from the data frame that has 'NA' in any column and returns the resultant data. If you want to check for multiple values like NA and ? change dart=c('NA') in function param to dart=c('NA', '?')

查看更多
明月照影归
7楼-- · 2018-12-31 02:11

I prefer following way to check whether rows contain any NAs:

row.has.na <- apply(final, 1, function(x){any(is.na(x))})

This returns logical vector with values denoting whether there is any NA in a row. You can use it to see how many rows you'll have to drop:

sum(row.has.na)

and eventually drop them

final.filtered <- final[!row.has.na,]

For filtering rows with certain part of NAs it becomes a little trickier (for example, you can feed 'final[,5:6]' to 'apply'). Generally, Joris Meys' solution seems to be more elegant.

查看更多
登录 后发表回答