How to delete rows from a dataframe that contain n

I have a number of large datasets with ~10 columns and ~200,000 rows. Not all columns contain values for each row, although at least one column must contain a value for the row to be present. I would like to set a threshold for how many NAs are allowed in a row.

My Dataframe looks something like this:

 ID q  r  s  t  u  v  w  x  y  z
 A  1  5  NA 3  8  9  NA 8  6  4
 B  5  NA 4  6  1  9  7  4  9  3 
 C  NA 9  4  NA 4  8  4  NA 5  NA
 D  2  2  6  8  4  NA 3  7  1  32

I would like to delete the rows that contain more than 2 NA cells, to get:

ID q  r  s  t  u  v  w  x  y  z
 A 1  5  NA 3  8  9  NA 8  6  4
 B 5  NA 4  6  1  9  7  4  9  3 
 D 2  2  6  8  4  NA 3  7  1  32

complete.cases removes all rows containing any NA, and I know one can delete rows that contain NA in certain columns. Is there a way to filter not on which columns contain NA, but on how many NAs a row contains in total?

Alternatively, this dataframe is generated by merging several dataframes using

    file1<-read.delim("~/file1.txt")
    file2<-read.delim(file=args[1])

    file1<-merge(file1,file2,by="chr.pos",all=TRUE)

Perhaps the merge function could be altered?

Thanks

4 Answers

做个烂人

#2 · 2019-01-04 03:47

If dat is the name of your data.frame, the following will return what you're looking for (rows with more than 2 NAs are dropped, so rows with at most 2 are kept):

    keep <- rowSums(is.na(dat)) <= 2
    dat <- dat[keep, ]

What this is doing:

    is.na(dat)
    # returns a logical matrix (TRUE/FALSE)
    # note that when adding logicals
    # TRUE == 1 and FALSE == 0

    rowSums(.)
    # quickly computes the total per row,
    # which here is the number of NAs
    # in each row

    rowSums(.) <= 2
    # for each row, determine whether the sum
    # (which is the number of NAs) is at most
    # 2 or not. Returns TRUE/FALSE accordingly

We use the output of this last statement to identify which rows to keep. Note that it is not necessary to actually store this last logical vector; the expression can be used directly as the row index.
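As a quick check, here is a minimal sketch that rebuilds the four example rows from the question and applies the row-count filter. The question asks to drop rows with more than 2 NAs, so the sketch keeps rows where the count is at most 2. The variable names `na_per_row` and `filtered` are made up for illustration.

```r
# Rebuild the example data frame from the question (values copied from the post)
dat <- data.frame(
  ID = c("A", "B", "C", "D"),
  q  = c(1, 5, NA, 2),
  r  = c(5, NA, 9, 2),
  s  = c(NA, 4, 4, 6),
  t  = c(3, 6, NA, 8),
  u  = c(8, 1, 4, 4),
  v  = c(9, 9, 8, NA),
  w  = c(NA, 7, 4, 3),
  x  = c(8, 4, NA, 7),
  y  = c(6, 9, 5, 1),
  z  = c(4, 3, NA, 32),
  stringsAsFactors = FALSE
)

# Count the NAs in each row: is.na() gives a logical matrix, rowSums() adds it up
na_per_row <- rowSums(is.na(dat))

# Keep rows with at most 2 NAs (i.e. drop rows with more than 2)
filtered <- dat[na_per_row <= 2, ]
print(filtered$ID)
```

Row C has four NAs and is the only row dropped; row A has exactly two and is kept, which matches the desired output in the question.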


聊天终结者

#3 · 2019-01-04 03:56

Use rowSums. To remove rows from a data frame (df) that contain precisely n NA values:

    df <- df[rowSums(is.na(df)) != n, ]

or to remove rows that contain n or more NA values:

    df <- df[rowSums(is.na(df)) < n, ]

In both cases, replace n with the required number. For the question above (drop rows with more than 2 NAs), that means n = 3 in the second form.
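If the threshold changes between datasets, the filter can be wrapped in a small function. This is only a sketch: `keep_rows_max_na` and the `toy` data frame are hypothetical names and data, not from the thread. It takes the maximum number of NAs allowed per row, so "remove rows with n or more NAs" corresponds to `max_na = n - 1`.

```r
# Hypothetical helper: keep rows that have at most `max_na` missing values.
keep_rows_max_na <- function(df, max_na) {
  stopifnot(is.data.frame(df), is.numeric(max_na), length(max_na) == 1)
  # drop = FALSE keeps the result a data frame even if df has a single column
  df[rowSums(is.na(df)) <= max_na, , drop = FALSE]
}

# Made-up data: rows 1 and 3 have one NA each, row 2 has three
toy <- data.frame(a = c(1, NA, NA), b = c(NA, NA, 2), c = c(3, NA, 4))

one_allowed  <- keep_rows_max_na(toy, 1)  # rows 1 and 3 survive
none_allowed <- keep_rows_max_na(toy, 0)  # behaves like complete.cases here
```

With `max_na = 0` the helper keeps only fully populated rows, which is the behaviour of complete.cases that the question wanted to relax.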


Fickle 薄情

#4 · 2019-01-04 03:56

This will return a dataset where at most two values per row are missing:

    dfrm[ apply(dfrm, 1, function(r) sum(is.na(r)) <= 2 ) , ]
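For comparison, here is a small sketch on made-up data showing that a row-wise apply and rowSums select the same rows when both count NAs per row. The data frame contents and the names `via_apply` and `via_rowsums` are assumptions for illustration. Note that apply first converts the data frame to a matrix, which is one reason rowSums tends to be the faster choice on large data.

```r
# Made-up numeric data: rows have 2, 4, 3 and 0 NAs respectively
dfrm <- data.frame(
  a = c(1, NA, NA, 4),
  b = c(NA, NA, 2, 5),
  c = c(3, NA, NA, 6),
  d = c(NA, NA, NA, 7)
)

# Row-wise count with apply(): r is one row of the (matrix-converted) data
via_apply <- apply(dfrm, 1, function(r) sum(is.na(r)) <= 2)

# Same test with rowSums() on the logical matrix from is.na()
via_rowsums <- rowSums(is.na(dfrm)) <= 2

# Both logical vectors pick rows 1 and 4
stopifnot(all(via_apply == via_rowsums))
kept <- dfrm[via_apply, ]
```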


叼着烟拽天下

#5 · 2019-01-04 04:04

If d is your data frame, try this:

    d <- d[rowSums(is.na(d)) <= 2, ]

