可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have two data sets that are supposed to be the same size but aren't. I need to trim the values from A that are not in B and vice versa in order to eliminate noise from a graph that's going into a report. (Don't worry, this data isn't being permanently deleted!)
I have read the following:
- Selecting columns in R data frame based on those *not* in a vector
- http://www.ats.ucla.edu/stat/r/faq/subset_R.htm
- How to combine multiple conditions to subset a data-frame using "OR"?
But I'm still not able to get this to work right. Here's my code:
bg2011missingFromBeg <- setdiff(x=eg2011$ID, y=bg2011$ID)
#attempt 1
eg2011cleaned <- subset(eg2011, ID != bg2011missingFromBeg)
#attempt 2
eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg]
The first try just eliminates the first value in the resulting setdiff vector. The second try yields and unwieldy error:
Error in `[.data.frame`(eg2012, !eg2012$ID %in% bg2012missingFromBeg)
: undefined columns selected
回答1:
This will give you what you want:
eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg, ]
The error in your second attempt is because you forgot the ,
In general, for convenience, the specification object[index]
subsets columns for a 2d object
. If you want to subset rows and keep all columns you have to use the specification
object[index_rows, index_columns]
, while index_cols
can be left blank, which will use all columns by default.
However, you still need to include the ,
to indicate that you want to get a subset of rows instead of a subset of columns.
回答2:
If you really just want to subset each data frame by an index that exists in both data frames, you can do this with the 'match' function, like so:
data_A[match(data_B$index, data_A$index, nomatch=0),]
data_B[match(data_A$index, data_B$index, nomatch=0),]
This is, though, the same as:
data_A[data_A$index %in% data_B$index,]
data_B[data_B$index %in% data_A$index,]
Here is a demo:
# Set seed for reproducibility.
set.seed(1)
# Create two sample data sets.
data_A <- data.frame(index=sample(1:200, 90, rep=FALSE), value=runif(90))
data_B <- data.frame(index=sample(1:200, 120, rep=FALSE), value=runif(120))
# Subset data of each data frame by the index in the other.
t_A <- data_A[match(data_B$index, data_A$index, nomatch=0),]
t_B <- data_B[match(data_A$index, data_B$index, nomatch=0),]
# Make sure they match.
data.frame(t_A[order(t_A$index),], t_B[order(t_B$index),])[1:20,]
# index value index.1 value.1
# 27 3 0.7155661 3 0.65887761
# 10 12 0.6049333 12 0.14362694
# 88 14 0.7410786 14 0.42021589
# 56 15 0.4525708 15 0.78101754
# 38 18 0.2075451 18 0.70277874
# 24 23 0.4314737 23 0.78218212
# 34 32 0.1734423 32 0.85508236
# 22 38 0.7317925 38 0.56426384
# 84 39 0.3913593 39 0.09485786
# 5 40 0.7789147 40 0.31248966
# 74 43 0.7799849 43 0.10910096
# 71 45 0.2847905 45 0.26787813
# 57 46 0.1751268 46 0.17719454
# 25 48 0.1482116 48 0.99607737
# 81 53 0.6304141 53 0.26721208
# 60 58 0.8645449 58 0.96920881
# 30 59 0.6401010 59 0.67371223
# 75 61 0.8806190 61 0.69882454
# 63 64 0.3287773 64 0.36918946
# 19 70 0.9240745 70 0.11350771
回答3:
Really human comprehensible example (as this is the first time I am using %in%), how to compare two data frames and keep only rows containing the equal values in specific column:
# Set seed for reproducibility.
set.seed(1)
# Create two sample data frames.
data_A <- data.frame(id=c(1,2,3), value=c(1,2,3))
data_B <- data.frame(id=c(1,2,3,4), value=c(5,6,7,8))
# compare data frames by specific columns and keep only
# the rows with equal values
data_A[data_A$id %in% data_B$id,] # will keep data in data_A
data_B[data_B$id %in% data_A$id,] # will keep data in data_b
Results:
> data_A[data_A$id %in% data_B$id,]
id value
1 1 1
2 2 2
3 3 3
> data_B[data_B$id %in% data_A$id,]
id value
1 1 5
2 2 6
3 3 7
回答4:
Per the comments to the original post, merges / joins are well-suited for this problem. In particular, an inner join will return only values that are present in both dataframes, making thesetdiff
statement unnecessary.
Using the data from Dinre's example:
In base R:
cleanedA <- merge(data_A, data_B[, "index"], by = 1, sort = FALSE)
cleanedB <- merge(data_B, data_A[, "index"], by = 1, sort = FALSE)
Using the dplyr package:
library(dplyr)
cleanedA <- inner_join(data_A, data_B %>% select(index))
cleanedB <- inner_join(data_B, data_A %>% select(index))
To keep the data as two separate tables, each containing only its own variables, this subsets the unwanted table to only its index variable before joining. Then no new variables are added to the resulting table.