I have the following two data.frames:
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
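I want to find the rows that a1 has and a2 doesn't.
Is there a built-in function for this type of operation?
(P.S.: I did write a solution for it; I am simply curious whether someone has already written more polished code.)
Here is my solution: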
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
rows.in.a1.that.are.not.in.a2 <- function(a1,a2)
{
a1.vec <- apply(a1, 1, paste, collapse = "")
a2.vec <- apply(a2, 1, paste, collapse = "")
a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
return(a1.without.a2.rows)
}
rows.in.a1.that.are.not.in.a2(a1,a2)
Answer 1:
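This doesn't answer your question directly, but it will give you the elements that the two have in common. This can be done with Paul Murrell's compare package: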
library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
# a b
#1 1 a
#2 2 b
#3 3 c
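The compare function gives you a lot of flexibility in terms of which kinds of differences are tolerated (e.g. changing the order of elements within each vector, changing the order and names of variables, shortening variables, changing the case of strings). From this, you should be able to figure out what is missing from one or the other. For example (this is not very elegant):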
difference <-
data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
colnames(difference) <- colnames(a1)
difference
# a b
#1 4 d
#2 5 e
Answer 2:
sqldf provides a nice solution:
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
require(sqldf)
a1NotIna2 <- sqldf('SELECT * FROM a1 EXCEPT SELECT * FROM a2')
And the rows that are in both data frames:
a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT * FROM a2')
The new version of dplyr has a function, anti_join, for exactly these kinds of comparisons:
require(dplyr)
anti_join(a1,a2)
And semi_join to keep the rows of a1 that are also in a2:
semi_join(a1,a2)
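Both joins match on all shared column names by default and print a message saying so. Below is a minimal sketch that names the key columns explicitly through `by`. It assumes dplyr is installed and skips itself otherwise; the result names `only_in_a1` and `in_both` are just illustrative.

```r
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])

# Guarded so the sketch does nothing when dplyr is not available
if (requireNamespace("dplyr", quietly = TRUE)) {
  # Rows of a1 with no match in a2 on columns a and b
  only_in_a1 <- dplyr::anti_join(a1, a2, by = c("a", "b"))
  # Rows of a1 that do have a match in a2 on columns a and b
  in_both <- dplyr::semi_join(a1, a2, by = c("a", "b"))
  print(only_in_a1)
  print(in_both)
}
```

Spelling out `by` also makes it obvious what happens when the two frames share only some of their columns.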
Answer 3:
In dplyr:
setdiff(a1,a2)
Basically, setdiff(bigFrame, smallFrame) gets you the extra records in the first table.
In the SQLverse this is called a left excluding join.
For good descriptions of all the join options and set operations, this is one of the best summaries I have seen put together to date: http://www.vertabelo.com/blog/technical-articles/sql-joins
But back to the question. Here are the results of the setdiff() code when using the OP's data:
> a1
a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
> a2
a b
1 1 a
2 2 b
3 3 c
> setdiff(a1,a2)
a b
1 4 d
2 5 e
Even anti_join(a1,a2) will get you the same results.
For more info: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Answer 4:
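This is certainly not efficient for this particular purpose, but what I often do in these situations is to insert an indicator variable into each data.frame and then merge: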
a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all=TRUE)
Missing values in included_a1 mark the rows that are missing from a1. Similarly for a2.
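To get back to the rows the question asks for, the merged result can be filtered on those missing indicators. A minimal base R sketch (the names `only_in_a1` and `only_in_a2` are just illustrative):

```r
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])

# Tag each frame, then do a full outer merge on the shared columns a and b
a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all = TRUE)

# Rows that came only from a1 have NA in the a2 indicator, and vice versa
only_in_a1 <- res[is.na(res$included_a2), c("a", "b")]
only_in_a2 <- res[is.na(res$included_a1), c("a", "b")]
print(only_in_a1)
```

Note that merge sorts the result on the key columns, so the original row order of a1 is not preserved.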
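One problem with your solution is that the column orders must match. Another problem is that it is easy to imagine situations where rows are coded as identical when in fact they are different. The advantage of using merge is that you get, for free, all the error checking that a good solution needs.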
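The second problem is easy to reproduce. The sketch below uses made-up data (`x` and `y` are hypothetical, not the OP's frames) in which two different rows paste to the same string when no separator is used. The `key` helper is also hypothetical and only illustrates one possible fix: pasting column-wise with a separator that is unlikely to occur in the data.

```r
# Hypothetical data: rows (1, 11) and (11, 1) are different ...
x <- data.frame(a = c(1, 11), b = c(11, 1))
y <- data.frame(a = 1, b = 11)

# ... but with collapse = "" both become the string "111"
x.vec <- apply(x, 1, paste, collapse = "")
y.vec <- apply(y, 1, paste, collapse = "")
naive <- x[!x.vec %in% y.vec, , drop = FALSE]

# Hypothetical helper: build one key per row, with a separator between columns
key <- function(df) do.call(paste, c(df, sep = "\r"))
safer <- x[!key(x) %in% key(y), , drop = FALSE]

print(nrow(naive))  # the row (11, 1) is wrongly treated as present in y
print(safer)
```

A separator only reduces the risk of collisions; a merge or join on the actual columns avoids string keys entirely.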
Answer 5:
I wrote a package ( https://github.com/alexsanjoseph/compareDF ) since I had the same issue.
> df1 <- data.frame(a = 1:5, b=letters[1:5], row = 1:5)
> df2 <- data.frame(a = 1:3, b=letters[1:3], row = 1:3)
> df_compare = compare_df(df1, df2, "row")
> df_compare$comparison_df
row chng_type a b
1 4 + 4 d
2 5 + 5 e
A more complicated example:
library(compareDF)
df1 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
"Hornet 4 Drive", "Duster 360", "Merc 240D"),
id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Mer"),
hp = c(110, 110, 181, 110, 245, 62),
cyl = c(6, 6, 4, 6, 8, 4),
qsec = c(16.46, 17.02, 33.00, 19.44, 15.84, 20.00))
df2 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
"Hornet 4 Drive", " Hornet Sportabout", "Valiant"),
id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Val"),
hp = c(110, 110, 93, 110, 175, 105),
cyl = c(6, 6, 4, 6, 8, 6),
qsec = c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22))
> df_compare$comparison_df
grp chng_type id1 id2 hp cyl qsec
1 1 - Hornet Sportabout Dus 175 8 17.02
2 2 + Datsun 710 Dat 181 4 33.00
3 2 - Datsun 710 Dat 93 4 18.61
4 3 + Duster 360 Dus 245 8 15.84
5 7 + Merc 240D Mer 62 4 20.00
6 8 - Valiant Val 105 6 20.22
The package also has an html_output element for quick checking:
df_compare$html_output
Answer 6:
You can use the daff package (which wraps the daff.js library using the V8 package):
library(daff)
diff_data(data_ref = a2,
data = a1)
This produces the following difference object:
Daff Comparison: ‘a2’ vs. ‘a1’
First 6 and last 6 patch lines:
@@ a b
1 ... ... ...
2 3 c
3 +++ 4 d
4 +++ 5 e
5 ... ... ...
6 ... ... ...
7 3 c
8 +++ 4 d
9 +++ 5 e
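The diff format is described in the Coopy highlighter diff format for tables and should be fairly self-explanatory. The lines with +++ in the first column (@@) are the ones that are new in a1 and not present in a2.
The difference object can be passed to patch_data(), stored for documentation purposes with write_diff(), or visualized with render_diff():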
render_diff(
diff_data(data_ref = a2,
data = a1)
)
This generates neat HTML output.
Answer 7:
Using the diffobj package:
library(diffobj)
diffPrint(a1, a2)
diffObj(a1, a2)
Answer 8:
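I adapted the merge function to get this functionality. On larger data frames it uses less memory than the full merge solution, and I can play with the names of the key columns.
Another solution is to use the prob library.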
# Derived from src/library/base/R/merge.R
# Part of the R package, http://www.R-project.org
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# A copy of the GNU General Public License is available at
# http://www.r-project.org/Licenses/
XinY <-
function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
notin = FALSE, incomparables = NULL,
...)
{
fix.by <- function(by, df)
{
## fix up 'by' to be a valid set of cols by number: 0 is row.names
if(is.null(by)) by <- numeric(0L)
by <- as.vector(by)
nc <- ncol(df)
if(is.character(by))
by <- match(by, c("row.names", names(df))) - 1L
else if(is.numeric(by)) {
if(any(by < 0L) || any(by > nc))
stop("'by' must match numbers of columns")
} else if(is.logical(by)) {
if(length(by) != nc) stop("'by' must match number of columns")
by <- seq_along(by)[by]
} else stop("'by' must specify column(s) as numbers, names or logical")
if(any(is.na(by))) stop("'by' must specify valid column(s)")
unique(by)
}
nx <- nrow(x <- as.data.frame(x)); ny <- nrow(y <- as.data.frame(y))
by.x <- fix.by(by.x, x)
by.y <- fix.by(by.y, y)
if((l.b <- length(by.x)) != length(by.y))
stop("'by.x' and 'by.y' specify different numbers of columns")
if(l.b == 0L) {
## was: stop("no columns to match on")
## returns x
x
}
else {
if(any(by.x == 0L)) {
x <- cbind(Row.names = I(row.names(x)), x)
by.x <- by.x + 1L
}
if(any(by.y == 0L)) {
y <- cbind(Row.names = I(row.names(y)), y)
by.y <- by.y + 1L
}
## create keys from 'by' columns:
if(l.b == 1L) { # (be faster)
bx <- x[, by.x]; if(is.factor(bx)) bx <- as.character(bx)
by <- y[, by.y]; if(is.factor(by)) by <- as.character(by)
} else {
## Do these together for consistency in as.character.
## Use same set of names.
bx <- x[, by.x, drop=FALSE]; by <- y[, by.y, drop=FALSE]
names(bx) <- names(by) <- paste("V", seq_len(ncol(bx)), sep="")
bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
bx <- bz[seq_len(nx)]
by <- bz[nx + seq_len(ny)]
}
comm <- match(bx, by, 0L)
if (notin) {
res <- x[comm == 0,]
} else {
res <- x[comm > 0,]
}
}
## avoid a copy
## row.names(res) <- NULL
attr(res, "row.names") <- .set_row_names(nrow(res))
res
}
XnotinY <-
function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
notin = TRUE, incomparables = NULL,
...)
{
XinY(x,y,by,by.x,by.y,notin,incomparables)
}
Answer 9:
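Your example data does not contain any duplicates, but your solution handles them automatically. This means that some of the answers here may not match the results of your function when there are duplicates.
Here is my solution, which handles duplicates the same way yours does. It also scales great!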
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
rows.in.a1.that.are.not.in.a2 <- function(a1,a2)
{
a1.vec <- apply(a1, 1, paste, collapse = "")
a2.vec <- apply(a2, 1, paste, collapse = "")
a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
return(a1.without.a2.rows)
}
library(data.table)
setDT(a1)
setDT(a2)
# no duplicates - as in example code
r <- fsetdiff(a1, a2)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE
# handling duplicates - make some duplicates
a1 <- rbind(a1, a1, a1)
a2 <- rbind(a2, a2, a2)
r <- fsetdiff(a1, a2, all = TRUE)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE
It needs data.table 1.9.8+.
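To make the duplicate issue concrete, here is a small base R sketch with made-up data (not the OP's frames). A filter based on %in% keeps every copy of a row that is missing from a2, whereas set-style operations such as SQL EXCEPT, or fsetdiff with its default all = FALSE, return each distinct row only once. The `key` helper is hypothetical.

```r
# Hypothetical data: the row (4, "d") appears twice in a1
a1 <- data.frame(a = c(1, 4, 4), b = c("a", "d", "d"))
a2 <- data.frame(a = 1, b = "a")

# Hypothetical helper: one string key per row, columns separated by "\r"
key <- function(df) do.call(paste, c(df, sep = "\r"))

# %in%-style filtering keeps both copies of the missing row
keep_dups <- a1[!key(a1) %in% key(a2), , drop = FALSE]

# Set-style semantics collapse them into a single row
set_style <- unique(keep_dups)

print(nrow(keep_dups))
print(nrow(set_style))
```

Which behaviour is right depends on whether repeated rows carry meaning in your data.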
Answer 10:
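Maybe it is too simplistic, but I use this solution and find it very useful when I have a primary key that I can use to compare the data sets. Hope it helps.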
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
different.names <- (!a1$a %in% a2$a)
not.in.a2 <- a1[different.names,]
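Since this relies on `a` really being a primary key, it may be worth checking that assumption before filtering. A minimal sketch using base R's anyDuplicated (the check itself is an addition, not part of the solution above):

```r
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])

# Stop early if the key column is not unique in either frame
stopifnot(anyDuplicated(a1$a) == 0, anyDuplicated(a2$a) == 0)

# Keep the rows of a1 whose key does not occur in a2
not.in.a2 <- a1[!a1$a %in% a2$a, , drop = FALSE]
print(not.in.a2)
```

Keep in mind that two rows with the same key but different values in `b` would still count as a match here.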
Answer 11:
Yet another solution, based on match_df in plyr. Here is plyr's match_df:
match_df <- function (x, y, on = NULL)
{
if (is.null(on)) {
on <- intersect(names(x), names(y))
message("Matching on: ", paste(on, collapse = ", "))
}
keys <- join.keys(x, y, on)
x[keys$x %in% keys$y, , drop = FALSE]
}
We can modify it to negate the match:
library(plyr)
negate_match_df <- function (x, y, on = NULL)
{
if (is.null(on)) {
on <- intersect(names(x), names(y))
message("Matching on: ", paste(on, collapse = ", "))
}
keys <- join.keys(x, y, on)
x[!(keys$x %in% keys$y), , drop = FALSE]
}
Then:
diff <- negate_match_df(a1,a2)
Answer 12:
Using subset:
missing<-subset(a1, !(a %in% a2$a))
Source: Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2