Remove duplicated rows

2018-12-31 06:50发布

I have read a CSV file into an R data.frame. Some of the rows have the same element in one of the columns. I would like to remove rows that are duplicates in that column. For example:

platform_external_dbus          202           16                     google        1
platform_external_dbus          202           16         space-ghost.verbum        1
platform_external_dbus          202           16                  localhost        1
platform_external_dbus          202           16          users.sourceforge        8
platform_external_dbus          202           16                    hughsie        1

I would like only one of these rows since the others have the same data in the first column.

8条回答
冷夜・残月
2楼-- · 2018-12-31 07:08

just isolate your data frame to the columns you need, then use the unique function :D

# in the above example, you only need the first three columns
deduped.data <- unique( yourdata[ , 1:3 ] )
# the fourth column no longer 'distinguishes' them, 
# so they're duplicates and thrown out.
查看更多
与风俱净
3楼-- · 2018-12-31 07:09

The function distinct() in the dplyr package performs arbitrary duplicate removal, allowing the specification of the duplicated variables (as in this question) or considering all variables.

Data:

dat <- data.frame(a = rep(c(1,2),4), b = rep(LETTERS[1:4],2))

Remove rows where specified columns are duplicated:

library(dplyr)
dat %>% distinct(a, .keep_all = TRUE)

  a b
1 1 A
2 2 B

Remove rows that are complete duplicates of other rows:

dat %>% distinct

  a b
1 1 A
2 2 B
3 1 C
4 2 D
查看更多
余欢
4楼-- · 2018-12-31 07:10

For people who have come here to look for a general answer for duplicate row removal, use !duplicated():

a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
df <-data.frame(a,b)

duplicated(df)
[1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE

> df[duplicated(df), ]
  a b
2 A 1
6 B 1
8 C 2

> df[!duplicated(df), ]
  a b
1 A 1
3 A 2
4 B 4
5 B 1
7 C 2

Answer from: Removing duplicated rows from R data frame

查看更多
其实,你不懂
5楼-- · 2018-12-31 07:13

The data.table package also has unique and duplicated methods of it's own with some additional features.

Both the unique.data.table and the duplicated.data.table methods have an additional by argument which allows you to pass a character or integer vector of column names or their locations respectively

library(data.table)
DT <- data.table(id = c(1,1,1,2,2,2),
                 val = c(10,20,30,10,20,30))

unique(DT, by = "id")
#    id val
# 1:  1  10
# 2:  2  10

duplicated(DT, by = "id")
# [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE

Another important feature of these methods is a huge performance gain for larger data sets

library(microbenchmark)
library(data.table)
set.seed(123)
DF <- as.data.frame(matrix(sample(1e8, 1e5, replace = TRUE), ncol = 10))
DT <- copy(DF)
setDT(DT)

microbenchmark(unique(DF), unique(DT))
# Unit: microseconds
#       expr       min         lq      mean    median        uq       max neval cld
# unique(DF) 44708.230 48981.8445 53062.536 51573.276 52844.591 107032.18   100   b
# unique(DT)   746.855   776.6145  2201.657   864.932   919.489  55986.88   100  a 


microbenchmark(duplicated(DF), duplicated(DT))
# Unit: microseconds
#           expr       min         lq       mean     median        uq        max neval cld
# duplicated(DF) 43786.662 44418.8005 46684.0602 44925.0230 46802.398 109550.170   100   b
# duplicated(DT)   551.982   558.2215   851.0246   639.9795   663.658   5805.243   100  a 
查看更多
刘海飞了
6楼-- · 2018-12-31 07:17

Or you could nest the data in cols 4 and 5 into a single row with tidyr:

library(tidyr)
df %>% nest(V4:V5)

# A tibble: 1 × 4
#                      V1    V2    V3             data
#                  <fctr> <int> <int>           <list>
#1 platform_external_dbus   202    16 <tibble [5 × 2]>

The col 2 and 3 duplicates are now removed for statistical analysis, but you have kept the col 4 and 5 data in a tibble and can go back to the original data frame at any point with unnest().

查看更多
余欢
7楼-- · 2018-12-31 07:29

With sqldf:

# Example by Mehdi Nellen
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
df <-data.frame(a,b)

Solution:

 library(sqldf)
    sqldf('SELECT DISTINCT * FROM df')

Output:

  a b
1 A 1
2 A 2
3 B 4
4 B 1
5 C 2
查看更多
登录 后发表回答