I have read a CSV file into an R data.frame. Some of the rows have the same element in one of the columns. I would like to remove rows that are duplicates in that column. For example:
platform_external_dbus 202 16 google 1
platform_external_dbus 202 16 space-ghost.verbum 1
platform_external_dbus 202 16 localhost 1
platform_external_dbus 202 16 users.sourceforge 8
platform_external_dbus 202 16 hughsie 1
I would like only one of these rows since the others have the same data in the first column.
Just subset your data frame to the columns you need, then use the unique() function :D
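A minimal base R sketch of that idea. The column names (repo, a, b, domain, n) are hypothetical since the question gives none, and the last row is made up so that something other than the duplicates survives:

```r
# Column names are hypothetical; the question does not name them.
df <- data.frame(
  repo   = c(rep("platform_external_dbus", 5), "some_other_repo"),  # last row is made up
  a      = c(202, 202, 202, 202, 202, 7),
  b      = c(16, 16, 16, 16, 16, 3),
  domain = c("google", "space-ghost.verbum", "localhost",
             "users.sourceforge", "hughsie", "example"),
  n      = c(1, 1, 1, 8, 1, 2)
)

# Keep only the columns that define a duplicate, then drop repeated rows.
key_cols <- unique(df[, c("repo", "a", "b")])
print(key_cols)
```

Note that this drops the other columns (domain and n here), so it only fits if you do not need them afterwards.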
The function distinct() in the dplyr package performs arbitrary duplicate removal, allowing the specification of the duplicated variables (as in this question) or considering all variables.
Data:
Remove rows where specified columns are duplicated:
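As a rough sketch, assuming the rows from the question with hypothetical column names (repo, a, b, domain, n) plus one made-up extra row, this could look like:

```r
library(dplyr)

# Column names are hypothetical; the last row is made up for illustration.
df <- data.frame(
  repo   = c(rep("platform_external_dbus", 5), "some_other_repo"),
  a      = c(202, 202, 202, 202, 202, 7),
  b      = c(16, 16, 16, 16, 16, 3),
  domain = c("google", "space-ghost.verbum", "localhost",
             "users.sourceforge", "hughsie", "example"),
  n      = c(1, 1, 1, 8, 1, 2)
)

# Keep the first row for each value of repo; .keep_all = TRUE retains
# the remaining columns instead of returning only repo.
dedup <- distinct(df, repo, .keep_all = TRUE)
print(dedup)
```

Several columns can be listed, e.g. distinct(df, repo, a, .keep_all = TRUE), in which case a row counts as a duplicate only if all of them match.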
Remove rows that are complete duplicates of other rows:
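A small sketch, again with hypothetical column names; the sixth row is an invented exact copy of the first so there is a complete duplicate to remove:

```r
library(dplyr)

# Column names are hypothetical; row 6 is an invented exact copy of row 1.
df <- data.frame(
  repo   = rep("platform_external_dbus", 6),
  a      = rep(202, 6),
  b      = rep(16, 6),
  domain = c("google", "space-ghost.verbum", "localhost",
             "users.sourceforge", "hughsie", "google"),
  n      = c(1, 1, 1, 8, 1, 1)
)

# With no columns named, distinct() compares entire rows.
no_full_dups <- distinct(df)
print(no_full_dups)
```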
For people who have come here to look for a general answer for duplicate row removal, use !duplicated():
Answer from: Removing duplicated rows from R data frame
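A minimal base R sketch of the !duplicated() approach, assuming hypothetical column names and one made-up extra row:

```r
# Column names are hypothetical; the last row is made up for illustration.
df <- data.frame(
  repo   = c(rep("platform_external_dbus", 5), "some_other_repo"),
  a      = c(202, 202, 202, 202, 202, 7),
  b      = c(16, 16, 16, 16, 16, 3),
  domain = c("google", "space-ghost.verbum", "localhost",
             "users.sourceforge", "hughsie", "example"),
  n      = c(1, 1, 1, 8, 1, 2)
)

# duplicated() flags every repeat of a value after its first occurrence,
# so negating it keeps the first row for each repo.
first_per_repo <- df[!duplicated(df$repo), ]
print(first_per_repo)

# Called on the whole data frame, it removes only fully identical rows.
no_full_dups <- df[!duplicated(df), ]
```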
The data.table package also has unique and duplicated methods of its own with some additional features.
Both the unique.data.table and the duplicated.data.table methods have an additional by argument which allows you to pass a character or integer vector of column names or their locations respectively.
Another important feature of these methods is a huge performance gain for larger data sets.
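A short sketch of the by argument, assuming hypothetical column names and one made-up extra row:

```r
library(data.table)

# Column names are hypothetical; the last row is made up for illustration.
dt <- data.table(
  repo   = c(rep("platform_external_dbus", 5), "some_other_repo"),
  a      = c(202, 202, 202, 202, 202, 7),
  b      = c(16, 16, 16, 16, 16, 3),
  domain = c("google", "space-ghost.verbum", "localhost",
             "users.sourceforge", "hughsie", "example"),
  n      = c(1, 1, 1, 8, 1, 2)
)

# by as a character vector of column names.
by_name <- unique(dt, by = "repo")

# by as an integer vector of column positions (repo is column 1).
by_position <- unique(dt, by = 1L)

# The duplicated method takes the same argument.
via_duplicated <- dt[!duplicated(dt, by = "repo")]
print(by_name)
```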
Or you could nest the data in cols 4 and 5 into a single row with tidyr:
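A sketch of the nesting idea, assuming tidyr 1.0.0 or later and hypothetical column names (domain and n stand in for cols 4 and 5; the last row is made up):

```r
library(tidyr)

# Column names are hypothetical; the last row is made up for illustration.
df <- data.frame(
  repo   = c(rep("platform_external_dbus", 5), "some_other_repo"),
  a      = c(202, 202, 202, 202, 202, 7),
  b      = c(16, 16, 16, 16, 16, 3),
  domain = c("google", "space-ghost.verbum", "localhost",
             "users.sourceforge", "hughsie", "example"),
  n      = c(1, 1, 1, 8, 1, 2)
)

# Pack cols 4 and 5 into a list-column named data; the result has one row
# per unique combination of the remaining columns.
nested <- nest(df, data = c(domain, n))
print(nested)

# unnest() expands the list-column back into one row per original record.
restored <- unnest(nested, cols = data)
```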
The col 2 and 3 duplicates are now removed for statistical analysis, but you have kept the col 4 and 5 data in a tibble and can go back to the original data frame at any point with unnest().
With sqldf:
Solution:
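One way this might look, as a sketch only: it assumes hypothetical column names, one made-up extra row, and sqldf's default SQLite backend, which allows selecting non-aggregated columns alongside GROUP BY. Which row SQLite keeps per group is not guaranteed.

```r
library(sqldf)

# Column names are hypothetical; the last row is made up for illustration.
df <- data.frame(
  repo   = c(rep("platform_external_dbus", 5), "some_other_repo"),
  a      = c(202, 202, 202, 202, 202, 7),
  b      = c(16, 16, 16, 16, 16, 3),
  domain = c("google", "space-ghost.verbum", "localhost",
             "users.sourceforge", "hughsie", "example"),
  n      = c(1, 1, 1, 8, 1, 2)
)

# One row per repo; the other columns come from an arbitrary row of each group.
dedup <- sqldf("SELECT * FROM df GROUP BY repo")
print(dedup)
```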
Output: