Is it possible to select all unique values from a column of a data.frame
using select
function in dplyr
library?
Something like "SELECT DISTINCT field1 FROM table1
" in SQL
notation.
Thanks!
Is it possible to select all unique values from a column of a data.frame
using select
function in dplyr
library?
Something like "SELECT DISTINCT field1 FROM table1
" in SQL
notation.
Thanks!
In dplyr 0.3 this can be easily achieved using the distinct()
method.
Here is an example:
distinct_df = df %>% distinct(field1)
You can get a vector of the distinct values with:
distinct_vector = distinct_df$field1
You can also select a subset of columns at the same time as you perform the distinct()
call, which can be cleaner to look at if you examine the data frame using head/tail/glimpse.:
distinct_df = df %>% distinct(field1) %>% select(field1)
distinct_vector = distinct_df$field1
Just to add to the other answers, if you would prefer to return a vector rather than a dataframe, you have the following options:
dplyr < 0.7.0
Enclose the dplyr functions in a parentheses and combine it with $
syntax:
(mtcars %>% distinct(cyl))$cyl
dplyr >= 0.7.0
Use the pull
verb:
mtcars %>% distinct(cyl) %>% pull()
The dplyr
select
function selects specific columns from a data frame. To return unique values in a particular column of data, you can use the group_by
function. For example:
library(dplyr)
# Fake data
set.seed(5)
dat = data.frame(x=sample(1:10,100, replace=TRUE))
# Return the distinct values of x
dat %>%
group_by(x) %>%
summarise()
x
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
If you want to change the column name you can add the following:
dat %>%
group_by(x) %>%
summarise() %>%
select(unique.x=x)
This both selects column x
from among all the columns in the data frame that dplyr
returns (and of course there's only one column in this case) and changes its name to unique.x
.
You can also get the unique values directly in base R
with unique(dat$x)
.
If you have multiple variables and want all unique combinations that appear in the data, you can generalize the above code as follows:
set.seed(5)
dat = data.frame(x=sample(1:10,100, replace=TRUE),
y=sample(letters[1:5], 100, replace=TRUE))
dat %>%
group_by(x,y) %>%
summarise() %>%
select(unique.x=x, unique.y=y)