The number of resulting rows differs depending on whether I use distinct() or unique(). The data set I am working with is huge. I hope the code is easy enough to follow.
dt2a <- select(dt, mutation.genome.position, mutation.cds,
               primary.site, sample.name, mutation.id) %>%
  group_by(mutation.genome.position, mutation.cds, primary.site) %>%
  mutate(occ = nrow(.)) %>%
  select(-sample.name) %>%
  distinct()
dim(dt2a)
[1] 2316382 5
## Using unique instead
dt2b <- select(dt, mutation.genome.position, mutation.cds,
               primary.site, sample.name, mutation.id) %>%
  group_by(mutation.genome.position, mutation.cds, primary.site) %>%
  mutate(occ = nrow(.)) %>%
  select(-sample.name) %>%
  unique()
dim(dt2b)
[1] 2837982 5
This is the file I am working with:
sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/CosmicMutantExport.tsv.gz
dt = fread(fl)
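This appears to be a result of the group_by(). When you use distinct() without indicating which variables to make distinct, it appears to use only the grouping variables, so rows that differ in any other column are collapsed into one. That would explain why distinct() returns fewer rows than unique() here.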
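A minimal sketch for checking this on your own installation. The toy data frame and its column names (g, id) are made up for illustration and are not from the COSMIC file. The result of distinct() on the grouped data is deliberately not stated, because it depends on the dplyr version; newer releases may consider all columns. Calling ungroup() before distinct() should give the same row count as unique() either way.

```r
library(dplyr)

# Hypothetical toy data: `g` is the grouping column, `id` varies within groups.
toy <- data.frame(
  g  = c("a", "a", "a", "b", "b"),
  id = c(1, 1, 2, 3, 3)
)

grouped <- group_by(toy, g)

# distinct() with no variables named, applied to the grouped data.
# On an affected dplyr version this may count one row per group.
n_distinct_grouped <- nrow(distinct(grouped))

# Base unique() on the plain data frame counts rows that differ in any column.
n_unique <- nrow(unique(toy))

# Dropping the grouping first makes distinct() compare every column.
n_distinct_ungrouped <- nrow(distinct(ungroup(grouped)))

print(c(distinct_grouped   = n_distinct_grouped,
        unique             = n_unique,
        distinct_ungrouped = n_distinct_ungrouped))
```

If the first number is smaller than the other two, the grouping is what causes the difference, and adding ungroup() before distinct() in the original pipeline should make dt2a and dt2b agree.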