How to create unique rows in a data frame

2019-08-31 09:35发布

问题:

I have a dataframe where rows are duplicated. I need to create unique rows from this. I tried a couple of options but they don't seem to work

  l1 <-summarise(group_by(l,bowler,wickets),economyRate,d=unique(date))

This works for some rows but also gives the error "Expecting a single value". The dataframe 'l' looks like this

     bowler overs maidens  runs wickets economyRate       date opposition
     (fctr) (int)   (int) (dbl)   (dbl)       (dbl)     (date)      (chr)
1  MA Starc     9       0    51       0        5.67 2010-10-20      India
2  MA Starc     9       0    27       4        3.00 2010-11-07  Sri Lanka
3  MA Starc     9       0    27       4        3.00 2010-11-07  Sri Lanka
4  MA Starc     9       0    27       4        3.00 2010-11-07  Sri Lanka
5  MA Starc     9       0    27       4        3.00 2010-11-07  Sri Lanka
6  MA Starc     6       0    33       2        5.50 2012-02-05      India
7  MA Starc     6       0    33       2        5.50 2012-02-05      India
8  MA Starc    10       0    50       2        5.00 2012-02-10  Sri Lanka
9  MA Starc    10       0    50       2        5.00 2012-02-10  Sri Lanka
10 MA Starc     8       0    49       0        6.12 2012-02-12      India   

The date is unique and can be used to get the rows for which the row can be selected. Please let me know how this can be done.

回答1:

In the example dataset, there are more than one unique elements of 'date' per each 'bowler', 'wickets' combination. One option would be to paste the unique 'date' together

l %>%
    group_by(bowler, wickets) %>% 
    summarise(economyRate= mean(economyRate), d = toString(unique(date)))

Or create 'd' as a list column

l %>%
    group_by(bowler, wickets) %>% 
    summarise(economyRate= mean(economyRate), d = list(unique(date)))

With respect to 'economyRate', I am guessing the OP need the mean of that.

If we need to create a column of unique date in the original dataset, use mutate

l %>% 
    group_by(bowler, wickets) %>%
    mutate(d = list(unique(date)))

As the OP didn't provide the expected output, the below could be also the result

l %>%
     group_by(bowler, wickets) %>% 
     distinct(date)

Or as @Frank mentioned

l %>%
  group_by(bowler,wickets,date) %>%
  slice(1L)


回答2:

If I get the intention of th OP right, he is asking to simply remove the duplicate rows. So, I would use

unique(l1)

That's what ?unique says:

unique returns a vector, data frame or array like x but with duplicate elements/rows removed.



回答3:

Data

l <- read.table(text = "bowler overs maidens runs wickets economyRate date opposition
1 MA_Starc 9 0 51 0 5.67 2010-10-20 India
2 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
3 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
4 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
5 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
6 MA_Starc 6 0 33 2 5.50 2012-02-05 India
7 MA_Starc 6 0 33 2 5.50 2012-02-05 India
8 MA_Starc 10 0 50 2 5.00 2012-02-10 Sri-Lanka
9 MA_Starc 10 0 50 2 5.00 2012-02-10 Sri-Lanka
10 MA_Starc 8 0 49 0 6.12 2012-02-12 India")

Distinct

Use dplyr::distinct to remove duplicated rows.

ldistinct <- distinct(l)
#     bowler overs maidens runs wickets economyRate       date
# 1 MA_Starc     9       0   51       0        5.67 2010-10-20
# 2 MA_Starc     9       0   27       4        3.00 2010-11-07
# 3 MA_Starc     6       0   33       2        5.50 2012-02-05
# 4 MA_Starc    10       0   50       2        5.00 2012-02-10
# 5 MA_Starc     8       0   49       0        6.12 2012-02-12
#   opposition
# 1      India
# 2  Sri-Lanka
# 3      India
# 4  Sri-Lanka
# 5      India
l2 <- summarise(group_by(ldistinct,bowler,wickets),
                economyRate,d=unique(date))
# Error: expecting a single value

But it's not enough here, there are still many dates for one combination of bowler and wickets.

Collapse values together

By pasting multiple values together you will see that there are many dates and many economyRate for a single combination of bowler and wickets.

l3 <- summarise(group_by(l,bowler,wickets),
                economyRate = paste(unique(economyRate),collapse=", "),
                d=paste(unique(date),collapse=", "))
l3

#     bowler wickets economyRate                      d
#     (fctr)   (int)       (chr)                  (chr)
# 1 MA_Starc       0  5.67, 6.12 2010-10-20, 2012-02-12
# 2 MA_Starc       2      5.5, 5 2012-02-05, 2012-02-10
# 3 MA_Starc       4           3             2010-11-07


回答4:

So, I took an unusual route to doing this disection, but I let the date remain a factor when it came over from the csv file I created. you could easily the date column to a factor with

l1$date<-as.factor(l1$date)

This will make that row a non-date row, you could also convert to character, either will work fine. This is what it looks like structurally.

 str(l1)
'data.frame':   10 obs. of  10 variables:
 $ bowler     : Factor w/ 2 levels "(fctr)","MA": 2 2 2 2 2 2 2 2 2 2
 $ overs      : Factor w/ 2 levels "(int)","Starc": 2 2 2 2 2 2 2 2 2 2
 $ maidens    : Factor w/ 5 levels "(int)","10","6",..: 5 5 5 5 5 3 3 2 2 4
 $ runs       : Factor w/ 2 levels "(dbl)","0": 2 2 2 2 2 2 2 2 2 2
 $ wickets    : Factor w/ 6 levels "(dbl)","27","33",..: 6 2 2 2 2 3 3 5 5 4
 $ economyRate: Factor w/ 4 levels "(dbl)","0","2",..: 2 4 4 4 4 3 3 3 3 2
 $ date       : Factor w/ 6 levels "(date)","3","5",..: 5 2 2 2 2 4 4 3 3 6
 $ opposition : Factor w/ 6 levels "(chr)","10/20/2010",..: 2 3 3 3 3 6 6 4 4 5
 $ X.1        : Factor w/ 3 levels "","India","Sri": 2 3 3 3 3 2 2 3 3 2
 $ X.2        : Factor w/ 2 levels "","Lanka": 1 2 2 2 2 1 1 2 2 1

After that it is about making sure that you are using the sub-setting grammar properly with the most concise query:

l2<-l1[!duplicated(l1$date),]

And this is what is returned, 5 rows of unique data:

   bowler overs maidens runs wickets economyRate date opposition   X.1   X.2
2      MA Starc       9    0      51           0 5.67 10/20/2010 India      
3      MA Starc       9    0      27           4    3  11/7/2010   Sri Lanka
7      MA Starc       6    0      33           2  5.5   2/5/2012 India      
9      MA Starc      10    0      50           2    5  2/10/2012   Sri Lanka
11     MA Starc       8    0      49           0 6.12  2/12/2012 India   

The only thing you need to be careful of is to keep that comma after the !duplicated(l1$date) to be sure that ALL columns are searched and included in the final subset.

If you want dates or characters you can as.POSIXct or as.character convert them to a usable format for the rest of your manipulation.

I hope this is useful to you!



标签: r dplyr