可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have a dataframe where rows are duplicated. I need to create unique rows from this. I tried a couple of options but they don't seem to work
l1 <-summarise(group_by(l,bowler,wickets),economyRate,d=unique(date))
This works for some rows but also gives the error "Expecting a single value". The dataframe 'l' looks like this
bowler overs maidens runs wickets economyRate date opposition
(fctr) (int) (int) (dbl) (dbl) (dbl) (date) (chr)
1 MA Starc 9 0 51 0 5.67 2010-10-20 India
2 MA Starc 9 0 27 4 3.00 2010-11-07 Sri Lanka
3 MA Starc 9 0 27 4 3.00 2010-11-07 Sri Lanka
4 MA Starc 9 0 27 4 3.00 2010-11-07 Sri Lanka
5 MA Starc 9 0 27 4 3.00 2010-11-07 Sri Lanka
6 MA Starc 6 0 33 2 5.50 2012-02-05 India
7 MA Starc 6 0 33 2 5.50 2012-02-05 India
8 MA Starc 10 0 50 2 5.00 2012-02-10 Sri Lanka
9 MA Starc 10 0 50 2 5.00 2012-02-10 Sri Lanka
10 MA Starc 8 0 49 0 6.12 2012-02-12 India
The date is unique and can be used to get the rows for which the row can be selected. Please let me know how this can be done.
回答1:
In the example dataset, there are more than one unique
elements of 'date' per each 'bowler', 'wickets' combination. One option would be to paste
the unique
'date' together
l %>%
group_by(bowler, wickets) %>%
summarise(economyRate= mean(economyRate), d = toString(unique(date)))
Or create 'd' as a list
column
l %>%
group_by(bowler, wickets) %>%
summarise(economyRate= mean(economyRate), d = list(unique(date)))
With respect to 'economyRate', I am guessing the OP need the mean
of that.
If we need to create a column of unique
date in the original dataset, use mutate
l %>%
group_by(bowler, wickets) %>%
mutate(d = list(unique(date)))
As the OP didn't provide the expected output, the below could be also the result
l %>%
group_by(bowler, wickets) %>%
distinct(date)
Or as @Frank mentioned
l %>%
group_by(bowler,wickets,date) %>%
slice(1L)
回答2:
If I get the intention of th OP right, he is asking to simply remove the duplicate rows. So, I would use
unique(l1)
That's what ?unique
says:
unique returns a vector, data frame or array like x but with duplicate elements/rows removed.
回答3:
Data
l <- read.table(text = "bowler overs maidens runs wickets economyRate date opposition
1 MA_Starc 9 0 51 0 5.67 2010-10-20 India
2 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
3 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
4 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
5 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
6 MA_Starc 6 0 33 2 5.50 2012-02-05 India
7 MA_Starc 6 0 33 2 5.50 2012-02-05 India
8 MA_Starc 10 0 50 2 5.00 2012-02-10 Sri-Lanka
9 MA_Starc 10 0 50 2 5.00 2012-02-10 Sri-Lanka
10 MA_Starc 8 0 49 0 6.12 2012-02-12 India")
Distinct
Use dplyr::distinct to remove duplicated rows.
ldistinct <- distinct(l)
# bowler overs maidens runs wickets economyRate date
# 1 MA_Starc 9 0 51 0 5.67 2010-10-20
# 2 MA_Starc 9 0 27 4 3.00 2010-11-07
# 3 MA_Starc 6 0 33 2 5.50 2012-02-05
# 4 MA_Starc 10 0 50 2 5.00 2012-02-10
# 5 MA_Starc 8 0 49 0 6.12 2012-02-12
# opposition
# 1 India
# 2 Sri-Lanka
# 3 India
# 4 Sri-Lanka
# 5 India
l2 <- summarise(group_by(ldistinct,bowler,wickets),
economyRate,d=unique(date))
# Error: expecting a single value
But it's not enough here, there are still many dates for
one combination of bowler and wickets.
Collapse values together
By pasting multiple values together you will see that there are many dates and many economyRate for a single combination of bowler and wickets.
l3 <- summarise(group_by(l,bowler,wickets),
economyRate = paste(unique(economyRate),collapse=", "),
d=paste(unique(date),collapse=", "))
l3
# bowler wickets economyRate d
# (fctr) (int) (chr) (chr)
# 1 MA_Starc 0 5.67, 6.12 2010-10-20, 2012-02-12
# 2 MA_Starc 2 5.5, 5 2012-02-05, 2012-02-10
# 3 MA_Starc 4 3 2010-11-07
回答4:
So, I took an unusual route to doing this disection, but I let the date remain a factor when it came over from the csv file I created. you could easily the date column to a factor with
l1$date<-as.factor(l1$date)
This will make that row a non-date row, you could also convert to character, either will work fine. This is what it looks like structurally.
str(l1)
'data.frame': 10 obs. of 10 variables:
$ bowler : Factor w/ 2 levels "(fctr)","MA": 2 2 2 2 2 2 2 2 2 2
$ overs : Factor w/ 2 levels "(int)","Starc": 2 2 2 2 2 2 2 2 2 2
$ maidens : Factor w/ 5 levels "(int)","10","6",..: 5 5 5 5 5 3 3 2 2 4
$ runs : Factor w/ 2 levels "(dbl)","0": 2 2 2 2 2 2 2 2 2 2
$ wickets : Factor w/ 6 levels "(dbl)","27","33",..: 6 2 2 2 2 3 3 5 5 4
$ economyRate: Factor w/ 4 levels "(dbl)","0","2",..: 2 4 4 4 4 3 3 3 3 2
$ date : Factor w/ 6 levels "(date)","3","5",..: 5 2 2 2 2 4 4 3 3 6
$ opposition : Factor w/ 6 levels "(chr)","10/20/2010",..: 2 3 3 3 3 6 6 4 4 5
$ X.1 : Factor w/ 3 levels "","India","Sri": 2 3 3 3 3 2 2 3 3 2
$ X.2 : Factor w/ 2 levels "","Lanka": 1 2 2 2 2 1 1 2 2 1
After that it is about making sure that you are using the sub-setting grammar properly with the most concise query:
l2<-l1[!duplicated(l1$date),]
And this is what is returned, 5 rows of unique data:
bowler overs maidens runs wickets economyRate date opposition X.1 X.2
2 MA Starc 9 0 51 0 5.67 10/20/2010 India
3 MA Starc 9 0 27 4 3 11/7/2010 Sri Lanka
7 MA Starc 6 0 33 2 5.5 2/5/2012 India
9 MA Starc 10 0 50 2 5 2/10/2012 Sri Lanka
11 MA Starc 8 0 49 0 6.12 2/12/2012 India
The only thing you need to be careful of is to keep that comma after the !duplicated(l1$date)
to be sure that ALL columns are searched and included in the final subset.
If you want dates or characters you can as.POSIXct
or as.character
convert them to a usable format for the rest of your manipulation.
I hope this is useful to you!