Using spread with duplicate identifiers for rows

2020-01-23 12:51发布

I have a long form dataframe that have multiple entries for same date and person.

jj <- data.frame(month=rep(1:3,4),
             student=rep(c("Amy", "Bob"), each=6),
             A=c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
             B=c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5))

I want to convert it to wide form and make it like this:

month Amy.A Bob.A Amy.B Bob.B
1     
2     
3
1
2
3
1
2
3
1
2
3

My question is very similar to this. I have used the given code in the answer :

kk <- jj %>% 
  gather(variable, value, -(month:student)) %>% 
  unite(temp, student, variable) %>% 
  spread(temp, value)

but it gives following error:

Error: Duplicate identifiers for rows (1, 4), (2, 5), (3, 6), (13, 16), (14, 17), (15, 18), (7, 10), (8, 11), (9, 12), (19, 22), (20, 23), (21, 24)

Thanks in advance. Note: I don't want to delete multiple entries.

标签: r dplyr tidyr
4条回答
Bombasti
2楼-- · 2020-01-23 13:15

If we create a unique sequence, then we can the output in the correct format with pivot_wider

library(dplyr)
library(tidyr)
jj %>%
   group_by(month, student) %>% 
   mutate(rn = row_number()) %>%
   pivot_wider(names_from = 'student', values_from = c('A', 'B'), 
          names_sep='.')  %>% 
   select(-rn)
# A tibble: 6 x 5
# Groups:   month [3]
#  month A.Amy A.Bob B.Amy B.Bob
#  <int> <dbl> <dbl> <dbl> <dbl>
#1     1     9     3     6     5
#2     2     7     2     7     4
#3     3     6     1     8     6
#4     1     8     5     5     3
#5     2     6     6     6     1
#6     3     9     5     7     5

data

jj <- structure(list(month = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
1L, 2L, 3L), student = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L), .Label = c("Amy", "Bob"), class = "factor"), 
    A = c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5), B = c(6, 7, 8, 
    5, 6, 7, 5, 4, 6, 3, 1, 5)), class = "data.frame", row.names = c(NA, 
-12L))
查看更多
再贱就再见
3楼-- · 2020-01-23 13:27

The issue is the two columns for both A and B. If we can make that one value column, we can spread the data as you would like. Take a look at the output for jj_melt when you use the code below.

library(reshape2)
jj_melt <- melt(jj, id=c("month", "student"))
jj_spread <- dcast(jj_melt, month ~ student + variable, value.var="value", fun=sum)
#   month Amy_A Amy_B Bob_A Bob_B
# 1     1    17    11     8     8
# 2     2    13    13     8     5
# 3     3    15    15     6    11

I won't mark this as a duplicate since the other question did not summarize by sum, but the data.table answer could help with one additional argument, fun=sum:

library(data.table)
dcast(setDT(jj), month ~ student, value.var=c("A", "B"), fun=sum)
#    month A_sum_Amy A_sum_Bob B_sum_Amy B_sum_Bob
# 1:     1        17         8        11         8
# 2:     2        13         8        13         5
# 3:     3        15         6        15        11

If you would like to use the tidyr solution, combine it with dcast to summarize by sum.

as.data.frame(jj)
library(tidyr)
jj %>% 
  gather(variable, value, -(month:student)) %>%
  unite(temp, student, variable) %>%
  dcast(month ~ temp, fun=sum)
#   month Amy_A Amy_B Bob_A Bob_B
# 1     1    17    11     8     8
# 2     2    13    13     8     5
# 3     3    15    15     6    11

Edit

Based on your new requirements, I have added an activity column.

library(dplyr)
jj %>% group_by(month, student) %>% 
  mutate(id=1:n()) %>%
  melt(id=c("month", "id", "student")) %>%
  dcast(... ~ student + variable, value.var="value")
#   month id Amy_A Amy_B Bob_A Bob_B
# 1     1  1     9     6     3     5
# 2     1  2     8     5     5     3
# 3     2  1     7     7     2     4
# 4     2  2     6     6     6     1
# 5     3  1     6     8     1     6
# 6     3  2     9     7     5     5

The other solutions can also be used. Here I added an optional expression to arrange the final output by activity number:

library(tidyr)
jj %>% 
  gather(variable, value, -(month:student)) %>%
  unite(temp, student, variable) %>%
  group_by(temp) %>%
  mutate(id=1:n()) %>%
  dcast(... ~ temp) %>%
  arrange(id)
#   month id Amy_A Amy_B Bob_A Bob_B
# 1     1  1     9     6     3     5
# 2     2  2     7     7     2     4
# 3     3  3     6     8     1     6
# 4     1  4     8     5     5     3
# 5     2  5     6     6     6     1
# 6     3  6     9     7     5     5

The data.table syntax is compact because it allows for multiple value.var columns and will take care of the spread for us. We can then skip the melt -> cast process.

library(data.table)
setDT(jj)[, activityID := rowid(student)]
dcast(jj, ... ~ student, value.var=c("A", "B"))
#    month activityID A_Amy A_Bob B_Amy B_Bob
# 1:     1          1     9     3     6     5
# 2:     1          4     8     5     5     3
# 3:     2          2     7     2     7     4
# 4:     2          5     6     6     6     1
# 5:     3          3     6     1     8     6
# 6:     3          6     9     5     7     5
查看更多
Summer. ? 凉城
4楼-- · 2020-01-23 13:28

Your answer was missing mutate id! Here is the solution using dplyr packge only.

jj %>% 
  gather(variable, value, -(month:student)) %>% 
  unite(temp, student, variable) %>% 
  group_by(temp) %>% 
  mutate(id=1:n()) %>% 
  spread(temp, value) 
#  A tibble: 6 x 6
#  month    id Amy_A Amy_B Bob_A Bob_B
# * <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1     1     1     9     6     3     5
# 2     1     4     8     5     5     3
# 3     2     2     7     7     2     4
# 4     2     5     6     6     6     1
# 5     3     3     6     8     1     6
# 6     3     6     9     7     5     5
查看更多
叼着烟拽天下
5楼-- · 2020-01-23 13:32

Since tidyr 1.0.0 pivot_wider is the recommended replacement of spread and you could do the following :

jj <- data.frame(month=rep(1:3,4),
                 student=rep(c("Amy", "Bob"), each=6),
                 A=c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
                 B=c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5))

library(tidyr)

pivot_wider(
  jj,
  names_from = "student",
  values_from = c("A","B"),
  names_sep = ".",
  values_fn = list(A= list, B= list)) %>%
  unchop(everything())
#> # A tibble: 6 x 5
#>   month A.Amy A.Bob B.Amy B.Bob
#>   <int> <dbl> <dbl> <dbl> <dbl>
#> 1     1     9     3     6     5
#> 2     1     8     5     5     3
#> 3     2     7     2     7     4
#> 4     2     6     6     6     1
#> 5     3     6     1     8     6
#> 6     3     9     5     7     5

Created on 2019-09-14 by the reprex package (v0.3.0)

The twist in this problem is that month is not unique by student, to solve this :

  • values_fn = list(A= list, B= list)) puts the multiple values in a list
  • unchop(everything()) unnest the lists vertically, you can use unnest as well here
查看更多
登录 后发表回答