可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have a data set containing (amongst others) multiple columns with dates and corresponding values (repeated measurements). Is there a way to turn this into a long data set containing (the others and) only two columns - one for dates and one for values - using tidyr
?
The following code produces an example data frame:
df <- data.frame(
id = 1:10,
age = sample(100, 10),
date1 = as.Date('2015-09-22') - sample(100, 10),
value1 = sample(100, 10),
date2 = as.Date('2015-09-22') - sample(100, 10),
value2 = sample(100, 10),
date3 = as.Date('2015-09-22') - sample(100, 10),
value3 = sample(100, 10))
The input table could (chance of 1
in 1.8x10^138
) look like this:
id age date1 value1 date2 value2 date3 value3
1 1 32 2015-08-01 37 2015-07-15 38 2015-09-09 81
2 2 33 2015-07-22 16 2015-06-26 1 2015-09-12 58
...
10 10 64 2015-07-23 78 2015-08-25 70 2015-08-05 90
What I finally want is this:
id age date value
1 1 32 2015-08-01 37
2 1 32 2015-07-15 38
3 1 32 2015-09-09 81
4 2 33 2015-07-22 16
5 2 33 2015-06-26 1
...
30 10 64 2015-08-05 90
Any help doing this in tidyr
or reshape
would be greatly appreciated.
回答1:
There should be some efficient way, but this is one way.
Working separately for date and value,
#for date
df.date<-df%>%select(id, age,date1,date2, date3)%>%melt(id.var=c("id", "age"), value.name="date")
#for val
df.val<-df%>%select(id, age,value1,value2, value3)%>%melt(id.var=c("id", "age"), value.name="value")
Now join,
df2<-full_join(df.date, df.val, by=c("id", "age"))
df2%>%select(-variable.x, -variable.y)
id age date value
1 1 40 2015-07-19 28
2 1 40 2015-07-19 49
3 1 40 2015-07-19 24
4 2 33 2015-06-27 99
5 2 33 2015-06-27 18
6 2 33 2015-06-27 26
7 3 75 2015-07-07 63
8 3 75 2015-07-07 74
9 3 75 2015-07-07 72
回答2:
I stumbled across this trying to learn about using gather
with a mix of dates and values.
The existing answers lose information about which instance the date-value pair came from, ie, instance 1 for date1 & value1, etc. This may not be important, but here's a tidyverse option that keeps the instance.
library(stringr) # not necessary but nice
library(tidyr)
library(dplyr)
df %>%
gather(key, val, -id, -age) %>%
mutate(
measure = str_sub(key,1,-2),
instance = str_sub(key, -1)
) %>%
select(-key) %>%
spread(measure, val) %>%
mutate(date = as.Date(date, origin="1970-01-01")) # restore date class
回答3:
I had the exact same question and data format for a dataset I was working on. Crowdsourced the answer at work. A couple of us came up with a single tidyr
and dplyr
pipeline solution. Using the same simulated df from original question.
df %>%
gather(key = date_position, value = date, starts_with("date")) %>%
gather(key = value_position, value = value, starts_with("value")) %>%
mutate(date_position = gsub('[^0-9]', "", date_position),
value_position = gsub('[^0-9]', "", value_position)) %>%
filter(date_position == value_position) %>%
select(-ends_with("position")) %>%
arrange(id)
回答4:
The same strategy but using tidyr
instead looks as follows:
df.value <- df %>%
gather(key="foo", value="value", starts_with("value"))
df.date <- df %>%
gather(key="bar", value="date", starts_with("date"))
After controlling the resulting dimensions (careful with NA
values - there is also a na.rm
argument to the gather
function) I joined the data.frames using base/dplyr functions:
df.long <- data.frame(select(df.value, id, age, value), select(df.date, date))
I am certain there is a much more elgant way to both parts, but it did the trick.
回答5:
This does a reshape
and then sorts the rows.
The first two lines just set up the v.names
and varying
arguments to reshape
. v.names
defines the new column names and varying
, is a list whose two components contain logical selection vectors of the date
and value
columns respectively.
The last line of code does the sorting and can be omitted if the row order does not matter.
No packages are used.
v.names <- c("date", "value")
varying <- lapply(v.names, startsWith, x = names(df))
r <- reshape(df, dir = "long", varying = varying, v.names = v.names)
r[order(r$id, r$time), ]
giving the following where the id and time columns relate the output rows back to the input:
id age time date value
1.1 1 12 1 2015-08-14 3
1.2 1 12 2 2015-07-11 24
1.3 1 12 3 2015-07-04 4
2.1 2 92 1 2015-08-03 17
2.2 2 92 2 2015-07-19 52
2.3 2 92 3 2015-07-01 93
3.1 3 28 1 2015-08-24 86
3.2 3 28 2 2015-08-12 80
3.3 3 28 3 2015-09-01 56
4.1 4 45 1 2015-09-13 78
4.2 4 45 2 2015-07-07 92
4.3 4 45 3 2015-08-10 81
5.1 5 25 1 2015-08-27 95
5.2 5 25 2 2015-09-08 68
5.3 5 25 3 2015-06-27 82
6.1 6 1 1 2015-08-21 16
6.2 6 1 2 2015-06-15 35
6.3 6 1 3 2015-07-24 30
7.1 7 7 1 2015-07-19 59
7.2 7 7 2 2015-07-08 33
7.3 7 7 3 2015-08-11 49
8.1 8 71 1 2015-07-28 19
8.2 8 71 2 2015-06-29 74
8.3 8 71 3 2015-08-05 25
9.1 9 59 1 2015-07-05 64
9.2 9 59 2 2015-09-04 30
9.3 9 59 3 2015-07-30 74
10.1 10 96 1 2015-09-12 69
10.2 10 96 2 2015-07-23 72
10.3 10 96 3 2015-08-19 23