可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
This question already has an answer here:
-
Transpose / reshape dataframe without “timevar” from long to wide format
6 answers
I'm new to R and have a simple question, as I'm still learning the style of R data manipulation/management.
I have a dataset of observations of basic clinical features (blood pressure, cholesterol, etc) over a period of time. Each observation has a patient ID and date, but are entered as separate line items. Something like this:
Patient ID Date Blood Pressure
1 21/1/14 120
1 19/3/14 134
1 3/5/14 127
I want to transform the data such that for a given variable (e.g. blood pressure), I have a data frame with one line per patient and all of the blood pressure values observed throughout time in chronological order. Something like this:
Patient ID BP1 BP2 BP3
1 120 134 127
I want to do this because I want to be able to write code to select the mean of the first three observed blood pressures, for example.
Any advice or reading recommendations would greatly be appreciated.
回答1:
You can achieve the desired formatting by reshaping your data with a number of methods including using the reshape()
function in Base R or dcast()
in the reshape2
package, but it might be easier to just be able to get to your answer directly using a form of aggregation. Here's one method using ddply()
from the plyr
package:
library(plyr)
df <- read.table(text="id date bp
1 21/1/14 120
1 19/3/14 134
1 3/5/14 127",header=TRUE)
df1 <- ddply(df, .(id), summarize, mean.bp = mean(bp[1:3]))
df1
# id mean.bp
# 1 1 127
Of course, if you really just want to do what you asked about, you can do the following:
library(reshape2)
df$bp.id <- ave(df$id,df$id,FUN=function(x) paste0("BP",seq(along=x)))
df2 <- dcast(df[df$bp.id %in% paste0("BP",1:3)], id~bp.id, value.var="bp")
df2
# id BP1 BP2 BP3
# 1 1 120 134 127
回答2:
# example dataframe
id <- c(rep(1:4,25))
date <- c(rep("21/01/14",30),rep("21/01/14",30),rep("22/01/14",30),rep("23/01/14",10))
bp <- c(rnorm(100,100))
df <- data.frame(id,date,bp)
# reorder the dataframe
library(dplyr)
df2 <- group_by(df,id) # group by id
df2 <- arrange(df2, date) # order each group by date
df3 <- mutate(df2, # add a colum with ascending number per for each group
c = 1:length(date))
# use dcast
library(reshape2)
dcast(df3[,c(1,4,3)],id~c)
回答3:
With the data.table package (which has an improved implementation of the melt
and dcast
functions of reshape2) you could do this as follows:
newdf <- dcast(setDT(df)[, idx := 1:.N, by = id], id ~ paste0("bp",idx), value.var = "bp")
Or utilizing the new rowid
function:
newdf <- dcast(setDT(df), id ~ rowid(prefix="bp",id), value.var = "bp")
both options give the same result:
> newdf
id bp1 bp2 bp3
1: 1 120 134 129
2: 2 110 124 119
But as @SamDickson said, when you want to calculate the mean of (for example) the first two blood pressure measurements, then you can also add a new variable to your existing dataframe df
with:
# using base R
df$first2mn <- ave(df$bp, df$id, FUN = function(x) mean(x[1:2]))
# using data.table
setDT(df)[, first2mn := mean(bp[1:2]), id]
which both give:
> df
id date bp first2mn
1: 1 21/1/14 120 127
2: 1 19/3/14 134 127
3: 1 3/5/14 129 127
4: 2 21/1/14 110 117
5: 2 19/3/14 124 117
6: 2 3/5/14 119 117
Or just calculate the mean with:
# using base R
aggregate(bp ~ id, df, function(x) mean(x[1:2]))
# using data.table
setDT(df)[, .(bp = mean(bp[1:2])), id]
which both give:
id bp
1 1 127
2 2 117
Used data:
df <- read.table(text="id date bp
1 21/1/14 120
1 19/3/14 134
1 3/5/14 129
2 21/1/14 110
2 19/3/14 124
2 3/5/14 119", header=TRUE)
回答4:
The other answers have provided a number of methods for calculating the mean across groups. A related post provides a number of methods for calculating group level maxima. You would need to replace max
with mean
in these answers.
Here is an additional method for reshaping wide using the base R function reshape
.
Use the data.frame provided by @jaap, add a variable for counting the observations by ID:
df$times <- ave(df$bp, df$id, FUN=seq_along)
Now, perform the reshape, dropping the unwanted date variable:
reshape(df, direction="wide", drop="date", timevar="times")
id bp.1 bp.2 bp.3
1 1 120 134 129
4 2 110 124 119