I am new to data analysis with R. I recently got a pre-formatted environmental observation-model dataset, an example subset of which is shown below:
date site obs mod site obs mod
2000-09-01 00:00:00 campus NA 61.63 city centre 66 56.69
2000-09-01 01:00:00 campus 52 62.55 city centre NA 54.75
2000-09-01 02:00:00 campus 52 63.52 city centre 56 54.65
Basically, the data include the time series of hourly observed and modelled concentrations of a pollutant at various sites in "reoccurring columns", i.e., site - obs - mod (in the example I only showed 2 out of the total 75 sites). I read this "wide" dataset in as a data frame, and wanted to reshape it into the "narrower" format as:
date site obs mod
2000-09-01 00:00:00 campus NA 61.63
2000-09-01 01:00:00 campus 52 62.55
2000-09-01 02:00:00 campus 52 63.52
2000-09-01 00:00:00 city centre 66 56.69
2000-09-01 01:00:00 city centre NA 54.75
2000-09-01 02:00:00 city centre 56 54.65
I believed that I should use the package "reshape2" to do this. Firstly I tried to melt and then dcast the dataset:
test.melt <- melt(test.data, id.vars = "date", measure.vars = c("site", "obs", "mod"))
However, it only returned half of the data, i.e., records of the site(s) ("city centre") following the first one ("campus") were all cut off:
date variable value
2001-01-01 00:00:00 site campus
2001-01-01 01:00:00 site campus
2001-01-01 02:00:00 site campus
2001-01-01 00:00:00 obs NA
2001-01-01 01:00:00 obs 52
2001-01-01 02:00:00 obs 52
2001-01-01 00:00:00 mod 61.63
2001-01-01 01:00:00 mod 62.55
2001-01-01 02:00:00 mod 63.52
I then tried recast:
test.recast <- recast(test.data, date ~ site + obs + mod)
However, it returned with error message:
Error in eval(expr, envir, enclos) : object 'site' not found
I have tried to search for previous questions but have not found similar scenarios (correct me if I am wrong). Could someone please help me with this?
Many thanks in advance!
You might be better off using base R reshape after doing some variable name cleanup.
Here's your data.
test <- read.table(header = TRUE, stringsAsFactors=FALSE,
text = "date site obs mod site obs mod
'2000-09-01 00:00:00' campus NA 61.63 'city centre' 66 56.69
'2000-09-01 01:00:00' campus 52 62.55 'city centre' NA 54.75
'2000-09-01 02:00:00' campus 52 63.52 'city centre' 56 54.65")
test
# date site obs mod site.1 obs.1 mod.1
# 1 2000-09-01 00:00:00 campus NA 61.63 city centre 66 56.69
# 2 2000-09-01 01:00:00 campus 52 62.55 city centre NA 54.75
# 3 2000-09-01 02:00:00 campus 52 63.52 city centre 56 54.65
If you did this correctly, you should get names like I got: as @chase mentions in his answer, "recurring column names is a bit of an oddity and is not normal R behaviour"--so we've got to fix that.
Note: Both of these options generate a "time" variable which you can go ahead and drop. You might want to keep it just in case you wanted to reshape back into a wide format.
Option 1: If you got names like I did (which you should have), the solution is simple. For the first site, just append "0" to the site name and use base R reshape:
names(test)[2:4] <- paste(names(test)[2:4], "0", sep=".")
test <- reshape(test, direction = "long",
idvar = "date", varying = 2:ncol(test))
rownames(test) <- NULL # reshape makes UGLY rownames
test
# date time site obs mod
# 1 2000-09-01 00:00:00 0 campus NA 61.63
# 2 2000-09-01 01:00:00 0 campus 52 62.55
# 3 2000-09-01 02:00:00 0 campus 52 63.52
# 4 2000-09-01 00:00:00 1 city centre 66 56.69
# 5 2000-09-01 01:00:00 1 city centre NA 54.75
# 6 2000-09-01 02:00:00 1 city centre 56 54.65
Option 2: If you really do have duplicated column names, the fix is still easy, and follows the same logic. First, create nicer column names (easy to do using rep()
), and then use reshape()
as described above.
names(test)[-1] <- paste(names(test)[-1],
rep(1:((ncol(test)-1)/3), each = 3), sep = ".")
test <- reshape(test, direction = "long",
idvar = "date", varying = 2:ncol(test))
rownames(test) <- NULL
### Or, more convenient:
# names(test) <- make.unique(names(test))
# names(test)[2:4] <- paste(names(test)[2:4], "0", sep=".")
# test <- reshape(test, direction = "long",
# idvar = "date", varying = 2:ncol(test))
# rownames(test) <- NULL
Optional step: The data in this form are still not totally "long". If that is required, all that is required is one more step:
require(reshape2)
melt(test, id.vars = c("date", "site", "time"))
# date site time variable value
# 1 2000-09-01 00:00:00 campus 0 obs NA
# 2 2000-09-01 01:00:00 campus 0 obs 52.00
# 3 2000-09-01 02:00:00 campus 0 obs 52.00
# 4 2000-09-01 00:00:00 city centre 1 obs 66.00
# 5 2000-09-01 01:00:00 city centre 1 obs NA
# 6 2000-09-01 02:00:00 city centre 1 obs 56.00
# 7 2000-09-01 00:00:00 campus 0 mod 61.63
# 8 2000-09-01 01:00:00 campus 0 mod 62.55
# 9 2000-09-01 02:00:00 campus 0 mod 63.52
# 10 2000-09-01 00:00:00 city centre 1 mod 56.69
# 11 2000-09-01 01:00:00 city centre 1 mod 54.75
# 12 2000-09-01 02:00:00 city centre 1 mod 54.65
Update (to try to address some questions from the comments)
The reshape()
documentation is quite confusing. It's best to work through a few examples to get an understanding of how it works. Specifically, "time" does not have to refer to time ("date" in your problem), but is more for, say, panel data, where records are collected at different times for the same ID. In your case, the only "id" in the original data is the "date" column. The other potential "id" is the site, but not in the way the data are organized.
Imagine, for a moment, if your data looked like this:
test1 <- structure(list(date = structure(1:3,
.Label = c("2000-09-01 00:00:00",
"2000-09-01 01:00:00", "2000-09-01 02:00:00"), class = "factor"),
obs.campus = c(NA, 52L, 52L), mod.campus = c(61.63, 62.55,
63.52), obs.cityCentre = c(66L, NA, 56L), mod.cityCentre = c(56.69,
54.75, 54.65)), .Names = c("date", "obs.campus", "mod.campus",
"obs.cityCentre", "mod.cityCentre"), class = "data.frame", row.names = c(NA,
-3L))
test1
# date obs.campus mod.campus obs.cityCentre mod.cityCentre
# 1 2000-09-01 00:00:00 NA 61.63 66 56.69
# 2 2000-09-01 01:00:00 52 62.55 NA 54.75
# 3 2000-09-01 02:00:00 52 63.52 56 54.65
Now try reshape(test1, direction = "long", idvar = "date", varying = 2:ncol(test1))
. You'll see that reshape()
sees the site names as "time" (that can be overridden by adding "timevar = "site"
" to your reshape
command).
When direction = "long"
, you must specify which columns vary with "time". In your case, that is all the columns except for the first, hence my use of 2:ncol(test)
for "varying
".
test2
? Where is that?
Question under @Chase's answer: I think you misunderstand how melt()
is supposed to work. Basically, it tries to get you the "skinniest" form of your data. In this case, the skinniest form would be the "optional step" described above since date
+ site
would be the minimum required to comprise a unique ID variable. (I would say that "time
" can safely be dropped.)
Once your data are in the format described in the "optional step" (we'll assume that the output has been stored as "test.melt
", you can always easily pivot the table around in different ways. As a demonstration of what I mean by that, try the following and see what they do.
dcast(test.melt, date + site ~ variable)
dcast(test.melt, date ~ variable + site)
dcast(test.melt, variable + site ~ date)
dcast(test.melt, variable + date ~ site)
It is not easy to have that flexibility if you stop at "Option 1" or "Option 2".
Update (a few years later)
melt
from "data.table" can now "melt" multiple columns in a similar way that reshape
does. It should work whether or not the column names are duplicated.
You can try the following:
measure <- c("site", "obs", "mod")
melt(as.data.table(test), measure.vars = patterns(measure), value.name = measure)
# date variable site obs mod
# 1: 2000-09-01 00:00:00 1 campus NA 61.63
# 2: 2000-09-01 01:00:00 1 campus 52 62.55
# 3: 2000-09-01 02:00:00 1 campus 52 63.52
# 4: 2000-09-01 00:00:00 2 city centre 66 56.69
# 5: 2000-09-01 01:00:00 2 city centre NA 54.75
# 6: 2000-09-01 02:00:00 2 city centre 56 54.65
The fact that you have recurring column names is a bit of an oddity and is not normal R behaviour. Most of the time R forces you to have valid names via the make.names()
function. Regardless, I'm able to duplicate your problem. Note I made my own example since yours isn't reproducible, but the logic is the same.
#Do not force unique names
s <- data.frame(id = 1:3, x = runif(3), x = runif(3), check.names = FALSE)
#-----
id x x
1 1 0.6845270 0.5218344
2 2 0.7662200 0.6179444
3 3 0.4110043 0.1104774
#Now try to melt, note that 1/2 of your x-values are missing!
melt(s, id.vars = 1)
#-----
id variable value
1 1 x 0.6845270
2 2 x 0.7662200
3 3 x 0.4110043
The solution is to make your column names unique. As I said before, R does this by default in most cases. However, you can do it after the fact via make.unique()
names(s) <- make.unique(names(s))
#-----
[1] "id" "x" "x.1"
Note that the second column of x now has a 1 appended to it. Now melt()
works as you'd expect:
melt(s, id.vars = 1)
#-----
id variable value
1 1 x 0.6845270
2 2 x 0.7662200
3 3 x 0.4110043
4 1 x.1 0.5218344
5 2 x.1 0.6179444
6 3 x.1 0.1104774
At this point, if you want to treat x
and x.1
as the same variable, I think a little gsub()
or other regex function to get rid of the offending characters. THis is a workflow I use quite often.