How to reshape a dataframe with “reoccurring” colu

2020-02-05 10:51发布

问题:

I am new to data analysis with R. I recently got a pre-formatted environmental observation-model dataset, an example subset of which is shown below:

date                   site      obs    mod      site           obs    mod
2000-09-01 00:00:00    campus    NA     61.63    city centre    66     56.69
2000-09-01 01:00:00    campus    52     62.55    city centre    NA     54.75
2000-09-01 02:00:00    campus    52     63.52    city centre    56     54.65

Basically, the data include the time series of hourly observed and modelled concentrations of a pollutant at various sites in "reoccurring columns", i.e., site - obs - mod (in the example I only showed 2 out of the total 75 sites). I read this "wide" dataset in as a data frame, and wanted to reshape it into the "narrower" format as:

date                   site           obs    mod
2000-09-01 00:00:00    campus         NA     61.63
2000-09-01 01:00:00    campus         52     62.55
2000-09-01 02:00:00    campus         52     63.52
2000-09-01 00:00:00    city centre    66     56.69
2000-09-01 01:00:00    city centre    NA     54.75
2000-09-01 02:00:00    city centre    56     54.65

I believed that I should use the package "reshape2" to do this. Firstly I tried to melt and then dcast the dataset:

test.melt <- melt(test.data, id.vars = "date", measure.vars = c("site", "obs", "mod"))

However, it only returned half of the data, i.e., records of the site(s) ("city centre") following the first one ("campus") were all cut off:

date                   variable    value
2001-01-01 00:00:00    site        campus
2001-01-01 01:00:00    site        campus
2001-01-01 02:00:00    site        campus
2001-01-01 00:00:00    obs         NA
2001-01-01 01:00:00    obs         52
2001-01-01 02:00:00    obs         52
2001-01-01 00:00:00    mod         61.63
2001-01-01 01:00:00    mod         62.55
2001-01-01 02:00:00    mod         63.52

I then tried recast:

test.recast <- recast(test.data, date ~ site + obs + mod)

However, it returned with error message:

Error in eval(expr, envir, enclos) : object 'site' not found

I have tried to search for previous questions but have not found similar scenarios (correct me if I am wrong). Could someone please help me with this?

Many thanks in advance!

回答1:

You might be better off using base R reshape after doing some variable name cleanup.

Here's your data.

test <- read.table(header = TRUE, stringsAsFactors=FALSE,
text = "date             site  obs    mod    site             obs    mod
'2000-09-01 00:00:00'  campus   NA  61.63    'city centre'    66     56.69
'2000-09-01 01:00:00'  campus   52  62.55    'city centre'    NA     54.75
'2000-09-01 02:00:00'  campus   52  63.52    'city centre'    56     54.65")
test
#                  date   site obs   mod      site.1 obs.1 mod.1
# 1 2000-09-01 00:00:00 campus  NA 61.63 city centre    66 56.69
# 2 2000-09-01 01:00:00 campus  52 62.55 city centre    NA 54.75
# 3 2000-09-01 02:00:00 campus  52 63.52 city centre    56 54.65

If you did this correctly, you should get names like I got: as @chase mentions in his answer, "recurring column names is a bit of an oddity and is not normal R behaviour"--so we've got to fix that.

Note: Both of these options generate a "time" variable which you can go ahead and drop. You might want to keep it just in case you wanted to reshape back into a wide format.

  • Option 1: If you got names like I did (which you should have), the solution is simple. For the first site, just append "0" to the site name and use base R reshape:

    names(test)[2:4] <- paste(names(test)[2:4], "0", sep=".")
    test <- reshape(test, direction = "long", 
                    idvar = "date", varying = 2:ncol(test))
    rownames(test) <- NULL # reshape makes UGLY rownames
    test
    #                  date time        site obs   mod
    # 1 2000-09-01 00:00:00    0      campus  NA 61.63
    # 2 2000-09-01 01:00:00    0      campus  52 62.55
    # 3 2000-09-01 02:00:00    0      campus  52 63.52
    # 4 2000-09-01 00:00:00    1 city centre  66 56.69
    # 5 2000-09-01 01:00:00    1 city centre  NA 54.75
    # 6 2000-09-01 02:00:00    1 city centre  56 54.65
    
  • Option 2: If you really do have duplicated column names, the fix is still easy, and follows the same logic. First, create nicer column names (easy to do using rep()), and then use reshape() as described above.

    names(test)[-1] <- paste(names(test)[-1], 
                             rep(1:((ncol(test)-1)/3), each = 3), sep = ".")
    test <- reshape(test, direction = "long", 
                    idvar = "date", varying = 2:ncol(test))
    rownames(test) <- NULL
    
    ### Or, more convenient:
    # names(test) <- make.unique(names(test))
    # names(test)[2:4] <- paste(names(test)[2:4], "0", sep=".")
    # test <- reshape(test, direction = "long", 
    #                 idvar = "date", varying = 2:ncol(test))
    # rownames(test) <- NULL
    
  • Optional step: The data in this form are still not totally "long". If that is required, all that is required is one more step:

    require(reshape2)
    melt(test, id.vars = c("date", "site", "time"))
    #                   date        site time variable value
    # 1  2000-09-01 00:00:00      campus    0      obs    NA
    # 2  2000-09-01 01:00:00      campus    0      obs 52.00
    # 3  2000-09-01 02:00:00      campus    0      obs 52.00
    # 4  2000-09-01 00:00:00 city centre    1      obs 66.00
    # 5  2000-09-01 01:00:00 city centre    1      obs    NA
    # 6  2000-09-01 02:00:00 city centre    1      obs 56.00
    # 7  2000-09-01 00:00:00      campus    0      mod 61.63
    # 8  2000-09-01 01:00:00      campus    0      mod 62.55
    # 9  2000-09-01 02:00:00      campus    0      mod 63.52
    # 10 2000-09-01 00:00:00 city centre    1      mod 56.69
    # 11 2000-09-01 01:00:00 city centre    1      mod 54.75
    # 12 2000-09-01 02:00:00 city centre    1      mod 54.65
    

Update (to try to address some questions from the comments)

  1. The reshape() documentation is quite confusing. It's best to work through a few examples to get an understanding of how it works. Specifically, "time" does not have to refer to time ("date" in your problem), but is more for, say, panel data, where records are collected at different times for the same ID. In your case, the only "id" in the original data is the "date" column. The other potential "id" is the site, but not in the way the data are organized.

    Imagine, for a moment, if your data looked like this:

    test1 <- structure(list(date = structure(1:3, 
        .Label = c("2000-09-01 00:00:00", 
        "2000-09-01 01:00:00", "2000-09-01 02:00:00"), class = "factor"), 
        obs.campus = c(NA, 52L, 52L), mod.campus = c(61.63, 62.55, 
        63.52), obs.cityCentre = c(66L, NA, 56L), mod.cityCentre = c(56.69, 
        54.75, 54.65)), .Names = c("date", "obs.campus", "mod.campus", 
    "obs.cityCentre", "mod.cityCentre"), class = "data.frame", row.names = c(NA, 
    -3L))
    test1
    #                  date obs.campus mod.campus obs.cityCentre mod.cityCentre
    # 1 2000-09-01 00:00:00         NA      61.63             66          56.69
    # 2 2000-09-01 01:00:00         52      62.55             NA          54.75
    # 3 2000-09-01 02:00:00         52      63.52             56          54.65
    

    Now try reshape(test1, direction = "long", idvar = "date", varying = 2:ncol(test1)). You'll see that reshape() sees the site names as "time" (that can be overridden by adding "timevar = "site"" to your reshape command).

    When direction = "long", you must specify which columns vary with "time". In your case, that is all the columns except for the first, hence my use of 2:ncol(test) for "varying".

  2. test2? Where is that?

  3. Question under @Chase's answer: I think you misunderstand how melt() is supposed to work. Basically, it tries to get you the "skinniest" form of your data. In this case, the skinniest form would be the "optional step" described above since date + site would be the minimum required to comprise a unique ID variable. (I would say that "time" can safely be dropped.)

    Once your data are in the format described in the "optional step" (we'll assume that the output has been stored as "test.melt", you can always easily pivot the table around in different ways. As a demonstration of what I mean by that, try the following and see what they do.

    dcast(test.melt, date + site ~ variable)
    dcast(test.melt, date ~ variable + site)
    dcast(test.melt, variable + site ~ date)
    dcast(test.melt, variable + date ~ site)
    

    It is not easy to have that flexibility if you stop at "Option 1" or "Option 2".


Update (a few years later)

melt from "data.table" can now "melt" multiple columns in a similar way that reshape does. It should work whether or not the column names are duplicated.

You can try the following:

measure <- c("site", "obs", "mod")
melt(as.data.table(test), measure.vars = patterns(measure), value.name = measure)
#                   date variable        site obs   mod
# 1: 2000-09-01 00:00:00        1      campus  NA 61.63
# 2: 2000-09-01 01:00:00        1      campus  52 62.55
# 3: 2000-09-01 02:00:00        1      campus  52 63.52
# 4: 2000-09-01 00:00:00        2 city centre  66 56.69
# 5: 2000-09-01 01:00:00        2 city centre  NA 54.75
# 6: 2000-09-01 02:00:00        2 city centre  56 54.65


回答2:

The fact that you have recurring column names is a bit of an oddity and is not normal R behaviour. Most of the time R forces you to have valid names via the make.names() function. Regardless, I'm able to duplicate your problem. Note I made my own example since yours isn't reproducible, but the logic is the same.

#Do not force unique names
s <- data.frame(id = 1:3, x = runif(3), x = runif(3), check.names = FALSE)
#-----
  id         x         x
1  1 0.6845270 0.5218344
2  2 0.7662200 0.6179444
3  3 0.4110043 0.1104774

#Now try to melt, note that 1/2 of your x-values are missing!
melt(s, id.vars = 1)
#-----
  id variable     value
1  1        x 0.6845270
2  2        x 0.7662200
3  3        x 0.4110043

The solution is to make your column names unique. As I said before, R does this by default in most cases. However, you can do it after the fact via make.unique()

names(s) <- make.unique(names(s))
#-----
[1] "id"  "x"   "x.1"

Note that the second column of x now has a 1 appended to it. Now melt() works as you'd expect:

melt(s, id.vars = 1)
#-----
  id variable     value
1  1        x 0.6845270
2  2        x 0.7662200
3  3        x 0.4110043
4  1      x.1 0.5218344
5  2      x.1 0.6179444
6  3      x.1 0.1104774

At this point, if you want to treat x and x.1 as the same variable, I think a little gsub() or other regex function to get rid of the offending characters. THis is a workflow I use quite often.