Reshape large dataset with multiple columns from w

2019-05-30 14:34发布

I have a very large dataset that I need to reshape from wide to long.

My datset looks something like:

  COMPANY   PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010 ... REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010 ... COSTSDEC2016
COMPANY A PRODUCT 1            6400           11050              6550               10600         8500        10400           9100             9850
COMPANY A PRODUCT 2            2700            3000              2800                3800         2850         2400           3100             3250
COMPANY B PRODUCT 3            5900            4150              5750                3750         4200         6100           2950             4600
COMPANY B PRODUCT 4             550             600                 0                 650          200          700            100              500
COMPANY B PRODUCT 5            1500            3750               550                2100         1850         1700           3150              450
COMPANY C PRODUCT 6           19300           17250             23600               21250        18200        26950          18200            23900

And I want them to look like:

  COMPANY    PRODUCT    DATE  REVENUES  COSTS
COMPANY A  PRODUCT 1  Dec-16     10600   9850
COMPANY A  PRODUCT 1  Feb-10     11050  10400
COMPANY A  PRODUCT 1  Jan-10      6400   8500
COMPANY A  PRODUCT 1  Mar-10      6550   9100
COMPANY A  PRODUCT 2  Dec-16      3800   3250
COMPANY A  PRODUCT 2  Feb-10      3000   2400
COMPANY A  PRODUCT 2  Jan-10      2700   2850
COMPANY A  PRODUCT 2  Mar-10      2800   3100
COMPANY B  PRODUCT 3  Dec-16      3750   4600
COMPANY B  PRODUCT 3  Feb-10      4150   6100
COMPANY B  PRODUCT 3  Jan-10      5900   4200
COMPANY B  PRODUCT 3  Mar-10      5750   2950
COMPANY B  PRODUCT 4  Dec-16       650    500
COMPANY B  PRODUCT 4  Feb-10       600    700
COMPANY B  PRODUCT 4  Jan-10       550    200
COMPANY B  PRODUCT 4  Mar-10         0    100
COMPANY B  PRODUCT 5  Dec-16      2100    450
COMPANY B  PRODUCT 5  Feb-10      3750   1700
COMPANY B  PRODUCT 5  Jan-10      1500   1850
COMPANY B  PRODUCT 5  Mar-10       550   3150
COMPANY C  PRODUCT 6  Dec-16     21250  23900
COMPANY C  PRODUCT 6  Feb-10     17250  26950
COMPANY C  PRODUCT 6  Jan-10     19300  18200
COMPANY C  PRODUCT 6  Mar-10     23600  18200

In Stata I would type reshape long REVENUES COSTS, i(COMPANY PRODUCT) j(DATE) string

How do I do this in R?

标签: r reshape
6条回答
冷血范
2楼-- · 2019-05-30 15:13

Here's another option using tidyverse and stringr:

yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
                   product = paste("Product", 1:6),
                   REVENUESJan2010 = round(runif(6, 500, 3000)),
                   REVENUESFeb2010 = round(runif(6, 500, 3000)),
                   REVENUESMar2010 = round(runif(6, 500, 3000)),
                   REVENUESApr2010 = round(runif(6, 500, 3000)),
                   COSTSJan2010 = round(runif(6, 500, 3000)),
                   COSTSFeb2010 = round(runif(6, 500, 3000)),
                   COSTSMar2010 = round(runif(6, 500, 3000)),
                   COSTSApr2010 = round(runif(6, 500, 3000)))

The solution using tidyverse and stringr:

library(tidyverse)
library(stringr)

newData <- yourData %>%
   gather(key = rev.cost.date, value, -company, -product) %>%
   mutate(finance.type = ifelse(str_detect(rev.cost.date, fixed("REVENUES")), "REVENUES", "COSTS")) %>%
   mutate(date = str_replace(rev.cost.date, "REVENUES|COSTS", "")) %>%
   select(-rev.cost.date) %>%
   spread(value = value, key = finance.type) %>%
   mutate(date = paste0(str_sub(date, 0, 3), "-", str_sub(date, 4,8))
查看更多
在下西门庆
3楼-- · 2019-05-30 15:13

As of version 1.9.6 (on CRAN 19 Sep 2015), data.table can melt multiple columns simultaneously (using the patterns() function). So, the columns starting with REVENUES and COSTS can be gathered into two separate columns.

In addition, the dates (months) are packed into the columns names without separator. These are extracted from the column names using a regular expression with look-behind and are used to replace the factor levels of the DATE column.

library(data.table)
library(magrittr)
cols <- c("REVENUES", "COSTS")
long <- melt(wide, measure.vars = patterns(cols), value.name = cols, variable.name = "DATE")
months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit() 
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
      COMPANY   PRODUCT      DATE REVENUES COSTS
 1: COMPANY A PRODUCT 1   JAN2010     6400  8500
 2: COMPANY A PRODUCT 2   JAN2010     2700  2850
 3: COMPANY B PRODUCT 3   JAN2010     5900  4200
 4: COMPANY B PRODUCT 4   JAN2010      550   200
 5: COMPANY B PRODUCT 5   JAN2010     1500  1850
 6: COMPANY C PRODUCT 6   JAN2010    19300 18200
 7: COMPANY A PRODUCT 1   FEB2010    11050 10400
 8: COMPANY A PRODUCT 2   FEB2010     3000  2400
 9: COMPANY B PRODUCT 3   FEB2010     4150  6100
10: COMPANY B PRODUCT 4   FEB2010      600   700
11: COMPANY B PRODUCT 5   FEB2010     3750  1700
12: COMPANY C PRODUCT 6   FEB2010    17250 26950
13: COMPANY A PRODUCT 1 MARCH2010     6550  9100
14: COMPANY A PRODUCT 2 MARCH2010     2800  3100
15: COMPANY B PRODUCT 3 MARCH2010     5750  2950
16: COMPANY B PRODUCT 4 MARCH2010        0   100
17: COMPANY B PRODUCT 5 MARCH2010      550  3150
18: COMPANY C PRODUCT 6 MARCH2010    23600 18200
19: COMPANY A PRODUCT 1   DEC2016    10600  9850
20: COMPANY A PRODUCT 2   DEC2016     3800  3250
21: COMPANY B PRODUCT 3   DEC2016     3750  4600
22: COMPANY B PRODUCT 4   DEC2016      650   500
23: COMPANY B PRODUCT 5   DEC2016     2100   450
24: COMPANY C PRODUCT 6   DEC2016    21250 23900
      COMPANY   PRODUCT      DATE REVENUES COSTS

Edit: Using ISO months naming scheme for proper ordering

Using the naming scheme of alphabetic month name and year does not allow to sort the data by DATE properly. DEC2016 come before FEB2010, and FEB2010 before JAN2010. The ISO 8601 nameing convention puts the year first followed by the number of the months.

We can use this naming scheme as follows:

months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit() %>%
  paste0("01", .) %>% lubridate::dmy() %>% format("%Y-%m")
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
      COMPANY   PRODUCT    DATE REVENUES COSTS
 1: COMPANY A PRODUCT 1 2010-01     6400  8500
 2: COMPANY A PRODUCT 2 2010-01     2700  2850
 3: COMPANY B PRODUCT 3 2010-01     5900  4200
 4: COMPANY B PRODUCT 4 2010-01      550   200
 5: COMPANY B PRODUCT 5 2010-01     1500  1850
 6: COMPANY C PRODUCT 6 2010-01    19300 18200
 7: COMPANY A PRODUCT 1 2010-02    11050 10400
 8: COMPANY A PRODUCT 2 2010-02     3000  2400
 9: COMPANY B PRODUCT 3 2010-02     4150  6100
10: COMPANY B PRODUCT 4 2010-02      600   700
11: COMPANY B PRODUCT 5 2010-02     3750  1700
12: COMPANY C PRODUCT 6 2010-02    17250 26950
13: COMPANY A PRODUCT 1 2010-03     6550  9100
14: COMPANY A PRODUCT 2 2010-03     2800  3100
15: COMPANY B PRODUCT 3 2010-03     5750  2950
16: COMPANY B PRODUCT 4 2010-03        0   100
17: COMPANY B PRODUCT 5 2010-03      550  3150
18: COMPANY C PRODUCT 6 2010-03    23600 18200
19: COMPANY A PRODUCT 1 2016-12    10600  9850
20: COMPANY A PRODUCT 2 2016-12     3800  3250
21: COMPANY B PRODUCT 3 2016-12     3750  4600
22: COMPANY B PRODUCT 4 2016-12      650   500
23: COMPANY B PRODUCT 5 2016-12     2100   450
24: COMPANY C PRODUCT 6 2016-12    21250 23900
      COMPANY   PRODUCT    DATE REVENUES COSTS

Data

library(data.table)
wide <- data.table(
readr::read_table(
"  COMPANY   PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010     REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010     COSTSDEC2016
COMPANY A PRODUCT 1            6400           11050              6550               10600         8500        10400           9100             9850
COMPANY A PRODUCT 2            2700            3000              2800                3800         2850         2400           3100             3250
COMPANY B PRODUCT 3            5900            4150              5750                3750         4200         6100           2950             4600
COMPANY B PRODUCT 4             550             600                 0                 650          200          700            100              500
COMPANY B PRODUCT 5            1500            3750               550                2100         1850         1700           3150              450
COMPANY C PRODUCT 6           19300           17250             23600               21250        18200        26950          18200            23900"
))
查看更多
Explosion°爆炸
4楼-- · 2019-05-30 15:23

The tricky thing here is that you have your dates packed into the column names. Those have to be parsed out before you can make the table as you would like it. I have iterated through each column, parsing each sub table column name for the date and type of observation, binding each sub table, then casting on cost/revenue. I'm sure there is a more elegant solution out there.

library(reshape)

## making a table similar to yours here
yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
                       product = paste("Product", 1:6),
                       revenuesJan2010 = round(runif(6, 500, 3000)),
                       revenuesFeb2010 = round(runif(6, 500, 3000)),
                       revenuesMar2010 = round(runif(6, 500, 3000)),
                       revenuesApr2010 = round(runif(6, 500, 3000)),
                       costJan2010 = round(runif(6, 500, 3000)),
                       costFeb2010 = round(runif(6, 500, 3000)),
                       costMar2010 = round(runif(6, 500, 3000)),
                       costApr2010 = round(runif(6, 500, 3000)))

## a function that parses the date from the column name
columnParse <- function(tab){
    colNm   <- names(tab)[3]
    names(tab)[3] <- "value"
    colDate  <- strsplit(colNm, "revenues|cost")[[1]][2]
    colDate  <- gsub("([A-Za-z]+)", "\\1-", colDate)
    tab$date <- colDate
    tab$type <- gsub("(revenues|cost).*", "\\1", colNm)
    return(tab)
}

## running that function against sub tables of your data, then binding
yourDataLong <- do.call(rbind,
                        lapply(3:ncol(yourData),
                               function(x) columnParse(yourData[c(1:2, x)])))

## casting your data on cost/revenue
yourDataCast <- cast(yourDataLong, company+product+date~type, value = "value")
查看更多
啃猪蹄的小仙女
5楼-- · 2019-05-30 15:27

As a stata to r convert who loved reshape in stata, I found tidyr::gather and tidyr::spread to be very intuitive. Gather is basically reshape long and spread is reshape wide.

Here is the code that would change your data to the way you want it:

new_data <- 
gather(data = your-data-frame, 
       key = var_holder,
       value = val_holder,
       -company,
       -product) 

new_data$var_holder <- sub("REVENUE", "cost_", new_data$var_holder)                                     
new_data$var_holder <- sub("COST", "cost_", new_data$var_holder)

new_data <- 
    separate(data = new_data,
             col = var_holder,
             into = c("var", "date")) %>%
    spread(key = var,
           value = val_holder)

And done!

gather works by taking all the variable names specified (or in this, NOT specified, note the two variables preceded by the '-' sign), and puts them under a new variable whose name is specified by "key = ..." (creating new rows as it goes). It then takes the values that fell under those variables and puts them under a single variable whose name is specified by "value = ...".

spread works in the opposite direction. Hope this helps!

查看更多
小情绪 Triste *
6楼-- · 2019-05-30 15:33

There are several other ways to approach this that would be a bit more streamlined than the "tidyverse" options already suggested.

All of the following examples use the sample data from JMT2080AD's answer with set.seed(1) (for reproducibility).

Option 1: Base R's reshape

It's not always the easist function to use, but the reshape function is pretty powerful once you figre it out. In this case, you don't have a sep, which makes things a little bit trickier in the sense that you have to be more specific about things like your resulting variable names and the values that should show up as "times" (by default, they would just sequential numbers).

times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
reshape(yourData, direction = "long", 
        varying = grep("revenues|cost", names(yourData)), sep = "", 
        v.names = c("revenues", "cost"), timevar = "date", times = times)
#             company   product    date revenues cost id
# 1.Jan2010 Company A Product 1 Jan2010     2862 1164  1
# 2.Jan2010 Company A Product 2 Jan2010     2152 1430  2
# 3.Jan2010 Company B Product 3 Jan2010     2073 1932  3
# 4.Jan2010 Company B Product 4 Jan2010      654 2771  4
# 5.Jan2010 Company B Product 5 Jan2010     1015 1004  5
# 6.Jan2010 Company C Product 6 Jan2010      941 2746  6
# ....

That's pretty much what you're looking for, maybe with a bit of difference in the date formatting.

Option 2: data.table

If performance is what you're after, you can look at melt from "data.table", with which you should be able to do something like the following. As with the reshape approach, you'll need to store the "times" to reintroduce the dates after melting the data.

(Note: I know this is very similar to @Uwe's approach.)

library(data.table)
times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
melt(as.data.table(yourData), measure.vars = patterns("revenues", "cost"),
     value.name = c("revenues", "cost"))[
       , variable := factor(variable, labels = times)][]
#       company   product variable revenues cost
#  1: Company A Product 1  Jan2010     1164 1168
#  2: Company A Product 2  Jan2010     1430 1465
#  3: Company B Product 3  Jan2010     1932  533
#  4: Company B Product 4  Jan2010     2771 1456
#  5: Company B Product 5  Jan2010     1004 2674
# ---                                           
# 20: Company A Product 2  Apr2010     2444 1883
# 21: Company B Product 3  Apr2010     2837 1824
# 22: Company B Product 4  Apr2010     1030 2473
# 23: Company B Product 5  Apr2010     2129  558
# 24: Company C Product 6  Apr2010      814 1693

Option 3: merged.stack

My "splitstackshape" pacakge has a function called merged.stack that tries to make this particular kind of reshaping easier to do. With it, you could try:

library(splitstackshape)
merged.stack(yourData, var.stubs = c("revenues", "cost"), sep = "var.stubs")
#       company   product .time_1 revenues cost
#  1: Company A Product 1 Apr2010     1450 2457
#  2: Company A Product 1 Feb2010     2862 1705
#  3: Company A Product 1 Jan2010     1164 1168
#  4: Company A Product 1 Mar2010     2218 2486
#  5: Company A Product 2 Apr2010     2444 1883
#  6: Company A Product 2 Feb2010     2152 1999
#  7: Company A Product 2 Jan2010     1430 1465
#  8: Company A Product 2 Mar2010     1460  770
#  9: Company B Product 3 Apr2010     2837 1824
# 10: Company B Product 3 Feb2010     2073 1734
# ... 

One day, I'll get around to updating the function, which was written before melt in "data.table" could handle a semi-wide output format. I've already come up with a partial solution, but then I stopped fiddling with it.

In fact, using the above linked function, the solution would be a simple:

ReshapeLong_(yourData, c("revenues", "cost"))

Option 4: extract from the "tidyverse"

The other solutions using the tidyverse seem to be going about things in a very strange manner. A better solution would be to use extract to get the data you need into new columns. You'll have to first gather the data into a very long format and then spread the data into a wide format.

Here's the approach I would use:

library(tidyverse)
yourData %>% 
  gather(var, val, -company, -product) %>%
  extract(var, into = c("type", "month", "year"), 
          regex = ("(revenues|cost)(...)(.*)")) %>%
  spread(type, val)
#      company   product month year cost revenues
# 1  Company A Product 1   Apr 2010 2457     1450
# 2  Company A Product 1   Feb 2010 1705     2862
# 3  Company A Product 1   Jan 2010 1168     1164
# 4  Company A Product 1   Mar 2010 2486     2218
# 5  Company A Product 2   Apr 2010 1883     2444
# 6  Company A Product 2   Feb 2010 1999     2152
# ...
查看更多
孤傲高冷的网名
7楼-- · 2019-05-30 15:39

I think the most explicit (i.e. no need to rename variables) way to reshape wide to long in R is to use the base R reshape() function and and specify the varying columns to be "stacked" as a list. See this blog post.

I'll use the data from JMT2080AD's answer and set the seed to set.seed(789).

### Create a list of the variables you want to reshape/stack
reshape.vars <- list(c("revenuesJan2010",   "revenuesFeb2010",  "revenuesMar2010",  "revenuesApr2010"), # revenues
                     c("costJan2010",   "costFeb2010",  "costMar2010",  "costApr2010")) # cost 
### reshape wide to long
reshape(yourData,                      #dataframe
        direction="long",             #wide to long
        varying=reshape.vars, #repeated measures list of indexes for vars to stack/reshape
        timevar="date",              #the repeated measures times
        v.names=c("revenues", "cost")) #the repeated measures names

#     company   product date   revenues cost id
# 1.1 Company A Product 1    1     2250 1574  1
# 2.1 Company A Product 2    1      734 1793  2
# 3.1 Company B Product 3    1      530 1282  3
# 4.1 Company B Product 4    1     1979 1741  4
# 5.1 Company B Product 5    1     1730 2558  5
# 6.1 Company C Product 6    1      550 1757  6
# 1.2 Company A Product 1    2     1932 1048  1
#...
# 5.3 Company B Product 5    3      890 1103  5
# 6.3 Company C Product 6    3     2113 2469  6
# 1.4 Company A Product 1    4     2426 2382  1
# 2.4 Company A Product 2    4      778 2995  2
# 3.4 Company B Product 3    4     1359  989  3
# 4.4 Company B Product 4    4     1618  912  4
# 5.4 Company B Product 5    4      895 2109  5
# 6.4 Company C Product 6    4     1258 2803  6

Using the list method

  • you do not have to rename variables
  • since the variables you want to create are explicitly defined in the list, there are not errors having to do with reshape() inferring which variables should be stacked

I've found that even when there are a 100+ variables to reshape, if renaming them can be cumbersome, then using copy/paste to create the varying variables list does not take that long.

查看更多
登录 后发表回答