可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have a very large dataset that I need to reshape from wide to long.
My datset looks something like:
COMPANY PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010 ... REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010 ... COSTSDEC2016
COMPANY A PRODUCT 1 6400 11050 6550 10600 8500 10400 9100 9850
COMPANY A PRODUCT 2 2700 3000 2800 3800 2850 2400 3100 3250
COMPANY B PRODUCT 3 5900 4150 5750 3750 4200 6100 2950 4600
COMPANY B PRODUCT 4 550 600 0 650 200 700 100 500
COMPANY B PRODUCT 5 1500 3750 550 2100 1850 1700 3150 450
COMPANY C PRODUCT 6 19300 17250 23600 21250 18200 26950 18200 23900
And I want them to look like:
COMPANY PRODUCT DATE REVENUES COSTS
COMPANY A PRODUCT 1 Dec-16 10600 9850
COMPANY A PRODUCT 1 Feb-10 11050 10400
COMPANY A PRODUCT 1 Jan-10 6400 8500
COMPANY A PRODUCT 1 Mar-10 6550 9100
COMPANY A PRODUCT 2 Dec-16 3800 3250
COMPANY A PRODUCT 2 Feb-10 3000 2400
COMPANY A PRODUCT 2 Jan-10 2700 2850
COMPANY A PRODUCT 2 Mar-10 2800 3100
COMPANY B PRODUCT 3 Dec-16 3750 4600
COMPANY B PRODUCT 3 Feb-10 4150 6100
COMPANY B PRODUCT 3 Jan-10 5900 4200
COMPANY B PRODUCT 3 Mar-10 5750 2950
COMPANY B PRODUCT 4 Dec-16 650 500
COMPANY B PRODUCT 4 Feb-10 600 700
COMPANY B PRODUCT 4 Jan-10 550 200
COMPANY B PRODUCT 4 Mar-10 0 100
COMPANY B PRODUCT 5 Dec-16 2100 450
COMPANY B PRODUCT 5 Feb-10 3750 1700
COMPANY B PRODUCT 5 Jan-10 1500 1850
COMPANY B PRODUCT 5 Mar-10 550 3150
COMPANY C PRODUCT 6 Dec-16 21250 23900
COMPANY C PRODUCT 6 Feb-10 17250 26950
COMPANY C PRODUCT 6 Jan-10 19300 18200
COMPANY C PRODUCT 6 Mar-10 23600 18200
In Stata I would type reshape long REVENUES COSTS, i(COMPANY PRODUCT) j(DATE) string
How do I do this in R?
回答1:
There are several other ways to approach this that would be a bit more streamlined than the "tidyverse" options already suggested.
All of the following examples use the sample data from JMT2080AD's answer with set.seed(1)
(for reproducibility).
Option 1: Base R's reshape
It's not always the easist function to use, but the reshape
function is pretty powerful once you figre it out. In this case, you don't have a sep
, which makes things a little bit trickier in the sense that you have to be more specific about things like your resulting variable names and the values that should show up as "times" (by default, they would just sequential numbers).
times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
reshape(yourData, direction = "long",
varying = grep("revenues|cost", names(yourData)), sep = "",
v.names = c("revenues", "cost"), timevar = "date", times = times)
# company product date revenues cost id
# 1.Jan2010 Company A Product 1 Jan2010 2862 1164 1
# 2.Jan2010 Company A Product 2 Jan2010 2152 1430 2
# 3.Jan2010 Company B Product 3 Jan2010 2073 1932 3
# 4.Jan2010 Company B Product 4 Jan2010 654 2771 4
# 5.Jan2010 Company B Product 5 Jan2010 1015 1004 5
# 6.Jan2010 Company C Product 6 Jan2010 941 2746 6
# ....
That's pretty much what you're looking for, maybe with a bit of difference in the date formatting.
Option 2: data.table
If performance is what you're after, you can look at melt
from "data.table", with which you should be able to do something like the following. As with the reshape
approach, you'll need to store the "times" to reintroduce the dates after melt
ing the data.
(Note: I know this is very similar to @Uwe's approach.)
library(data.table)
times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
melt(as.data.table(yourData), measure.vars = patterns("revenues", "cost"),
value.name = c("revenues", "cost"))[
, variable := factor(variable, labels = times)][]
# company product variable revenues cost
# 1: Company A Product 1 Jan2010 1164 1168
# 2: Company A Product 2 Jan2010 1430 1465
# 3: Company B Product 3 Jan2010 1932 533
# 4: Company B Product 4 Jan2010 2771 1456
# 5: Company B Product 5 Jan2010 1004 2674
# ---
# 20: Company A Product 2 Apr2010 2444 1883
# 21: Company B Product 3 Apr2010 2837 1824
# 22: Company B Product 4 Apr2010 1030 2473
# 23: Company B Product 5 Apr2010 2129 558
# 24: Company C Product 6 Apr2010 814 1693
Option 3: merged.stack
My "splitstackshape" pacakge has a function called merged.stack
that tries to make this particular kind of reshaping easier to do. With it, you could try:
library(splitstackshape)
merged.stack(yourData, var.stubs = c("revenues", "cost"), sep = "var.stubs")
# company product .time_1 revenues cost
# 1: Company A Product 1 Apr2010 1450 2457
# 2: Company A Product 1 Feb2010 2862 1705
# 3: Company A Product 1 Jan2010 1164 1168
# 4: Company A Product 1 Mar2010 2218 2486
# 5: Company A Product 2 Apr2010 2444 1883
# 6: Company A Product 2 Feb2010 2152 1999
# 7: Company A Product 2 Jan2010 1430 1465
# 8: Company A Product 2 Mar2010 1460 770
# 9: Company B Product 3 Apr2010 2837 1824
# 10: Company B Product 3 Feb2010 2073 1734
# ...
One day, I'll get around to updating the function, which was written before melt
in "data.table" could handle a semi-wide output format. I've already come up with a partial solution, but then I stopped fiddling with it.
In fact, using the above linked function, the solution would be a simple:
ReshapeLong_(yourData, c("revenues", "cost"))
Option 4: extract
from the "tidyverse"
The other solutions using the tidyverse seem to be going about things in a very strange manner. A better solution would be to use extract
to get the data you need into new columns. You'll have to first gather
the data into a very long format and then spread
the data into a wide format.
Here's the approach I would use:
library(tidyverse)
yourData %>%
gather(var, val, -company, -product) %>%
extract(var, into = c("type", "month", "year"),
regex = ("(revenues|cost)(...)(.*)")) %>%
spread(type, val)
# company product month year cost revenues
# 1 Company A Product 1 Apr 2010 2457 1450
# 2 Company A Product 1 Feb 2010 1705 2862
# 3 Company A Product 1 Jan 2010 1168 1164
# 4 Company A Product 1 Mar 2010 2486 2218
# 5 Company A Product 2 Apr 2010 1883 2444
# 6 Company A Product 2 Feb 2010 1999 2152
# ...
回答2:
The tricky thing here is that you have your dates packed into the column names. Those have to be parsed out before you can make the table as you would like it. I have iterated through each column, parsing each sub table column name for the date and type of observation, binding each sub table, then casting on cost/revenue. I'm sure there is a more elegant solution out there.
library(reshape)
## making a table similar to yours here
yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
product = paste("Product", 1:6),
revenuesJan2010 = round(runif(6, 500, 3000)),
revenuesFeb2010 = round(runif(6, 500, 3000)),
revenuesMar2010 = round(runif(6, 500, 3000)),
revenuesApr2010 = round(runif(6, 500, 3000)),
costJan2010 = round(runif(6, 500, 3000)),
costFeb2010 = round(runif(6, 500, 3000)),
costMar2010 = round(runif(6, 500, 3000)),
costApr2010 = round(runif(6, 500, 3000)))
## a function that parses the date from the column name
columnParse <- function(tab){
colNm <- names(tab)[3]
names(tab)[3] <- "value"
colDate <- strsplit(colNm, "revenues|cost")[[1]][2]
colDate <- gsub("([A-Za-z]+)", "\\1-", colDate)
tab$date <- colDate
tab$type <- gsub("(revenues|cost).*", "\\1", colNm)
return(tab)
}
## running that function against sub tables of your data, then binding
yourDataLong <- do.call(rbind,
lapply(3:ncol(yourData),
function(x) columnParse(yourData[c(1:2, x)])))
## casting your data on cost/revenue
yourDataCast <- cast(yourDataLong, company+product+date~type, value = "value")
回答3:
Here's another option using tidyverse
and stringr
:
yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
product = paste("Product", 1:6),
REVENUESJan2010 = round(runif(6, 500, 3000)),
REVENUESFeb2010 = round(runif(6, 500, 3000)),
REVENUESMar2010 = round(runif(6, 500, 3000)),
REVENUESApr2010 = round(runif(6, 500, 3000)),
COSTSJan2010 = round(runif(6, 500, 3000)),
COSTSFeb2010 = round(runif(6, 500, 3000)),
COSTSMar2010 = round(runif(6, 500, 3000)),
COSTSApr2010 = round(runif(6, 500, 3000)))
The solution using tidyverse
and stringr
:
library(tidyverse)
library(stringr)
newData <- yourData %>%
gather(key = rev.cost.date, value, -company, -product) %>%
mutate(finance.type = ifelse(str_detect(rev.cost.date, fixed("REVENUES")), "REVENUES", "COSTS")) %>%
mutate(date = str_replace(rev.cost.date, "REVENUES|COSTS", "")) %>%
select(-rev.cost.date) %>%
spread(value = value, key = finance.type) %>%
mutate(date = paste0(str_sub(date, 0, 3), "-", str_sub(date, 4,8))
回答4:
As of version 1.9.6 (on CRAN 19 Sep 2015), data.table
can melt multiple columns simultaneously (using the patterns()
function). So, the columns starting with REVENUES
and COSTS
can be gathered into two separate columns.
In addition, the dates (months) are packed into the columns names without separator. These are extracted from the column names using a regular expression with look-behind and are used to replace the factor levels of the DATE
column.
library(data.table)
library(magrittr)
cols <- c("REVENUES", "COSTS")
long <- melt(wide, measure.vars = patterns(cols), value.name = cols, variable.name = "DATE")
months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit()
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
COMPANY PRODUCT DATE REVENUES COSTS
1: COMPANY A PRODUCT 1 JAN2010 6400 8500
2: COMPANY A PRODUCT 2 JAN2010 2700 2850
3: COMPANY B PRODUCT 3 JAN2010 5900 4200
4: COMPANY B PRODUCT 4 JAN2010 550 200
5: COMPANY B PRODUCT 5 JAN2010 1500 1850
6: COMPANY C PRODUCT 6 JAN2010 19300 18200
7: COMPANY A PRODUCT 1 FEB2010 11050 10400
8: COMPANY A PRODUCT 2 FEB2010 3000 2400
9: COMPANY B PRODUCT 3 FEB2010 4150 6100
10: COMPANY B PRODUCT 4 FEB2010 600 700
11: COMPANY B PRODUCT 5 FEB2010 3750 1700
12: COMPANY C PRODUCT 6 FEB2010 17250 26950
13: COMPANY A PRODUCT 1 MARCH2010 6550 9100
14: COMPANY A PRODUCT 2 MARCH2010 2800 3100
15: COMPANY B PRODUCT 3 MARCH2010 5750 2950
16: COMPANY B PRODUCT 4 MARCH2010 0 100
17: COMPANY B PRODUCT 5 MARCH2010 550 3150
18: COMPANY C PRODUCT 6 MARCH2010 23600 18200
19: COMPANY A PRODUCT 1 DEC2016 10600 9850
20: COMPANY A PRODUCT 2 DEC2016 3800 3250
21: COMPANY B PRODUCT 3 DEC2016 3750 4600
22: COMPANY B PRODUCT 4 DEC2016 650 500
23: COMPANY B PRODUCT 5 DEC2016 2100 450
24: COMPANY C PRODUCT 6 DEC2016 21250 23900
COMPANY PRODUCT DATE REVENUES COSTS
Edit: Using ISO months naming scheme for proper ordering
Using the naming scheme of alphabetic month name and year does not allow to sort the data by DATE
properly. DEC2016
come before FEB2010
, and FEB2010
before JAN2010
. The ISO 8601 nameing convention puts the year first followed by the number of the months.
We can use this naming scheme as follows:
months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit() %>%
paste0("01", .) %>% lubridate::dmy() %>% format("%Y-%m")
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
COMPANY PRODUCT DATE REVENUES COSTS
1: COMPANY A PRODUCT 1 2010-01 6400 8500
2: COMPANY A PRODUCT 2 2010-01 2700 2850
3: COMPANY B PRODUCT 3 2010-01 5900 4200
4: COMPANY B PRODUCT 4 2010-01 550 200
5: COMPANY B PRODUCT 5 2010-01 1500 1850
6: COMPANY C PRODUCT 6 2010-01 19300 18200
7: COMPANY A PRODUCT 1 2010-02 11050 10400
8: COMPANY A PRODUCT 2 2010-02 3000 2400
9: COMPANY B PRODUCT 3 2010-02 4150 6100
10: COMPANY B PRODUCT 4 2010-02 600 700
11: COMPANY B PRODUCT 5 2010-02 3750 1700
12: COMPANY C PRODUCT 6 2010-02 17250 26950
13: COMPANY A PRODUCT 1 2010-03 6550 9100
14: COMPANY A PRODUCT 2 2010-03 2800 3100
15: COMPANY B PRODUCT 3 2010-03 5750 2950
16: COMPANY B PRODUCT 4 2010-03 0 100
17: COMPANY B PRODUCT 5 2010-03 550 3150
18: COMPANY C PRODUCT 6 2010-03 23600 18200
19: COMPANY A PRODUCT 1 2016-12 10600 9850
20: COMPANY A PRODUCT 2 2016-12 3800 3250
21: COMPANY B PRODUCT 3 2016-12 3750 4600
22: COMPANY B PRODUCT 4 2016-12 650 500
23: COMPANY B PRODUCT 5 2016-12 2100 450
24: COMPANY C PRODUCT 6 2016-12 21250 23900
COMPANY PRODUCT DATE REVENUES COSTS
Data
library(data.table)
wide <- data.table(
readr::read_table(
" COMPANY PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010 REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010 COSTSDEC2016
COMPANY A PRODUCT 1 6400 11050 6550 10600 8500 10400 9100 9850
COMPANY A PRODUCT 2 2700 3000 2800 3800 2850 2400 3100 3250
COMPANY B PRODUCT 3 5900 4150 5750 3750 4200 6100 2950 4600
COMPANY B PRODUCT 4 550 600 0 650 200 700 100 500
COMPANY B PRODUCT 5 1500 3750 550 2100 1850 1700 3150 450
COMPANY C PRODUCT 6 19300 17250 23600 21250 18200 26950 18200 23900"
))
回答5:
I think the most explicit (i.e. no need to rename variables) way to reshape wide to long in R is to use the base R reshape()
function and and specify the varying columns to be "stacked" as a list
. See this blog post.
I'll use the data from JMT2080AD's answer and set the seed to set.seed(789)
.
### Create a list of the variables you want to reshape/stack
reshape.vars <- list(c("revenuesJan2010", "revenuesFeb2010", "revenuesMar2010", "revenuesApr2010"), # revenues
c("costJan2010", "costFeb2010", "costMar2010", "costApr2010")) # cost
### reshape wide to long
reshape(yourData, #dataframe
direction="long", #wide to long
varying=reshape.vars, #repeated measures list of indexes for vars to stack/reshape
timevar="date", #the repeated measures times
v.names=c("revenues", "cost")) #the repeated measures names
# company product date revenues cost id
# 1.1 Company A Product 1 1 2250 1574 1
# 2.1 Company A Product 2 1 734 1793 2
# 3.1 Company B Product 3 1 530 1282 3
# 4.1 Company B Product 4 1 1979 1741 4
# 5.1 Company B Product 5 1 1730 2558 5
# 6.1 Company C Product 6 1 550 1757 6
# 1.2 Company A Product 1 2 1932 1048 1
#...
# 5.3 Company B Product 5 3 890 1103 5
# 6.3 Company C Product 6 3 2113 2469 6
# 1.4 Company A Product 1 4 2426 2382 1
# 2.4 Company A Product 2 4 778 2995 2
# 3.4 Company B Product 3 4 1359 989 3
# 4.4 Company B Product 4 4 1618 912 4
# 5.4 Company B Product 5 4 895 2109 5
# 6.4 Company C Product 6 4 1258 2803 6
Using the list
method
- you do not have to rename variables
- since the variables you want to create are explicitly defined in the list, there are not errors having to do with
reshape()
inferring which variables should be stacked
I've found that even when there are a 100+ variables to reshape, if renaming them can be cumbersome, then using copy/paste to create the varying variables list does not take that long.
回答6:
As a stata to r convert who loved reshape in stata, I found tidyr::gather and tidyr::spread to be very intuitive. Gather is basically reshape long and spread is reshape wide.
Here is the code that would change your data to the way you want it:
new_data <-
gather(data = your-data-frame,
key = var_holder,
value = val_holder,
-company,
-product)
new_data$var_holder <- sub("REVENUE", "cost_", new_data$var_holder)
new_data$var_holder <- sub("COST", "cost_", new_data$var_holder)
new_data <-
separate(data = new_data,
col = var_holder,
into = c("var", "date")) %>%
spread(key = var,
value = val_holder)
And done!
gather works by taking all the variable names specified (or in this, NOT specified, note the two variables preceded by the '-' sign), and puts them under a new variable whose name is specified by "key = ..." (creating new rows as it goes). It then takes the values that fell under those variables and puts them under a single variable whose name is specified by "value = ...".
spread works in the opposite direction. Hope this helps!