I have a very large dataset that I need to reshape from wide to long.
My datset looks something like:
COMPANY PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010 ... REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010 ... COSTSDEC2016
COMPANY A PRODUCT 1 6400 11050 6550 10600 8500 10400 9100 9850
COMPANY A PRODUCT 2 2700 3000 2800 3800 2850 2400 3100 3250
COMPANY B PRODUCT 3 5900 4150 5750 3750 4200 6100 2950 4600
COMPANY B PRODUCT 4 550 600 0 650 200 700 100 500
COMPANY B PRODUCT 5 1500 3750 550 2100 1850 1700 3150 450
COMPANY C PRODUCT 6 19300 17250 23600 21250 18200 26950 18200 23900
And I want them to look like:
COMPANY PRODUCT DATE REVENUES COSTS
COMPANY A PRODUCT 1 Dec-16 10600 9850
COMPANY A PRODUCT 1 Feb-10 11050 10400
COMPANY A PRODUCT 1 Jan-10 6400 8500
COMPANY A PRODUCT 1 Mar-10 6550 9100
COMPANY A PRODUCT 2 Dec-16 3800 3250
COMPANY A PRODUCT 2 Feb-10 3000 2400
COMPANY A PRODUCT 2 Jan-10 2700 2850
COMPANY A PRODUCT 2 Mar-10 2800 3100
COMPANY B PRODUCT 3 Dec-16 3750 4600
COMPANY B PRODUCT 3 Feb-10 4150 6100
COMPANY B PRODUCT 3 Jan-10 5900 4200
COMPANY B PRODUCT 3 Mar-10 5750 2950
COMPANY B PRODUCT 4 Dec-16 650 500
COMPANY B PRODUCT 4 Feb-10 600 700
COMPANY B PRODUCT 4 Jan-10 550 200
COMPANY B PRODUCT 4 Mar-10 0 100
COMPANY B PRODUCT 5 Dec-16 2100 450
COMPANY B PRODUCT 5 Feb-10 3750 1700
COMPANY B PRODUCT 5 Jan-10 1500 1850
COMPANY B PRODUCT 5 Mar-10 550 3150
COMPANY C PRODUCT 6 Dec-16 21250 23900
COMPANY C PRODUCT 6 Feb-10 17250 26950
COMPANY C PRODUCT 6 Jan-10 19300 18200
COMPANY C PRODUCT 6 Mar-10 23600 18200
In Stata I would type reshape long REVENUES COSTS, i(COMPANY PRODUCT) j(DATE) string
How do I do this in R?
Here's another option using
tidyverse
andstringr
:The solution using
tidyverse
andstringr
:As of version 1.9.6 (on CRAN 19 Sep 2015),
data.table
can melt multiple columns simultaneously (using thepatterns()
function). So, the columns starting withREVENUES
andCOSTS
can be gathered into two separate columns.In addition, the dates (months) are packed into the columns names without separator. These are extracted from the column names using a regular expression with look-behind and are used to replace the factor levels of the
DATE
column.Edit: Using ISO months naming scheme for proper ordering
Using the naming scheme of alphabetic month name and year does not allow to sort the data by
DATE
properly.DEC2016
come beforeFEB2010
, andFEB2010
beforeJAN2010
. The ISO 8601 nameing convention puts the year first followed by the number of the months.We can use this naming scheme as follows:
Data
The tricky thing here is that you have your dates packed into the column names. Those have to be parsed out before you can make the table as you would like it. I have iterated through each column, parsing each sub table column name for the date and type of observation, binding each sub table, then casting on cost/revenue. I'm sure there is a more elegant solution out there.
As a stata to r convert who loved reshape in stata, I found tidyr::gather and tidyr::spread to be very intuitive. Gather is basically reshape long and spread is reshape wide.
Here is the code that would change your data to the way you want it:
And done!
gather works by taking all the variable names specified (or in this, NOT specified, note the two variables preceded by the '-' sign), and puts them under a new variable whose name is specified by "key = ..." (creating new rows as it goes). It then takes the values that fell under those variables and puts them under a single variable whose name is specified by "value = ...".
spread works in the opposite direction. Hope this helps!
There are several other ways to approach this that would be a bit more streamlined than the "tidyverse" options already suggested.
All of the following examples use the sample data from JMT2080AD's answer with
set.seed(1)
(for reproducibility).Option 1: Base R's
reshape
It's not always the easist function to use, but the
reshape
function is pretty powerful once you figre it out. In this case, you don't have asep
, which makes things a little bit trickier in the sense that you have to be more specific about things like your resulting variable names and the values that should show up as "times" (by default, they would just sequential numbers).That's pretty much what you're looking for, maybe with a bit of difference in the date formatting.
Option 2:
data.table
If performance is what you're after, you can look at
melt
from "data.table", with which you should be able to do something like the following. As with thereshape
approach, you'll need to store the "times" to reintroduce the dates aftermelt
ing the data.(Note: I know this is very similar to @Uwe's approach.)
Option 3:
merged.stack
My "splitstackshape" pacakge has a function called
merged.stack
that tries to make this particular kind of reshaping easier to do. With it, you could try:One day, I'll get around to updating the function, which was written before
melt
in "data.table" could handle a semi-wide output format. I've already come up with a partial solution, but then I stopped fiddling with it.In fact, using the above linked function, the solution would be a simple:
Option 4:
extract
from the "tidyverse"The other solutions using the tidyverse seem to be going about things in a very strange manner. A better solution would be to use
extract
to get the data you need into new columns. You'll have to firstgather
the data into a very long format and thenspread
the data into a wide format.Here's the approach I would use:
I think the most explicit (i.e. no need to rename variables) way to reshape wide to long in R is to use the base R
reshape()
function and and specify the varying columns to be "stacked" as alist
. See this blog post.I'll use the data from JMT2080AD's answer and set the seed to
set.seed(789)
.Using the
list
methodreshape()
inferring which variables should be stackedI've found that even when there are a 100+ variables to reshape, if renaming them can be cumbersome, then using copy/paste to create the varying variables list does not take that long.