I'm in the process of compiling data from different data sets into one data set for analysis. I'll be doing data exploration, trying different things to find out what regularities may be hidden in the data, so I don't currently have a specific method in mind. Now I'm wondering if I should compile my data into long or wide format.
Which format should I use, and why?
I understand that data can be reshaped from long to wide or vice versa, but the mere existence of this functionality implies that the need to reshape sometimes arises and this need in turn implies that a specific format might be better suited for a certain task. So when do I need which format, and why?
I'm not asking about performance. That has been covered in other questions.
As Roland mentioned, most R functions need it in long format, and it is often easier to process data that way.
But on the other hand, it is easier for people to view and comprehend wide format, especially when it is being input and validated, where human comprehension is important for ensuring quality and accuracy.
So I see that data tends to start out life in wide format, and then becomes long as it is used more for processing. Fortunately, converting back and forth is pretty easy nowadays, especially with the tidyr package.
Hadley Wickham's Tidy Data paper, and the tidyr package that is his (latest) implementation of its principles, are a great place to start.
The rough answer to the question is that data, during processing, should always be long, and should only be widened for display purposes. Be cautious with this, though, as here "long" refers more to "tidy" than to the pure long form.
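To make the "long for processing, wide for display" idea concrete, here is a minimal sketch using tidyr's pivot_longer() and pivot_wider(). The table and its column names (subject, visit_1, visit_2, score) are made up for illustration and do not come from the question.

```r
library(tidyr)

# Hypothetical wide table: one row per subject, one column per visit.
wide <- data.frame(
  subject = c("a", "b"),
  visit_1 = c(5.1, 6.0),
  visit_2 = c(5.8, 6.4)
)

# Wide -> long: one row per subject and visit.
long <- pivot_longer(
  wide,
  cols = c(visit_1, visit_2),
  names_to = "visit",
  values_to = "score"
)

# Processing is straightforward in long form, e.g. the mean score per visit.
visit_means <- aggregate(score ~ visit, data = long, FUN = mean)

# Long -> wide again, only for display.
wide_again <- pivot_wider(long, names_from = visit, values_from = score)
```

The summary step never has to name the individual visit columns, which is the practical reason to stay long while processing.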
Examples
Take, for example, the mtcars dataset. This is already in tidy form, in that each row represents a single observation. So "lengthening" it, to get one row per car and variable, is counterproductive; mpg and cyl are not comparable in any meaningful way.
Taking the ChickWeight dataset (which is in long form) and transforming it to wide by time gives a view that may be useful for display but is very inconvenient for data analysis, as computing things like growth rate becomes cumbersome.
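The two examples above could be reproduced roughly as follows. This is a sketch, not the answer's original code: the helper names (cars, cars_long, chicks, chicks_wide, rate) and the "t" column prefix are my own choices, and the growth rate is taken here to mean weight gain per day between consecutive measurements.

```r
library(tidyr)

# mtcars is already tidy: one row per car. "Lengthening" it stacks
# unrelated measurements (mpg, cyl, ...) into a single value column.
cars <- data.frame(car = rownames(mtcars), mtcars, row.names = NULL)
cars_long <- pivot_longer(
  cars,
  cols = -car,
  names_to = "variable",
  values_to = "value"
)
head(cars_long)  # 32 cars x 11 variables = 352 rows in total

# ChickWeight is long: one row per chick and measurement time.
chicks <- as.data.frame(ChickWeight)  # drop the extra groupedData classes
chicks <- chicks[order(chicks$Chick, chicks$Time), ]

# Widening by Time gives one row per chick: readable, but awkward to compute on.
chicks_wide <- pivot_wider(
  chicks,
  id_cols = c(Chick, Diet),
  names_from = Time,
  values_from = weight,
  names_prefix = "t"
)

# In long form, a per-chick growth rate is one grouped difference.
# The first measurement of each chick has no predecessor, so its rate is NA.
gain    <- ave(chicks$weight, chicks$Chick, FUN = function(w) c(NA, diff(w)))
elapsed <- ave(chicks$Time,   chicks$Chick, FUN = function(t) c(NA, diff(t)))
chicks$rate <- gain / elapsed
```

Doing the same on chicks_wide would mean subtracting neighbouring columns pair by pair and handling chicks with missing time points separately.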
The answer is imho quite straightforward. By default, the long format takes up significantly more space, as the new "variable" column needs to be represented as well. However, the long format can also compress your data significantly: if you have a very sparse matrix, that is, if many cells are NA, you can specify na.rm = TRUE when reshaping to long so that the missing entries are dropped.
Furthermore, it allows more efficient calculations in many cases, but you defined that as out of scope.
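The sparse-data point can be illustrated with a small sketch. I am assuming the na.rm argument above refers to a melt-style function; the example below uses tidyr's pivot_longer() instead, whose values_drop_na = TRUE argument plays the same role. The sparse table is invented for illustration.

```r
library(tidyr)

# Hypothetical sparse wide table: 12 value cells, only 3 of them filled.
sparse <- data.frame(
  id = 1:4,
  a  = c(1, NA, NA, NA),
  b  = c(NA, NA, 2, NA),
  c  = c(NA, NA, NA, 3)
)

# Keeping the NAs: one row per id and variable.
long_all <- pivot_longer(
  sparse,
  cols = -id,
  names_to = "variable",
  values_to = "value"
)

# Dropping the NAs: only the cells that actually hold a value survive.
long_kept <- pivot_longer(
  sparse,
  cols = -id,
  names_to = "variable",
  values_to = "value",
  values_drop_na = TRUE
)

nrow(long_all)   # 12
nrow(long_kept)  # 3
```

Note that id 2 has no values at all and therefore vanishes from long_kept, so dropping NAs saves space at the cost of no longer recording which combinations were missing.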