I have the following dataframe:
df = data.frame(A_1 = c(1,2,3), A_2 = c(4,5,6), A_3 = c(7,8,9), B_1 = c(10, 11, 12), B_2 = c(13, 14, 15), B_3 = c(16, 17, 18))
#> df
# A_1 A_2 A_3 B_1 B_2 B_3
#1 1 4 7 10 13 16
#2 2 5 8 11 14 17
#3 3 6 9 12 15 18
The column names contain both a letter and a number. The letter refers to a specific variable (e.g A is a factor, B is a factor), while the numbers in the column names, refer to individuals. In other words, each individual has values for A and B: A_1 and B_1 are columns for Individual 1, and A_2, B_2 are columns for Individual 2, etc.
I would like to achieve the following result: note that all the "A" columns are merge into one "A" column, and the same goes for the "B" columns, etc. :
A B
# 1 10
# 2 11
# 3 12
# 4 13
# 5 14
# 6 15
# 7 16
# 8 17
# 9 18
Is there any easy way to achieve that? Please note that my real dataframe contains more than 20 distinct letter columns (A, B, C, ...), each letter having three subcolumns (e.g: A_1, A_2, A_3).
Thanks!!
I'd
unlist
the relevant columns of adata.frame
. There are many ways to group the columns into unqiue persons (I really like Ananda's for instance), but using regular expressions is another way...You can get the data in the shape you want like this:
That same code should work for a large data frame, as long as there are three columns per individual. Then you just need to assign column names.
This is known as "reshaping" your data from a "wide" format to a "long" format. In base R, one tool is
reshape
, but you'll need an "id" variable first:You can drop the other columns if required.
For fun, here's another approach, using the "reshape2" package (start with your original sample data):
If you live on the bleeding edge, "data.table" version 1.8.11 has now implemented "melt" and "dcast". I haven't played much with it yet, but it is pretty straightforward too. Again, as with all the solutions I've provided so far, an "id" is needed.
Update
Another option is to use
merged.stack
from my "splitstackshape" package. It works nicely if you also useas.data.table(df, keep.rownames = TRUE)
, which would create the equivalent of thedata.table(cbind(id = sequence(nrow(df)), df))
step in the "data.table" approach.And for fairness/completeness, here's an approach with "tidyr" + "dplyr".