R; DPLYR: Convert a list of dataframes into a sing

Posted 2019-08-23 08:08

Question:

I have a list with multiple entries, an example entry looks like:

> head(gene_sets[[1]])
     patient Diagnosis Eigen_gene ENSG00000080824 ENSG00000166165 ENSG00000211459 ENSG00000198763 ENSG00000198938 ENSG00000198886
1 689_120604        AD -0.5606425           50137           38263          309298          528233          523420          730537
2 412_120503        AD  0.9454632           44536           23333          404316          730342          765963         1168123
3 706_120605        AD  0.6061834           16647           22021          409498          614314          762878         1171747
4 486_120515        AD  0.8164779           21871            9836          518046          697051          613621         1217262
5 469_120514        AD  0.5354927           33460           11651          468223          653745          608259         1115973
6 369_120502        AD -0.8363372           32168           44760          271978          436132          513194          784537

In every entry the first three columns (patient, Diagnosis, Eigen_gene) are the same; the number of columns after them varies.

What I would like to do is convert this entire list into a single dataframe. I need to retain two pieces of information: set_index, the index of the entry in the list, and the names of all columns after Eigen_gene through the last column.

I can think of solutions using loops, however I would like a dplyr/reshape solution.

To clarify, if we had a fake input that looked like:

> list(data.frame(patient= c(1,2,3), Diagnosis= c("AD","Control", "AD"), Eigen_gene= c(1.1, 2.3, 4.3), geneA= c(1,1,1), geneC= c(2,1,3), geneB= c(2,39,458)))
[[1]]
  patient Diagnosis Eigen_gene geneA geneC geneB
1       1        AD        1.1     1     2     2
2       2   Control        2.3     1     1    39
3       3        AD        4.3     1     3   458

The desired output would look like this (I have only shown an example of the first list entry for input, the output shows how other entries in the list would also be formatted):

> data.frame(set_index= c(1,1,1,2,2,2,3,3), gene= c("geneA", "geneC", "geneB", "geneF", "geneE", "geneH", "geneT", "geneZ"))
  set_index  gene
1         1 geneA
2         1 geneC
3         1 geneB
4         2 geneF
5         2 geneE
6         2 geneH
7         3 geneT
8         3 geneZ

Thanks!

Answer 1:

Here is a solution using purrr from the tidyverse. I extended the example input so that it produces the example output. The key function here is imap, which is shorthand for map2(x, seq_along(x), ...) when the list has no names, and for map2(x, names(x), ...) when it does. See ?imap for more. What we want to do is apply a function to each dataframe in the list together with its index. So we use the function ~ tibble(set_index = .y, gene = colnames(.x[4:ncol(.x)])).

  • ~, .x and .y are purrr shorthands for function(x, y), x and y. This lets us refer to the arguments of the function compactly. See ?map2.
  • set_index = .y creates the first column and fills it with the index of the current dataframe (tibble recycles the single value to match the length of the gene column).
  • gene = colnames(.x[4:ncol(.x)]) creates the second column from a vector of the gene names. colnames gets the variable names of the data frame, and the subset 4:ncol(.x) excludes the first three.
  • If we used plain imap, we would get a list of data frames. imap_dfr takes that list and binds its elements together as rows, producing our desired output (equivalent to calling bind_rows afterwards).
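To make the behaviour of imap concrete, here is a minimal sketch on two toy lists. The lists `unnamed` and `named` are made up for illustration and are not part of the question's data. It shows that .y is the integer position for an unnamed list and the element name for a named one.

```r
library(purrr)

# Hypothetical toy lists, used only to show what imap() passes as .y
unnamed <- list("a", "b", "c")
named   <- list(first = "a", second = "b")

# Unnamed list: .y is the integer position
idx_res <- imap_chr(unnamed, ~ paste(.y, .x))
idx_res
# "1 a" "2 b" "3 c"

# Named list: .y is the element name, and the result keeps the names
name_res <- imap_chr(named, ~ paste(.y, .x))
name_res
# values "first a" and "second b", named first and second
```

One consequence, assuming your real gene_sets list has names: set_index would then hold those names as character strings rather than integers. Calling unname() on the list first, or using map2 with seq_along(x) directly, should give positional indices instead.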
library(tidyverse)
gene_list <- list(
  data.frame(
    patient= c(1,2,3),
    Diagnosis= c("AD","Control", "AD"),
    Eigen_gene= c(1.1, 2.3, 4.3),
    geneA= c(1,1,1),
    geneC= c(2,1,3),
    geneB= c(2,39,458)
  ),
  data.frame(
    patient= c(1,2,3),
    Diagnosis= c("AD","Control", "AD"),
    Eigen_gene= c(1.1, 2.3, 4.3),
    geneF= c(1,1,1),
    geneE= c(2,1,3),
    geneH= c(2,39,458)
  ),
  data.frame(
    patient= c(1,2,3),
    Diagnosis= c("AD","Control", "AD"),
    Eigen_gene= c(1.1, 2.3, 4.3),
    geneT= c(1,1,1),
    geneZ= c(2,1,3)
  )
)

output <- gene_list %>%
  imap_dfr(~ tibble(set_index = .y, gene = colnames(.x[4:ncol(.x)])))
output
#> # A tibble: 8 x 2
#>   set_index gene 
#>       <int> <chr>
#> 1         1 geneA
#> 2         1 geneC
#> 3         1 geneB
#> 4         2 geneF
#> 5         2 geneE
#> 6         2 geneH
#> 7         3 geneT
#> 8         3 geneZ

Created on 2018-03-02 by the reprex package (v0.2.0).
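A possible variation, sketched under some assumptions rather than taken from the original answer: the subset 4:ncol(.x) counts backwards (4, 3) when a data frame has only the three leading columns, which makes the column lookup fail. Selecting gene columns by name avoids that. The helper `gene_names`, the vector `id_cols` and the small `gene_list` below are hypothetical names and data chosen for illustration. The sketch also uses imap followed by bind_rows instead of imap_dfr, since recent purrr releases mark the *_dfr variants as superseded.

```r
library(purrr)
library(tibble)
library(dplyr)

# Assumption: these three columns are present in every entry
id_cols <- c("patient", "Diagnosis", "Eigen_gene")

# Hypothetical input; the second entry deliberately has no gene columns
gene_list <- list(
  data.frame(patient = 1:2, Diagnosis = c("AD", "Control"),
             Eigen_gene = c(1.1, 2.3), geneA = c(1, 1), geneC = c(2, 1)),
  data.frame(patient = 1:2, Diagnosis = c("AD", "Control"),
             Eigen_gene = c(1.1, 2.3)),
  data.frame(patient = 1:2, Diagnosis = c("AD", "Control"),
             Eigen_gene = c(1.1, 2.3), geneT = c(1, 1))
)

# Build one (set_index, gene) tibble per entry; zero rows if no genes
gene_names <- function(df, index) {
  genes <- setdiff(colnames(df), id_cols)
  tibble(set_index = rep(index, length(genes)), gene = genes)
}

# unname() guarantees that imap() passes integer positions as the index
output <- bind_rows(imap(unname(gene_list), gene_names))
output
```

With this input, the entry without gene columns contributes no rows, so the result has set_index values 1, 1, 3 and genes geneA, geneC, geneT.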