text file to dataframe with a list column

2019-09-12 11:11发布

问题:

I am trying to read in a text file like this:

exp1 sample1 2 5  
exp2 sample1 2 3 5 7
exp1 sample2 1 2 6

to a dataframe with a list column like this:

tibble(exp = c("exp1", "exp2", "exp3"), 
       sample = c("sample1","sample1","sample2"), 
       listdata = list(list(2,5), list(2,3,5,7), list(1,2,6)))

# A tibble: 3 x 3
    exp  sample   listdata
  <chr>   <chr>     <list>
1  exp1 sample1 <list [2]>
2  exp2 sample1 <list [4]>
3  exp3 sample2 <list [3]>

The purpose is to use the metadata in the first two columns to select and operate on the lists.

I can read in the lines as lists, but don't know how to separate the metadata:

listdata <- read_lines("list_c_data.txt") %>% strsplit(., " ") %>% tibble()

Any suggestions? I may need to read in the file line by line since the number of observations could be >100000 and the length of the list in each row could be >1000

回答1:

We read the file using read.table/read.csv with fill = TRUE, then gather (from tidyr) the 3rd to last column of the dataset to reshape it to 'long' format, grouped by 'V1' and 'V2', we summarise the 'Val' as a list and then rename the columns if necessary.

library(dplyr)
library(tidyr)
df1 <- read.table("yourfile.txt", header=FALSE, fill = TRUE)
gather(df1, Var, Val, V3:ncol(df1), na.rm = TRUE) %>%
         group_by(V1, V2) %>% 
         summarise(Val = list(Val)) %>%
         rename(exp=V1, sample = V2, listdata = Val)             

Or we can scan to read the rows, strsplit by space, convert the elements in 'lst' (filtering out the 1st and 2nd) to numeric while we rbind the 1st and 2nd elements to a data.frame and create the 'lst2' as the third column.

l1 <- trimws(scan("yourfile.txt", what ="", sep="\n", quiet=TRUE))
lst <- strsplit(l1, " ")
lst2 <- lapply(lst, function(x) as.numeric(x[-(1:2)]))
d1 <- setNames(do.call(rbind.data.frame, lapply(lst, 
                function(x) x[1:2])), c("exp", "sample"))
d1$listdata <- lst2