Read and cbind second column of multiple files in

Posted 2019-07-13 03:48

Question:

I have 94 tab-delimited files with no header in a single directory '/path/', with gene names in the first column and counts in the second column. Each file has 23000 rows.

I would like to read all 94 files in /path/ into R and merge them into a single data frame 'counts.table', where the first column contains the gene names (identical and in the same order in column 1 of all 94 files) and the second to ninety-fifth columns contain the counts from each individual file (i.e. column 2 of each of the 94 files, which are unique numbers). The final counts.table data frame will have 23000 rows and 95 columns.
Ideally like this:

 Column1 Column2 Column3 Column4... to column 95 
 gene a      0      4      3 
 gene b      4      9      9 
 gene c      3      0      8 
 ...
 to row 23000

Column2 contains counts from sample X, Column3 counts from sample Y, column 4 from sample Z, etc.

Do I have to read each file into R individually and then merge them by adding the second column of each file with cbind to create 'counts.table'? Thanks in advance.

Answer 1:

Too long for a comment.

Something like this should work.

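# not tested
files <- list.files(path="/path", full.names=TRUE)            # full paths, so read.table can find the files
genes <- read.table(files[1], header=FALSE, sep="\t")[,1]     # gene names
df    <- do.call(cbind,lapply(files,function(fn)read.table(fn,header=FALSE, sep="\t")[,2]))
counts.table <- data.frame(genes, df)                         # data.frame() keeps the counts numeric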

list.files(...) returns the names of all the files in the specified path as a character vector. We then extract the gene names from column 1 of the first file (any of the files would do). lapply(files, function(fn)...) builds a list holding the second column of each file as a vector, and do.call(cbind, ...) binds these together column-wise into a matrix. Finally, we bind the gene names to the result.
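To see the steps above end to end, here is a minimal sketch that can be run without the real data. It writes three small demo files to a temporary directory and merges them. The directory name, the ".txt" extension, the sample names, and the idea of naming each count column after its file are assumptions made for illustration; they are not part of the original question.

```r
# Hypothetical demo data: three small tab-delimited files, no header,
# written to a temporary directory (stand-in for the real '/path/').
demo_dir <- file.path(tempdir(), "counts_demo")
dir.create(demo_dir, showWarnings = FALSE)
genes_demo  <- c("gene a", "gene b", "gene c")
demo_counts <- list(sampleX = c(0, 4, 3), sampleY = c(4, 9, 0), sampleZ = c(3, 9, 8))
for (s in names(demo_counts)) {
  write.table(data.frame(genes_demo, demo_counts[[s]]),
              file = file.path(demo_dir, paste0(s, ".txt")),
              sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
}

# full.names = TRUE returns paths that read.table can open from any working directory.
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Gene names: column 1 of the first file.
genes <- read.table(files[1], header = FALSE, sep = "\t",
                    stringsAsFactors = FALSE)[, 1]

# Second column of every file, collected as a list of vectors.
counts <- lapply(files, function(fn) read.table(fn, header = FALSE, sep = "\t")[, 2])

# Assumption: label each count column with its file name minus the extension.
names(counts) <- sub("\\.txt$", "", basename(files))

# data.frame() expands the named list into one column per file
# and keeps the counts numeric.
counts.table <- data.frame(gene = genes, counts, stringsAsFactors = FALSE)
print(counts.table)
```

Building the result with data.frame() rather than cbind() on a character vector and a numeric matrix avoids coercing the counts to character.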

Assumptions:

  1. The gene names are in the same order in all the files.
  2. All the files have exactly the same number of rows.
  3. The directory contains only your gene files.
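If you would rather verify these assumptions than trust them, a check along the following lines may help. This is only a sketch: check_gene_files is a hypothetical helper, not part of the original answer, and the ".txt" pattern used to skip unrelated files is an assumption about how the files are named.

```r
# Hypothetical helper: stops with an error if the files disagree on
# row count or on the gene names / their order in column 1.
check_gene_files <- function(files) {
  gene_cols <- lapply(files, function(fn)
    read.table(fn, header = FALSE, sep = "\t", stringsAsFactors = FALSE)[, 1])
  ref <- gene_cols[[1]]

  n_ok <- vapply(gene_cols, function(g) length(g) == length(ref), logical(1))
  if (!all(n_ok))
    stop("Row count differs in: ", paste(basename(files[!n_ok]), collapse = ", "))

  order_ok <- vapply(gene_cols, function(g) identical(g, ref), logical(1))
  if (!all(order_ok))
    stop("Gene names or order differ in: ",
         paste(basename(files[!order_ok]), collapse = ", "))

  invisible(TRUE)
}

# Demo on made-up files in a temporary directory.
check_dir <- file.path(tempdir(), "check_demo")
dir.create(check_dir, showWarnings = FALSE)
writeLines(c("gene a\t0", "gene b\t4"), file.path(check_dir, "good1.txt"))
writeLines(c("gene a\t4", "gene b\t9"), file.path(check_dir, "good2.txt"))
writeLines(c("gene b\t3", "gene a\t0"), file.path(check_dir, "swapped.txt"))
writeLines("not a count file", file.path(check_dir, "README.md"))

# The pattern keeps unrelated files (here README.md) out of the list.
files    <- list.files(check_dir, pattern = "\\.txt$", full.names = TRUE)
ok_files <- files[basename(files) != "swapped.txt"]

check_gene_files(ok_files)   # consistent files: returns TRUE invisibly

# The file with swapped gene order triggers an error; capture its message.
result <- tryCatch(check_gene_files(files), error = function(e) conditionMessage(e))
print(result)
```

Running the check before the cbind step means a mismatched file is reported by name instead of silently misaligning counts with genes.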