We want to subset a text file on rows and columns, where rows and columns numbers are read from a file. Excluding header (row 1) and rownames (col 1).
inputFile.txt Tab delimited text file
header 62 9 3 54 6 1
25 1 2 3 4 5 6
96 1 1 1 1 0 1
72 3 3 3 3 3 3
18 0 1 0 1 1 0
82 1 0 0 0 0 1
77 1 0 1 0 1 1
15 7 7 7 7 7 7
82 0 0 1 1 1 0
37 0 1 0 0 1 0
18 0 1 0 0 1 0
53 0 0 1 0 0 0
57 1 1 1 1 1 1
subsetCols.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 500K columns, and need to subset ~10K.
1,4,6
subsetRows.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 20K rows, and need to subset about ~300.
1,3,7
Current solution using cut and awk loop (Related post: Select rows using awk):
# define vars
fileInput=inputFile.txt
fileRows=subsetRows.txt
fileCols=subsetCols.txt
fileOutput=result.txt
# cut columns and awk rows
cut -f2- $fileInput | cut -f`cat $fileCols` | sed '1d' | awk -v s=`cat $fileRows` 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' > $fileOutput
Output file: result.txt
1 4 6
3 3 3
7 7 7
Question:
This solution works fine for small files, for bigger files 50K rows and 200K columns, it is taking too long, 15 minutes plus, still running. I think cutting the columns works fine, selecting rows is the slow bit.
Any better way?
Real input files info:
# $fileInput:
# Rows = 20127
# Cols = 533633
# Size = 31 GB
# $fileCols: 12000 comma separated col numbers
# $fileRows: 300 comma separated row numbers
More information about the file: file contains GWAS genotype data. Every row represents sample (individual) and every column represents SNP. For further region based analysis we need to subset samples(rows) and SNPs(columns), to make the data more manageable (small) as an input for other statistical softwares like r.
System:
$ uname -a
Linux nYYY-XXXX ZZZ Tue Dec 18 17:22:54 CST 2012 x86_64 x86_64 x86_64 GNU/Linux
Update: Solution provided below by @JamesBrown was mixing the orders of columns in my system, as I am using different version of awk, my version is: GNU Awk 3.1.7
One in Gnu awk version 4.0 or later as column ordering relies on
for
andPROCINFO["sorted_in"]
. The row and col numbers are read from files:Some performance gain could probably come from moving the
ARGV[3]
processing block to the top berore 1 and 2 and adding anext
to it's end.Python has a csv module. You read a row into a list, print the desired columns to stdout, rinse, wash, repeat.
This should slice columns 20,000 to 30,000.
Even though in If programming languages were countries, which country would each language represent? they say that...
... whenever you see yourself piping sed, cut, grep, awk, etc, stop and say to yourself: awk can make it alone!
So in this case it is a matter of extracting the rows and columns (tweaking them to exclude the header and first column) and then just buffering the output to finally print it.
With your sample file:
With your sample file, and inputs as variables, split on comma:
I am quite sure this will be way faster. You can for example check Remove duplicates from text file based on second text file for some benchmarks comparing the performance of
awk
overgrep
and others.Best,
Kim Jong‑un
Not to take anything away from both excellent answers. Just because this problem involves large set of data I am posting a combination of 2 answers to speed up the processing.
PS: This should work with older awk version or non-gnu awk as well.
to refine @anubhava solution we can get rid of searching over 10k values for each row to see if we are on the right row by takeing advantage of the fact the input is already sorted