I have very large genotype files that are basically impossible to open in R, so I am trying to extract the rows and columns of interest using linux command line. Rows are straightforward enough using head/tail, but I'm having difficulty figuring out how to handle the columns.
If I attempt to extract (say) the 100-105th tab or space delimited column using
cut -c100-105 myfile >outfile
this obviously won't work if there are strings of multiple characters in each column. Is there some way to modify cut with appropriate arguments so that it extracts the entire string within a column, where columns are defined as space or tab (or any other character) delimited?
If the command should work with both tabs and spaces as the delimiter I would use awk
:
awk '{print $100,$101,$102,$103,$104,$105}' myfile > outfile
As long as you just need to specify 5 fields it is imo ok to just type them, for longer ranges you can use a for
loop:
awk '{for(i=100;i<=105;i++)print $i}' myfile > outfile
If you want to use cut
, you need to use the -f
option:
cut -f100-105 myfile > outfile
If the field delimiter is different from TAB
you need to specify it using -d
:
cut -d' ' -f100-105 myfile > outfile
Check the man page for more info on the cut command.
You can use cut with a delimiter like this:
with space delim:
cut -d " " -f1-100,1000-1005 infile.csv > outfile.csv
with tab delim:
cut -d$'\t' -f1-100,1000-1005 infile.csv > outfile.csv
I gave you the version of cut in which you can extract a list of intervals...
Hope it helps!