Extracting columns from text file with different d

2020-01-30 07:32发布

问题:

I have very large genotype files that are basically impossible to open in R, so I am trying to extract the rows and columns of interest using linux command line. Rows are straightforward enough using head/tail, but I'm having difficulty figuring out how to handle the columns.

If I attempt to extract (say) the 100-105th tab or space delimited column using

 cut -c100-105 myfile >outfile

this obviously won't work if there are strings of multiple characters in each column. Is there some way to modify cut with appropriate arguments so that it extracts the entire string within a column, where columns are defined as space or tab (or any other character) delimited?

回答1:

If the command should work with both tabs and spaces as the delimiter I would use awk:

awk '{print $100,$101,$102,$103,$104,$105}' myfile > outfile

As long as you just need to specify 5 fields it is imo ok to just type them, for longer ranges you can use a for loop:

awk '{for(i=100;i<=105;i++)print $i}' myfile > outfile

If you want to use cut, you need to use the -f option:

cut -f100-105 myfile > outfile

If the field delimiter is different from TAB you need to specify it using -d:

cut -d' ' -f100-105 myfile > outfile

Check the man page for more info on the cut command.



回答2:

You can use cut with a delimiter like this:

with space delim:

cut -d " " -f1-100,1000-1005 infile.csv > outfile.csv

with tab delim:

cut -d$'\t' -f1-100,1000-1005 infile.csv > outfile.csv

I gave you the version of cut in which you can extract a list of intervals...

Hope it helps!



标签: linux