I have a tab-delimited text file that is very large. Many lines in the file have the same value for one of the columns in the file (call it column k). I want to separate this file into multiple files, putting entries with the same value of k in the same file. How can I do this? For example:
a foo
1 bar
c foo
2 bar
d foo
should be split into a file "foo" containing the entries "a foo" and "c foo" and "d foo" and a file called "bar" containing the entries "1 bar" and "2 bar".
How can I do this in either a shell script or Python?
Thanks.
I'm not sure how efficient it is, but the quick and easy way is to take advantage of the way file redirection works in awk:
awk '{ print >> $5 }' yourfile
That will append each line (unmodified) into a file named after column 5. Adjust as necessary.
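For the Python half of the question, a rough equivalent might look like the sketch below. The function name `split_by_column` and its parameters are made up for illustration. It assumes the input is tab-delimited and that every key value is safe to use as a file name. Like the awk one-liner, it appends to any output file that already exists and keeps one handle open per distinct key.

```python
import os


def split_by_column(path, col, out_dir="."):
    """Append each line of `path` to a file named after its field `col` (0-based)."""
    handles = {}  # key value -> open output file
    try:
        with open(path) as src:
            for line in src:
                fields = line.rstrip("\n").split("\t")
                if col >= len(fields):
                    continue  # line has too few columns; skip it
                key = fields[col]
                if key not in handles:
                    # Append mode mirrors awk's ">>" redirection.
                    handles[key] = open(os.path.join(out_dir, key), "a")
                handles[key].write(line)
    finally:
        for fh in handles.values():
            fh.close()
```

With this sketch, `split_by_column("yourfile", 4)` would correspond to the `$5` in the awk command, since the column index here is 0-based.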
This should work per your spec:
awk '{outFile=$2; print $0 > outFile}' BigManegyFile
Hope this helps.
After running both of the awk commands above (and having awk error out) and seeing the request for a Python version, I embarked on a short and not particularly arduous journey of writing a utility to easily split files based on keys.
Github repo: https://github.com/gstaubli/split_file_by_key
Background info: http://garrens.com/blog/2015/04/02/split-file-by-keys/
Awk error:
awk: 14 makes too many open files
input record number 4555369, file part-r-00000
source line number 1
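One way to sidestep that error in Python is to cap the number of output files held open at once and reopen them in append mode when needed. The sketch below illustrates the idea only; it is not taken from the linked repo, and the function name `split_by_key` and the parameters `max_open` and `sep` are hypothetical. It assumes the key values are valid file names and that the output directory starts out empty, since existing files would be appended to.

```python
import os
from collections import OrderedDict


def split_by_key(path, col, out_dir=".", max_open=100, sep="\t"):
    """Split `path` into one file per distinct value of field `col` (0-based),
    keeping at most `max_open` output files open at any time."""
    open_files = OrderedDict()  # key -> handle, least recently used first
    try:
        with open(path) as src:
            for line in src:
                fields = line.rstrip("\n").split(sep)
                if col >= len(fields):
                    continue  # line has too few columns; skip it
                key = fields[col]
                fh = open_files.get(key)
                if fh is None:
                    if len(open_files) >= max_open:
                        # Evict the least recently used handle to stay under the cap.
                        _, oldest = open_files.popitem(last=False)
                        oldest.close()
                    # Append mode, so a file reopened after eviction keeps its data.
                    fh = open(os.path.join(out_dir, key), "a")
                    open_files[key] = fh
                else:
                    open_files.move_to_end(key)  # mark as most recently used
                fh.write(line)
    finally:
        for fh in open_files.values():
            fh.close()
```

Keeping `max_open` well below the per-process file descriptor limit trades some reopen overhead for not failing on inputs with many distinct keys. If the input is sorted or mostly grouped by key, very few reopens should occur.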