efficiently splitting one file into several files

Posted 2019-05-30 14:17

I have a very large tab-delimited text file. Many lines in the file have the same value in one of its columns (call it column k). I want to split this file into multiple files, putting entries with the same value of k into the same file. How can I do this? For example:

a foo
1 bar
c foo
2 bar
d foo

should be split into a file called "foo" containing the entries "a foo", "c foo", and "d foo", and a file called "bar" containing the entries "1 bar" and "2 bar".

How can I do this in either a shell script or Python?

Thanks.

3 Answers
手持菜刀,她持情操
#2 · 2019-05-30 14:53

After running both versions of the awk commands above (and having awk error out), and seeing the request for a Python version, I embarked on a short and not particularly arduous journey of writing a utility to easily split files by key.

Github repo: https://github.com/gstaubli/split_file_by_key

Background info: http://garrens.com/blog/2015/04/02/split-file-by-keys/

Awk error:

awk: 14 makes too many open files
 input record number 4555369, file part-r-00000
 source line number 1
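
For reference, here is a minimal Python sketch of the same key-splitting idea (not the linked utility; the function name, key column index, and handle limit below are just illustrative). It writes each line to a file named after its key and closes all handles whenever too many are open, which is exactly what the awk error above is complaining about:

import sys

def split_by_key(path, key_index=1, max_open=100):
    # key -> open file handle; kept bounded to avoid "too many open files"
    handles = {}
    with open(path) as infile:
        for line in infile:
            if not line.strip():
                continue
            key = line.rstrip("\n").split("\t")[key_index]
            out = handles.get(key)
            if out is None:
                if len(handles) >= max_open:
                    # close everything rather than exceed the OS limit
                    for h in handles.values():
                        h.close()
                    handles.clear()
                # append mode, so a reopened key keeps its earlier lines
                out = open(key, "a")
                handles[key] = out
            out.write(line)
    for h in handles.values():
        h.close()

if __name__ == "__main__":
    split_by_key(sys.argv[1])
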
Explosion°爆炸
#3 · 2019-05-30 14:56

This should work per your spec:

awk '{outFile=$2; print $0 > outFile}' BigManegyFile

Hope this helps.

Anthone
#4 · 2019-05-30 15:12

I'm not sure how efficient it is, but the quick and easy way is to take advantage of the way file redirection works in awk:

awk '{ print >> $5 }' yourfile

That will append each line (unmodified) into a file named after column 5. Adjust as necessary.
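
In case a non-awk version is useful, a rough Python equivalent of that append-redirection idea might look like the sketch below (the input file name and column index are placeholders, and it reopens the output file in append mode for every line, which is simple but slower than awk's cached handles):

import sys

# Append each line to a file named after column 5 (index 4) of the tab-delimited input.
with open(sys.argv[1]) as infile:
    for line in infile:
        key = line.rstrip("\n").split("\t")[4]
        with open(key, "a") as out:  # "a" mirrors awk's >> append behavior
            out.write(line)
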
