I've got a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it up into 10 files that contain 200k lines, or 100 files that contain 20k lines (plus one file with the remainder; being evenly divisible doesn't matter).
I could do this fairly easily in Python but I'm wondering if there's any kind of ninja way to do this using bash and unix utils (as opposed to manually looping and counting / partitioning lines).
How about the split command?
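For example, to get pieces of 200000 lines each (a minimal sketch; bigfile.txt is a placeholder name):
split -l 200000 bigfile.txt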
Use:
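For example, with sed (a sketch; filename stands in for the input file):
sed -n '1,100p' filename > output.txt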
Here, 1 and 100 are the line numbers which you will capture in output.txt.
split (from GNU coreutils, since version 8.8 from 2010-12-22) includes the following parameter:
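Roughly, per split --help (wording varies slightly between coreutils versions):
-n, --number=CHUNKS     generate CHUNKS output files; CHUNKS may be
                        N (split into N files based on size of input),
                        l/N (split into N files without splitting lines), or
                        r/N (like l/N but using round-robin distribution)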
Thus,
split -n 4 input output.
will generate four files (output.a{a,b,c,d}) with the same number of bytes, but lines might be broken in the middle.
If we want to preserve full lines (i.e. split by lines), then this should work:
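Using the l/N chunk format, which splits into N pieces without breaking lines (a sketch with the same input file and output. prefix as above):
split -n l/4 input output.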
Related answer: https://stackoverflow.com/a/19031247
Split the file "file.txt" into files of 10000 lines each:
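(a minimal sketch; the pieces get split's default xaa, xab, ... names)
split -l 10000 file.txt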
Use HDFS getmerge to combine small files, then split the result into a proper size.
This method can break lines in the middle:
split -b 125m compact.file -d -a 3 compact_prefix
I tried to getmerge and then split the result into files of about 128 MB each.
Split into 128 MB pieces; check whether the size unit should be M or G, and please test before use.
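A sketch of a size-limited split that keeps lines intact (merged.file and part_prefix are placeholder names); -C caps each output file at the given size without splitting lines:
split -C 128m -d -a 3 merged.file part_prefix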
Have you looked at the split command?
You could do something like this:
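(a sketch; filename stands in for your input file)
split -l 200000 filename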
which will create files each with 200000 lines, named xaa xab xac ...
Another option: split by size of the output file (still splitting only at line breaks):
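A sketch using split -C with numeric suffixes (input_filename and output_prefix are placeholder names); -C caps each file at the given size without breaking lines:
split -C 20m --numeric-suffixes input_filename output_prefix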
creates files like output_prefix01 output_prefix02 output_prefix03 ..., each of at most 20 megabytes.