How do I split a file into n parts?

Posted 2020-02-17 09:21

Question:

I have a file containing some number of lines. I want to split it into n files with particular names. It doesn't matter how many lines each file contains; I just want a particular number of files (say 5). The problem is that the number of lines in the original file keeps changing, so I need to count the lines and then split the file into 5 parts. If possible, each part should be sent to a different directory.

Answer 1:

In bash, you can use the split command to split a file based on the desired number of lines per part, and the wc command to work out what that number should be. Here's wc combined with split in one line.

For example, to split onepiece.log into 5 parts

    split -l$((`wc -l < onepiece.log`/5)) onepiece.log onepiece.split.log -da 4

This will create files like onepiece.split.log0000 ...

Note: bash division rounds down, so if there is a remainder there will be a 6th part file.
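If the extra 6th file is unwanted, one option is to round the lines-per-part count up instead of down. The following is a minimal sketch under that assumption; `split_into_parts` is a hypothetical helper name, and `-d` is assumed to be supported by the installed split, as in the command above. Note that rounding up guarantees at most N files, not exactly N: a very short file can produce fewer parts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split FILE into at most PARTS line-based pieces
# named PREFIX0000, PREFIX0001, ...
split_into_parts() {
    local file=$1 parts=$2 prefix=$3
    local total per_part
    total=$(wc -l < "$file")
    # Ceiling division, so a remainder does not spill into an extra file
    per_part=$(( (total + parts - 1) / parts ))
    # split rejects a line count of 0, which an empty file would produce
    [ "$per_part" -gt 0 ] || per_part=1
    split -l "$per_part" -d -a 4 "$file" "$prefix"
}
```

For the example above this would be called as `split_into_parts onepiece.log 5 onepiece.split.log`.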



Answer 2:

On Linux, there is a split command:

    split --lines=1m /path/to/large/file /path/to/output/file/prefix

Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is 'x'. With no INPUT, or when INPUT is -, read standard input.

...

-l, --lines=NUMBER put NUMBER lines per output file

...

You would have to calculate the actual size of the splits beforehand, though.



Answer 3:

Assuming you are processing a text file, use wc -l to determine the total number of lines and split -l to split into pieces of a specified number of lines (total / 5 in your case). This works on UNIX/Mac, and on Windows if you have Cygwin installed.



Answer 4:

This builds on the original answers given by @sketchytechky and @grasshopper. If you would like to deal with remainders differently and want a fixed number of output files, with lines distributed round-robin, the split command should be written as:

    split -da 4 -n r/1024 filename filename_split --additional-suffix=".log"

Replace 1024 with the number of files you want as output.
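The question also asks about sending each part to a different directory, which could be sketched as follows. This assumes GNU split, and it swaps in `-n l/N` (N chunks of roughly equal byte size, cut only at line boundaries) in place of the round-robin `r/N` above, so consecutive lines stay together. The helper name `split_to_dirs` and the `part_NN` directory layout are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split FILE into PARTS pieces and place each piece
# in its own directory OUTDIR/part_00, OUTDIR/part_01, ...
split_to_dirs() {
    local file=$1 parts=$2 outdir=$3
    local tmp piece dest i=0
    tmp=$(mktemp -d) || return 1
    # GNU split: l/N never cuts a line in half
    split -d -a 4 -n "l/$parts" "$file" "$tmp/chunk." || return 1
    for piece in "$tmp"/chunk.*; do
        dest=$(printf '%s/part_%02d' "$outdir" "$i")
        mkdir -p "$dest"
        # Each piece keeps the original file name inside its own directory
        mv "$piece" "$dest/$(basename "$file")"
        i=$((i + 1))
    done
    rmdir "$tmp"
}
```

Because the chunks are balanced by bytes rather than by line count, the parts may hold different numbers of lines, which the question says is acceptable.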



Answer 5:

I can think of a few ways to do it. Which you would use depends a lot on the data.

  1. Lines are fixed length: Find the size of the file by reading its directory entry and divide by the line length to get the number of lines. Use this to determine how many lines go in each file.

  2. The files only need to have approximately the same number of lines: Again, read the file size from the directory entry. Read the first N lines (N should be small, but still a reasonable fraction of the file) to calculate an average line length. Calculate the approximate number of lines from the file size and the predicted average line length. This assumes that the first N lines are representative of the whole file. If they are not, adjust your method to sample lines randomly (using seek() or something similar). Rewind the file after you have your average, then split it based on the predicted line length.

  3. Read the file twice: The first time, count the number of lines. The second time, split the file into the requisite pieces.
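Option 3 can be sketched in shell with wc for the first pass and awk for the second. This is only an illustration, not the answer's own code: `split_two_pass` is a hypothetical name, and the choice to give the first `total % parts` files one extra line is an assumption about how to spread the remainder. It produces exactly the requested number of files as long as the input has at least that many lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: two-pass split of FILE into PARTS files named
# PREFIX0000, PREFIX0001, ... with line counts differing by at most one.
split_two_pass() {
    local file=$1 parts=$2 prefix=$3
    local total
    total=$(wc -l < "$file")    # first pass: count the lines
    # second pass: hand out the lines, file by file
    awk -v total="$total" -v parts="$parts" -v prefix="$prefix" '
        BEGIN {
            base = int(total / parts); extra = total % parts
            idx = 0; left = base + (idx < extra ? 1 : 0)
            out = sprintf("%s%04d", prefix, idx)
        }
        {
            # Move to the next output file once the current one is full
            while (left == 0 && idx < parts - 1) {
                close(out); idx++
                left = base + (idx < extra ? 1 : 0)
                out = sprintf("%s%04d", prefix, idx)
            }
            print > out
            left--
        }
    ' "$file"
}
```

With 23 lines and 5 parts, for instance, the first three files receive 5 lines each and the last two receive 4.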

EDIT: Using a shell script (according to your comments), the randomized version of #2 would be hard unless you wrote a small program to do it for you. You should be able to use ls -l to get the file size, wc -l to count the exact number of lines, and head -nNNN | wc -c to get the byte count of the first NNN lines, which divided by NNN gives the average line length.
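The non-randomized estimate from #2 might look like the sketch below. It is an illustration under stated assumptions: `estimate_lines` is a hypothetical name, the default sample of 100 lines is arbitrary, and `wc -c < file` is used for the file size in place of parsing `ls -l` output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate the line count of FILE from its byte size
# and the average length of its first SAMPLE lines (default 100).
estimate_lines() {
    local file=$1 sample=${2:-100}
    local size sample_bytes sample_lines
    # $(( ... )) strips the padding some wc implementations print
    size=$(( $(wc -c < "$file") ))
    sample_bytes=$(( $(head -n "$sample" "$file" | wc -c) ))
    sample_lines=$(( $(head -n "$sample" "$file" | wc -l) ))
    # An empty file has nothing to sample
    if [ "$sample_bytes" -eq 0 ]; then
        echo 0
        return
    fi
    # size / (sample_bytes / sample_lines), kept in integer arithmetic
    echo $(( size * sample_lines / sample_bytes ))
}
```

Dividing the result by the number of parts gives an approximate value for split -l. The estimate is only as good as the sample, so files whose early lines are unusually short or long will be misjudged.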



Tags: file split