To Split into fixed sequences and leave extra out

Posted 2020-08-02 04:48

Question:

I would like all files to have the same fixed length, except the last one, which can be any size up to 557. This means that the number of files can be greater than the one set by the -n flag of the split command.

Code 1 (ok)

$ seq -w 1 1671 > /tmp/k && gsplit -n15 /tmp/k && wc -c xaa && wc -c xao
557 xaa
557 xao

where xaa is the first file of the sequence and xao the last one. When I extend the sequence by one number, the last file xao grows by 5 units (557 -> 562), which I do not understand:

$ seq -w 1 1672 > /tmp/k && gsplit -n15 /tmp/k && wc -c xaa && wc -c xao
557 xaa
562 xao

Why does extending the sequence by one number increase the last file (xao) by 5 units?

Code 2

$ seq -w 1 1671 | gsed ':a;N;$!ba;s/\n//g' > /tmp/k && gsplit -n15 /tmp/k&& wc -c xaa && wc -c xao
445 xaa
455 xao
$ seq -w 1 1672 | gsed ':a;N;$!ba;s/\n//g' > /tmp/k && gsplit -n15 /tmp/k&& wc -c xaa && wc -c xao
445 xaa
459 xao

so extending the sequence by one number (4 characters) leads to a 4-character increase (455 -> 459), in contrast to the first code, where the increase is 5 characters.

Code 3

Let's now keep each unit of sequence fixed to 4 characters by seq -w 0 0.0001 1 | gsed 's/\.//g':

$ seq -w 0 0.0001 1 | gsed 's/\.//g' |  gsed ':a;N;$!ba;s/\n//g' > /tmp/k && gsplit -n15 /tmp/k&& wc -c xaa && wc -c xao
3333 xaa
3344 xao
$ seq -w 0 0.0001 1.0001 | gsed 's/\.//g' |  gsed ':a;N;$!ba;s/\n//g' > /tmp/k && gsplit -n15 /tmp/k&& wc -c xaa && wc -c xao
3334 xaa
3335 xao

so extending the sequence by one number increases xaa by one unit but decreases xao by 9 units. I do not find this behavior logical.

How can you fix the length of each file first, for instance at 557, and only then determine the number of resulting files?

Answer 1:

Original answer — for Code 1

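Because seq -w 1 1671 generates 5 characters per number: 4 digits and 1 newline. So adding one number to the output adds 5 bytes to the output. The file grows from 8355 bytes (15 * 557) to 8360 bytes (15 * 557 + 5), and the 5 left-over bytes end up in the last file.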
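The byte counts can be checked without running split at all. The following is a minimal sketch that only assumes seq and wc are available; the variable names a and b are illustrative.

```shell
# Each number is 4 zero-padded digits plus 1 newline, i.e. 5 bytes.
a=$(( $(seq -w 1 1671 | wc -c) ))   # 1671 * 5 bytes
b=$(( $(seq -w 1 1672 | wc -c) ))   # 1672 * 5 bytes
echo "difference: $(( b - a )) bytes"
echo "1671 numbers: $(( a / 15 )) bytes per file, $(( a % 15 )) left over"
echo "1672 numbers: $(( b / 15 )) bytes per file, $(( b % 15 )) left over"
```

The second run leaves 5 bytes over, which matches the 562 = 557 + 5 reported for xao in the question.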

Extra answer — for Code 2

You've asked GNU split (aka gsplit) to split the input file into 15 chunks. It does its best to even the sizes out. But there's a limit to what it can do when the total number of bytes is not a multiple of 15. There are options to control what happens.
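If the goal is the one stated in the question (pieces of a fixed size, with the number of files left open), a possible approach is to split by size with -b instead of by count with -n. This is a sketch, not part of the original answer: it assumes a split that supports -b (GNU split does, and on the asker's system that would be gsplit), and it works in a scratch directory created with mktemp.

```shell
# Work in a scratch directory so the x?? pieces do not clutter the current one.
cd "$(mktemp -d)" || exit 1
seq -w 1 1672 > k      # 1672 numbers * 5 bytes = 8360 bytes
split -b 557 k         # every piece is 557 bytes; the last one holds the rest
wc -c x*               # 15 full pieces (xaa..xao) plus a 5-byte xap
echo "number of files: $(( $(ls x* | wc -l) ))"
```

Here the file count is a result rather than an input: 8360 bytes give 15 full pieces and one short piece, and the last piece can never exceed 557 bytes.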

However, in the basic form, the -n 15 option means that the first 14 output files each get 445 characters, and the last gets 455 because there are 6685 = 445 * 15 + 10 characters in /tmp/k (1671 * 4 digits plus the one trailing newline that sed still prints). When you add another 4 characters to the file (4 rather than 5, because you delete the newlines), then the last file gets an additional 4 characters (because 6689 = 445 * 15 + 14).
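The rule just described can be written down as plain arithmetic. The helper below is hypothetical (chunk_sizes is not a real command); it only reproduces the behaviour observed in the question, where the remainder goes to the last file, and it does not call split itself.

```shell
# chunk_sizes TOTAL N: sizes of the first N-1 pieces and of the last piece,
# assuming the remainder of TOTAL / N is appended to the last piece.
chunk_sizes() {
    total=$1 n=$2
    base=$(( total / n ))
    echo "first $(( n - 1 )) files: $base bytes, last file: $(( base + total % n )) bytes"
}
chunk_sizes 6685 15   # 1671 numbers * 4 digits + 1 trailing newline
chunk_sizes 6689 15   # 1672 numbers * 4 digits + 1 trailing newline
```

The two calls print 455 and 459 for the last file, the values wc reported for xao in Code 2.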

Extra answer — for Code 3

First of all, the output from seq -w 0 0.0001 1 looks like:

0.0000
0.0001
0.0002
…
0.9998
0.9999
1.0000

So after the output is edited with the first sed, the numbers from 00000 to 10000 are present, one per line, with 6 characters per line (including the newline). Each number is therefore 5 digits long, not 4. The second sed again eliminates the newlines.

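There are 50006 bytes in /tmp/k on one line. That's equal to 15 * 3333 + 11, hence the first output (3333 for xaa and 3333 + 11 = 3344 for xao). The second variant has 50011 bytes in /tmp/k, which is 15 * 3334 + 1. Hence every file grows by one byte and the last file, at 3335, is only one byte larger than the others.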
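The same decomposition can be reproduced for any number of 5-digit entries. This is an illustrative sketch: split_layout is a hypothetical helper, and it assumes, as above, that /tmp/k holds the digits on one line followed by a single trailing newline.

```shell
# split_layout COUNT: total bytes for COUNT five-digit numbers joined on one
# line (plus one trailing newline), written as 15 * quotient + remainder.
split_layout() {
    total=$(( $1 * 5 + 1 ))
    echo "$total = 15 * $(( total / 15 )) + $(( total % 15 ))"
}
split_layout 10001   # seq -w 0 0.0001 1
split_layout 10002   # seq -w 0 0.0001 1.0001
```

Going from 10001 to 10002 entries adds 5 bytes, which pushes the quotient from 3333 to 3334 and drops the remainder from 11 to 1. That is why xaa gains one byte while xao loses nine.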



Tags: shell unix split