Split file occupying the same memory space as sour

Posted 2019-02-16 02:37

Question:

I have a file, say 100MB in size. I need to split it into (for example) 4 different parts. Let's say the first file covers 0-20MB, the second 20-60MB, the third 60-70MB and the last 70-100MB. But I do not want to do a safe split into 4 separate output files. I would like to do it in place. So the output files should use the same place on the hard disk that is occupied by this one source file, and literally split it, without making a copy (so at the moment of the split, we should lose the original file).

In other words, the input file is the output files.

Is this possible, and if yes, how?

I was thinking of manually adding records to the filesystem saying that file A starts here and ends here (in the middle of another file), doing that 4 times, and afterwards removing the original file. But for that I would probably need administrator privileges, and it probably wouldn't be safe or healthy for the filesystem.

Programming language doesn't matter, I'm just interested if it would be possible.

Answer 1:

The idea is not so mad as some comments paint it. It would certainly be possible to have a file system API that supports such reinterpreting operations (to be sure, the desired split is probably not exactly aligned to block boundaries, but you could reallocate just those few boundary blocks and still save a lot of temporary space).
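
To make the block-boundary point concrete, here is a minimal sketch that reports, for each split offset, whether it falls exactly on a block boundary and, if not, which block the two neighbouring output files would share. The 4096-byte block size and the name `boundary_report` are assumptions for illustration only; real file systems use various block sizes.

```bash
# Hypothetical helper, not a real file system API.
# BLOCK_SIZE=4096 is an assumed value; check your own file system.
BLOCK_SIZE=4096

# Usage: boundary_report OFFSET...
boundary_report() {
    for offset in "$@"; do
        if [ $((offset % BLOCK_SIZE)) -eq 0 ]; then
            echo "$offset aligned"
        else
            # The block containing this offset holds data for two output
            # files, so it is the only block that would need copying.
            echo "$offset straddles block $((offset / BLOCK_SIZE))"
        fi
    done
}

# Split points from the question (20MB, 60MB, 70MB) plus one unaligned offset.
boundary_report $((20*1048576)) $((60*1048576)) $((70*1048576)) 20971521
```

With whole-megabyte split points every offset is aligned, so under this assumption no block would have to be copied at all; only an odd offset such as the last one touches a shared block.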

None of the common file system abstraction layers support this; but recall that they don't even support something as reasonable as "insert mode" (which would rewrite only one or two blocks when you insert something into the middle of a file, instead of all blocks), only an overwrite and an append mode. The reasons for that are largely historical, but the current model is so entrenched that it is unlikely a richer API will become common any time soon.



Answer 2:

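As I explain in this question on SuperUser, you can achieve this using the technique outlined by Tom Zych in his comment: work backwards from the end of the file, copying the last chunk to a new file and then truncating the original at that chunk's starting offset.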

bigfile="mybigfile-100Mb"
chunkprefix="chunk_"
# Chunk offsets
OneMegabyte=1048576
chunkoffsets=(0 $((OneMegabyte*20)) $((OneMegabyte*60)) $((OneMegabyte*70)))

# Work backwards from the last chunk, so each chunk is always the tail of $bigfile.
currentchunk=$((${#chunkoffsets[@]}-1))
while [ "$currentchunk" -ge 0 ]; do
    # Print current chunk number, so we know it is still running.
    echo -n "$currentchunk "
    offset=${chunkoffsets[$currentchunk]}
    # Copy end of $bigfile to new file (tail -c +N counts from 1, hence offset+1).
    # Stop if the copy fails, so we never truncate data that was not saved.
    tail -c +$((offset+1)) "$bigfile" > "$chunkprefix$currentchunk" || exit 1
    # Chop end of $bigfile
    truncate -s "$offset" "$bigfile"
    currentchunk=$((currentchunk-1))
done

You need to give the script the starting position of each chunk (offset in bytes; zero means a chunk starting at bigfile's first byte), in ascending order, as in the chunkoffsets array on the fifth line. When the loop finishes, bigfile is left empty and can be deleted.
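
Because the split destroys the original, it may be worth checking that the chunks still add up to the same data. The following is a minimal sketch of such a check, not part of the original answer; the helper name `chunks_checksum` is hypothetical, and it assumes chunks are named with the prefix followed by 0, 1, 2 and so on, as the script above produces.

```bash
# Hypothetical verification helper.
# Usage: chunks_checksum PREFIX COUNT
# Concatenates PREFIX0 .. PREFIX(COUNT-1) in numeric order and prints the
# cksum result (CRC and byte count) of the concatenation.
chunks_checksum() {
    prefix=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        cat "$prefix$i"
        i=$((i+1))
    done | cksum
}

# Record the checksum before running the split:
#   before=$(cksum < "$bigfile")
# and compare once the split has finished:
#   [ "$before" = "$(chunks_checksum "$chunkprefix" "${#chunkoffsets[@]}")" ] && echo OK
```

The loop counts by index rather than using a glob such as `chunk_*`, because a glob sorts `chunk_10` before `chunk_2` and would concatenate the pieces in the wrong order.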

If necessary, automate it using seq. The following command builds a chunkoffsets array with one chunk at 0, one starting at 100KB, one every megabyte for the range 1 to 10MB, and one every two megabytes for the range 10 to 20MB. Note the -1 in the last parameter of each seq call, which excludes the upper bound.

OneKilobyte=1024
OneMegabyte=$((1024*OneKilobyte))
chunkoffsets=(0 $((100*OneKilobyte)) $(seq $OneMegabyte $OneMegabyte $((10*OneMegabyte-1))) $(seq $((10*OneMegabyte)) $((2*OneMegabyte)) $((20*OneMegabyte-1))))

To see which chunks you have set:

for offset in "${chunkoffsets[@]}"; do echo "$offset"; done
0
102400
1048576
2097152
3145728
4194304
5242880
6291456
7340032
8388608
9437184
10485760
12582912
14680064
16777216
18874368
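
Since the split script truncates the original file as it goes, a mistake in the offset list is costly: out-of-order offsets would produce wrong chunks. Below is a minimal sketch of a sanity check to run first. The function name `check_offsets` is hypothetical and not part of the original answer.

```bash
# Hypothetical sanity check: every offset must be a non-negative integer
# that is strictly greater than the one before it.
# Usage: check_offsets "${chunkoffsets[@]}"
check_offsets() {
    previous=-1
    for offset in "$@"; do
        if [ "$offset" -le "$previous" ]; then
            echo "offset $offset does not come after $previous" >&2
            return 1
        fi
        previous=$offset
    done
    return 0
}
```

A possible use is `check_offsets "${chunkoffsets[@]}" || exit 1` placed just before the splitting loop.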

This technique has the drawback that it needs free space at least as large as the largest chunk, because each chunk briefly exists twice: once in its new file and once at the end of the original, until the truncate. You can mitigate that by making smaller chunks and concatenating them somewhere else, though. Also, it copies all the data, so it's nowhere near instant.
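
To estimate how much free space a given set of offsets requires, you can compute the largest chunk up front. This is a minimal sketch under the assumption that offsets are in ascending order; the function name `largest_chunk` is made up for this example.

```bash
# Hypothetical helper.
# Usage: largest_chunk TOTAL_SIZE OFFSET...
# Offsets are chunk starting positions in ascending order; TOTAL_SIZE is the
# size of the file in bytes. Prints the size of the largest chunk in bytes.
largest_chunk() {
    total=$1
    shift
    largest=0
    previous=
    # Appending the total size closes off the last chunk.
    for offset in "$@" "$total"; do
        if [ -n "$previous" ]; then
            size=$((offset - previous))
            if [ "$size" -gt "$largest" ]; then
                largest=$size
            fi
        fi
        previous=$offset
    done
    echo "$largest"
}

# The question's example: a 100MB file with chunks starting at 0, 20, 60 and 70MB.
largest_chunk $((100*1048576)) 0 $((20*1048576)) $((60*1048576)) $((70*1048576))
```

For the question's example the chunks are 20, 40, 10 and 30MB, so the call prints 41943040 (40MB), which is the free space the split would need at its peak.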

As to the fact that some hardware video recorders (PVRs) manage to split videos within seconds, they probably only store a list of offsets for each video (a.k.a. chapters), and display these as independent videos in their user interface.
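
The same idea can be imitated on an ordinary file: keep the big file intact and read a "virtual" chunk by offset and length whenever it is needed. The sketch below is only an illustration of that idea, not a description of how any particular recorder works; `read_chunk` is a hypothetical name, and `head -c` is assumed to be available (it is in GNU and BSD userlands).

```bash
# Hypothetical helper: treat a byte range of a file as if it were its own file.
# Usage: read_chunk FILE OFFSET LENGTH
# Prints LENGTH bytes of FILE starting at zero-based byte OFFSET.
read_chunk() {
    # tail -c +N counts from 1, so add 1 to the zero-based offset.
    tail -c +"$(($2 + 1))" "$1" | head -c "$3"
}

# Example: the second chunk of the question (20MB to 60MB) without splitting:
#   read_chunk mybigfile-100Mb $((20*1048576)) $((40*1048576)) | some_consumer
```

Nothing is copied or truncated here, which is why such a "split" can appear instant; the trade-off is that the chunks are not separate files that can be deleted independently.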