Modifying files nested in a tar archive

Posted 2019-07-23 11:54

I am trying to use grep and then sed to find and replace specific strings in files that sit inside multiple tar archives, all of which are inside one master tar archive. Right now, I modify the files by:

  1. First extracting the master tar archive.
  2. Then extracting all the tars inside it.
  3. Then doing a recursive grep and then sed to replace a specific string in files.
  4. Finally packaging everything into tar archives again, and then packing all of those archives back into the master archive.

Pretty tedious. How do I do this automatically using shell scripting?

2 Answers
啃猪蹄的小仙女
#2 · 2019-07-23 12:20

You can probably run sed over the tar stream directly, because tar does not do any compression of its own.

e.g.

zcat archive.tar.gz | sed -e 's/foo/bar/g' | gzip > archive2.tar.gz

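However, beware that this also replaces foo with bar in file names, user names and group names stored in the archive, and it ONLY works if foo and bar are of equal length, because tar records each member's size in its header and a change in length would leave the archive inconsistent.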
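As a minimal sketch of that last caveat, a guard like the following could be run before the stream edit. The helper name same_length and the variables old and new are hypothetical, introduced only for illustration; it compares byte lengths, which is what matters inside the archive.

```shell
# Hypothetical helper: succeed only if both strings have the same byte length.
# The command substitutions are left unquoted on purpose so that any padding
# some wc implementations add around the count is stripped.
same_length() {
    [ $(printf '%s' "$1" | LC_ALL=C wc -c) -eq $(printf '%s' "$2" | LC_ALL=C wc -c) ]
}

old=foo
new=bar
if same_length "$old" "$new"
then
    echo "ok to stream-edit: $old -> $new"
else
    echo "refusing: $old and $new differ in length" >&2
fi
```

The guard does nothing about the other caveat: matches in file names, user names and group names are still rewritten.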

做个烂人
#3 · 2019-07-23 12:26

There isn't going to be much option except automating the steps you outline, for the reasons demonstrated by the caveats in the answer by Kimvais.

tar modify operations

The tar command has some options to modify existing tar files. They are, however, not appropriate for your scenario for multiple reasons, one of them being that it is the nested tarballs that need editing rather than the master tarball. So, you will have to do the work longhand.

Assumptions

Are all the archives in the master archive extracted into the current directory or into a named/created sub-directory? That is, when you run tar -tf master.tar.gz, do you see:

subdir-1.23/tarball1.tar
subdir-1.23/tarball2.tar
...

or do you see:

tarball1.tar
tarball2.tar

(Note that nested tars should not themselves be gzipped if they are to be embedded in a bigger compressed tarball.)
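One way to answer that question without reading the whole listing is to reduce it to its distinct top-level names. This is only a sketch; the helper name top_level_entries is hypothetical, and it assumes member names contain no newlines.

```shell
# Hypothetical helper: read a tar listing on stdin and print each distinct
# top-level name once. A leading "./" is ignored.
top_level_entries() {
    sed -e 's|^\./||' -e 's|/.*||' -e '/^$/d' | sort -u
}

# Intended use, with the archive name from the text above:
#   tar -tf master.tar.gz | top_level_entries
# A single line of output means everything extracts into one sub-directory.
printf 'subdir-1.23/tarball1.tar\nsubdir-1.23/tarball2.tar\n' | top_level_entries
```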

master_repackager

Assuming you have the subdirectory notation, then you can do:

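for master in "$@"
do
    tmp="$(pwd)/xyz.$$"
    trap 'rm -fr "$tmp"; exit 1' 0 1 2 3 13 15
    cat "$master" |
    (
    mkdir "$tmp" || exit 1
    cd "$tmp" || exit 1
    tar -xzf - || exit 1   # -z: the master is gzipped, and GNU tar does not auto-detect compression on a pipe
    cd * || exit 1      # There is only one directory in the newly created one!
    process_tarballs *
    cd ..
    tar -czf - *   # There is only one directory down here
    ) > "new.$master"   # Assumes $master is a plain file name, not a path
    rm -fr "$tmp"
    trap 0
done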

If you're working in a hostile environment, use something less predictable than xyz.$$ for the directory name. However, this sort of repackaging is usually not done in a hostile environment, and a name based on the process ID is sufficient to give everything a unique name. The use of tar -f - for input and output allows you to switch directories but still handle relative pathnames on the command line. There are likely other ways to handle that if you want. I also used cat to feed the input to the sub-shell so that the top-to-bottom flow is clear; technically, I could improve things by using ) > new.$master < $master at the end, but that hides some crucial information multiple lines later.
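If a less predictable directory name is wanted, mktemp -d is one option. It is not specified by POSIX, but it is available on GNU and BSD systems. The sketch below is an assumption about how it could be slotted in; the helper name make_private_tmpdir and the repack prefix are made up for illustration.

```shell
# Hypothetical helper: create a private directory with an unpredictable name
# and print its path. mktemp -d creates the directory itself, so no separate
# mkdir is needed.
make_private_tmpdir() {
    mktemp -d "${TMPDIR:-/tmp}/repack.XXXXXX"
}

tmp=$(make_private_tmpdir) || exit 1
echo "working in $tmp"
# ... extract and repackage inside "$tmp" here ...
rm -fr "$tmp"
```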

The trap commands make sure that (a) if the script exits early or is interrupted (signals HUP, INT, QUIT, PIPE or TERM), the temporary directory is removed and the exit status is 1 (not success) and (b) once the temporary directory has been removed normally, trap 0 cancels the exit trap so the process can exit with a zero status.
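The behaviour of that trap pair can be tried out in isolation. The following is a sketch only: run_step is a hypothetical wrapper, the directory name carries a .demo suffix to keep it apart from the real script, and a failing command stands in for an interruption, since sending real signals from a demonstration is awkward.

```shell
# Hypothetical wrapper: run a command with a temporary directory guarded by
# the same trap arrangement. The sub-shell keeps the traps away from the caller.
run_step() {
    (
        tmp="$(pwd)/xyz.$$.demo"
        mkdir "$tmp" || exit 1
        trap 'rm -fr "$tmp"; exit 1' 0 1 2 3 13 15
        "$@" || exit        # a failing step leaves through the exit trap
        rm -fr "$tmp"
        trap 0              # cancel the exit trap so success reports status 0
    )
}

run_step true  && echo "success path: exit status 0"
run_step false || echo "failure path: non-zero exit status"
```

In both cases the temporary directory is gone afterwards; only the exit status differs.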

You might need to check whether new.$master exists before overwriting it. You might need to check that the extract operation actually extracted stuff. You might need to check whether the sub-tarball processing actually worked. If the master tarball extracts into multiple sub-directories, you need to convert the 'cd *' line into some loop that iterates over the sub-directories it creates.
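Those checks could look something like the following. All three helper names (refuse_overwrite, extracted_something, for_each_subdir) are hypothetical and are not part of the scripts in this answer; treat this as one possible shape rather than a finished implementation.

```shell
# Hypothetical helper: fail if the output file already exists.
refuse_overwrite() {
    if [ -e "$1" ]
    then
        echo "$1 already exists; not overwriting" >&2
        return 1
    fi
}

# Hypothetical helper: fail if the directory is missing or has no entries.
extracted_something() {
    [ -d "$1" ] && [ -n "$(ls -A "$1")" ]
}

# Hypothetical helper: run a command once inside each sub-directory of $1,
# as a replacement for the single 'cd *'.
for_each_subdir() {
    top=$1
    shift
    for dir in "$top"/*/
    do
        [ -d "$dir" ] || continue
        ( cd "$dir" && "$@" ) || return 1
    done
}
```

For example, refuse_overwrite "new.$master" || exit 1 could go at the top of the loop, and for_each_subdir "$tmp" sh -c 'process_tarballs *' could replace the cd, process and cd .. sequence when the master tarball extracts into several sub-directories.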

All these issues can be skipped if you know enough about the contents and nothing goes wrong.

process_tarballs

The second script is process_tarballs; it processes each of the tarballs on its command line in turn, extracting the file, making the substitutions, repackaging the result, etc. One advantage of using two scripts is that you can test the tarball processing separately from the bigger task of dealing with a tarball containing multiple tarballs. Again, life will be much easier if each of the sub-tarballs extracts into its own sub-directory; if any of them extracts into the current directory, make sure you create a new sub-directory for it.

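for tarball in "$@"
do
    # Extract $tarball into its sub-directory
    tar -xf "$tarball" || exit 1
    # Locate the sub-directory. Assumption: every member lives under a single
    # top-level directory, so the first path component of the first entry names it.
    subdirectory=$(tar -tf "$tarball" | sed -e 's|^\./||' -e 's|/.*||' -e q)
    [ -n "$subdirectory" ] || exit 1
    (
    cd "$subdirectory" || exit 1
    # sed -i in this form is GNU syntax; BSD/macOS sed needs -i ''
    find . -type f -print0 | xargs -0 sed -i 's/name/alternative-name/g'
    ) || exit 1
    mv "$tarball" "old.$tarball"
    tar -cf "$tarball" "$subdirectory" || exit 1
    # Remove the extracted tree too, so it does not end up in the master tarball
    rm -fr "old.$tarball" "$subdirectory"
done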

You should add traps to clean up here, too, so the script can be run in isolation from the master script above and still not leave any intermediate directories around. In the context of the outer script, you might not need to be so careful to preserve the old tarball before the new one is created (so rm -f $tarball instead of the move and remove commands), but treated in its own right, the script should be careful not to damage anything.
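To exercise the per-tarball steps on their own, a throwaway fixture helps. The sketch below builds one and runs the extract, substitute and repackage sequence on it. The names pkg-1.0, sample.tar and readme.txt are invented for the demonstration, file names are assumed to contain no newlines, and a scratch file is used in place of sed -i so that it does not depend on GNU sed.

```shell
demo=$(mktemp -d "${TMPDIR:-/tmp}/ptdemo.XXXXXX") || exit 1
(
    cd "$demo" || exit 1
    # Build a fixture: one tarball that extracts into its own sub-directory.
    mkdir pkg-1.0
    echo "the name is here" > pkg-1.0/readme.txt
    tar -cf sample.tar pkg-1.0
    rm -fr pkg-1.0

    # The per-tarball steps: extract, substitute, repackage, tidy up.
    tar -xf sample.tar
    find pkg-1.0 -type f | while IFS= read -r file
    do
        # Portable stand-in for sed -i: edit via a scratch file outside the tree.
        sed 's/name/alternative-name/g' "$file" > scratch.tmp &&
        cat scratch.tmp > "$file"
    done
    rm -f scratch.tmp
    tar -cf sample.tar pkg-1.0
    rm -fr pkg-1.0
) || exit 1

# Read the edited member back out of the rebuilt tarball and show it.
result=$(tar -xOf "$demo/sample.tar" pkg-1.0/readme.txt)
echo "$result"
rm -fr "$demo"
```

Once this works on a fixture, the same check can be pointed at one real sub-tarball before running the whole master repackaging.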

Summary

  • What you're attempting is not trivial.
  • For debuggability, split the job into two scripts that can be tested independently.
  • Handling the corner cases is much easier when you know what is really in the files.