Linux: compute a single hash for a given folder & contents?

Posted 2019-01-30 04:01

Surely there must be a way to do this easily!

I've tried the Linux command-line apps such as sha1sum and md5sum, but they seem only to be able to compute hashes of individual files, outputting a list of hash values, one for each file.

I need to generate a single hash for the entire contents of a folder (not just the filenames).

I'd like to do something like

sha1sum /folder/of/stuff > singlehashvalue

Edit: to clarify, my files are at multiple levels in a directory tree; they're not all sitting in the same root folder.

Tags: linux bash hash

14 Answers
家丑人穷心不美
#2 · 2019-01-30 04:21

Here's a simple, short variant in Python 3 that works fine for small files (e.g. a source tree, where every file individually fits into RAM easily), ignores empty directories, and builds on the ideas from the other solutions:

import os, hashlib

def hash_for_directory(path, hashfunc=hashlib.sha1):
    # 1. Collect all file paths recursively, sorted for a deterministic order
    filenames = sorted(os.path.join(dp, fn)
                       for dp, _, fns in os.walk(path) for fn in fns)
    # 2. Build one "relative/path=hexdigest" line per file (reads each file whole)
    index = '\n'.join('{}={}'.format(os.path.relpath(fn, path),
                                     hashfunc(open(fn, 'rb').read()).hexdigest())
                      for fn in filenames)
    # 3. Hash the textual index itself to get a single digest
    return hashfunc(index.encode('utf-8')).hexdigest()

It works like this:

  1. Find all files in the directory recursively and sort them by name
  2. Calculate the hash (default: SHA-1) of every file (reads whole file into memory)
  3. Make a textual index with "filename=hash" lines
  4. Encode that index back into a UTF-8 byte string and hash that
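For instance, an empty file named README (a hypothetical name) would contribute the index line

README=da39a3ee5e6b4b0d3255bfef95601890afd80709

since that hex string is the SHA-1 of zero bytes.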

You can pass in a different hash function as second parameter if SHA-1 is not your cup of tea.
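Example usage, reusing the asker's placeholder path:

print(hash_for_directory('/folder/of/stuff'))                  # SHA-1 by default
print(hash_for_directory('/folder/of/stuff', hashlib.sha256))  # any hashlib constructor works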

做自己的国王
#3 · 2019-01-30 04:23

You can do:

tar -c /path/to/folder | sha1sum

Note that the tar stream also encodes metadata (timestamps, ownership, permissions), so two trees with identical file contents can still hash differently if that metadata differs.
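If you need the tar-based hash to be reproducible, GNU tar (1.28 or newer) can pin the member order and metadata; this is a sketch under that assumption, not part of the original answer:

tar --sort=name --mtime='UTC 1970-01-01' --owner=0 --group=0 --numeric-owner -cf - /path/to/folder | sha1sum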

倾城 Initia
#4 · 2019-01-30 04:26

One possible way would be:

sha1sum path/to/folder/* | sha1sum

If there is a whole directory tree, you're probably better off using find and xargs. One possible command would be

find path/to/folder -type f -print0 | xargs -0 sha1sum | sha1sum

Edit: Good point, it's probably a good thing to sort the list of files, so:

find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum

And, finally, if you also need to take account of permissions and empty directories:

(find path/to/folder -type f -print0  | sort -z | xargs -0 sha1sum;
 find path/to/folder \( -type f -o -type d \) -print0 | sort -z | \
   xargs -0 stat -c '%n %a') \
| sha1sum

The arguments to stat cause it to print the name of each file, followed by its octal permissions. The two finds run one after the other, traversing the directory tree twice (doubling the directory I/O): the first finds all file names and checksums their contents, the second finds all file and directory names and prints each name and mode. The list of "file names and checksums", followed by "names and directories, with permissions", is then itself checksummed, yielding a single checksum.
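One further caveat: sort order depends on the current locale, so for a hash that is reproducible across machines it is safer to pin the collation. A variant of the sorted command above, assuming only that sort honors LC_ALL:

find path/to/folder -type f -print0 | LC_ALL=C sort -z | xargs -0 sha1sum | sha1sum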

爷、活的狠高调
#5 · 2019-01-30 04:29

A robust and clean approach

  • First things first: don't hog the available memory! Hash a file in chunks rather than feeding it the entire file (see the sketch after this list).
  • Different approaches for different needs/purposes (use all of the below, or pick whatever applies):
    • Hash only the entry name of all entries in the directory tree
    • Hash the file contents of all entries (leaving out metadata such as inode number, ctime, atime, mtime, size, and so on)
    • For a symbolic link, its content is the referent name; hash it, or choose to skip it
    • Follow the symlink or not (use the resolved name) while hashing the contents of the entry
    • If it's a directory, its contents are just directory entries. While traversing recursively they will be hashed eventually, but should the directory entry names of that level be hashed to tag this directory? This is helpful in use cases where the hash is required to identify a change quickly without having to traverse deeply to hash the contents. An example would be a file whose name changes while the rest of the contents remain the same, and they are all fairly large files
    • Handle large files well (again, mind the RAM)
    • Handle very deep directory trees (mind the open file descriptors)
    • Handle non-standard file names
    • How should files that are sockets, pipes/FIFOs, block devices, or char devices be treated? Should they be hashed as well?
    • Don't update the access time of any entry while traversing, because that would be a side effect, and counter-productive (counter-intuitive?) for certain use cases
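A minimal sketch in Python of the chunked-hashing point above (the function name and chunk size are illustrative, and this is not dtreetrawl's code):

import hashlib

def hash_file_chunked(path, hashfunc=hashlib.md5, chunk_size=1 << 16):
    # Read fixed-size chunks so even huge files never hog RAM
    h = hashfunc()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()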

This is what I have off the top of my head; anyone who has spent some time working on this in practice will have caught other gotchas and corner cases.

Here's a tool, dtreetrawl, that is very light on memory and addresses most of these cases. It might be a bit rough around the edges, but it has been quite helpful.

An example usage and output of dtreetrawl.

Usage:
  dtreetrawl [OPTION...] "/trawl/me" [path2,...]

Help Options:
  -h, --help                Show help options

Application Options:
  -t, --terse               Produce a terse output; parsable.
  -j, --json                Output as JSON
  -d, --delim=:             Character or string delimiter/separator for terse output(default ':')
  -l, --max-level=N         Do not traverse tree beyond N level(s)
  --hash                    Enable hashing(default is MD5).
  -c, --checksum=md5        Valid hashing algorithms: md5, sha1, sha256, sha512.
  -R, --only-root-hash      Output only the root hash. Blank line if --hash is not set
  -N, --no-name-hash        Exclude path name while calculating the root checksum
  -F, --no-content-hash     Do not hash the contents of the file
  -s, --hash-symlink        Include symbolic links' referent name while calculating the root checksum
  -e, --hash-dirent         Include hash of directory entries while calculating root checksum
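
For example, to print only the root hash of a tree using SHA-1 (composing the flags listed above):

dtreetrawl --hash --checksum=sha1 --only-root-hash /trawl/me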

A snippet of the human-friendly output:

...
... //clipped
...
/home/lab/linux-4.14-rc8/CREDITS
        Base name                    : CREDITS
        Level                        : 1
        Type                         : regular file
        Referent name                :
        File size                    : 98443 bytes
        I-node number                : 290850
        No. directory entries        : 0
        Permission (octal)           : 0644
        Link count                   : 1
        Ownership                    : UID=0, GID=0
        Preferred I/O block size     : 4096 bytes
        Blocks allocated             : 200
        Last status change           : Tue, 21 Nov 17 21:28:18 +0530
        Last file access             : Thu, 28 Dec 17 00:53:27 +0530
        Last file modification       : Tue, 21 Nov 17 21:28:18 +0530
        Hash                         : 9f0312d130016d103aa5fc9d16a2437e

Stats for /home/lab/linux-4.14-rc8:
        Elapsed time     : 1.305767 s
        Start time       : Sun, 07 Jan 18 03:42:39 +0530
        Root hash        : 434e93111ad6f9335bb4954bc8f4eca4
        Hash type        : md5
        Depth            : 8
        Total,
                size           : 66850916 bytes
                entries        : 12484
                directories    : 763
                regular files  : 11715
                symlinks       : 6
                block devices  : 0
                char devices   : 0
                sockets        : 0
                FIFOs/pipes    : 0
欢心
#6 · 2019-01-30 04:29

I had to check a whole directory for file changes, but excluding timestamps and directory ownership.

The goal is to get a sum that is identical anywhere, if the files are identical, including when they are hosted on other machines: identical regardless of anything but the files themselves, or a change to them.

md5sum * | md5sum | cut -d' ' -f1

This generates a list of hashes, one per file, then hashes that list into a single value. (Note that * matches only the entries of the current directory itself: no recursion, and no dotfiles.)

This is way faster than the tar method.

For a stronger hash, we can use sha512sum in the same recipe:

sha512sum * | sha512sum | cut -d' ' -f1

The hashes are likewise identical anywhere when using sha512sum, and there is no known practical way to reverse it.

神经病院院长
#7 · 2019-01-30 04:33

There is a Python script for that:

http://code.activestate.com/recipes/576973-getting-the-sha-1-or-md5-hash-of-a-directory/

If you rename a file without changing the alphabetical order of the file list, the hash script will not detect it. But if you change the order of the files or the contents of any file, running the script will give you a different hash than before.
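A minimal sketch of the behavior described (not the recipe's actual code): hashing file contents only, in sorted-name order, which is why a rename that preserves the order goes unnoticed:

import hashlib, os

def dir_content_hash(path):
    h = hashlib.sha1()
    for dirpath, dirnames, filenames in os.walk(path):
        dirnames.sort()  # make os.walk descend in a stable order
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name), 'rb') as f:
                h.update(f.read())  # contents only; names are never hashed
    return h.hexdigest()

print(dir_content_hash('/folder/of/stuff'))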
