How do I convert a directory path to a unique nume

2019-06-27 05:51发布

问题:

I am investigating ways to take a directory (folder) and derive some form of unique numerical identifier. I have investigated "string to hash" methods, however, the Pigeon Hole Principle means that one can never derive a truely unique number for every single string.

String to unique hash is no good.

I have recently been investigating other means of achieving my goal and thus have the following question to ask:

Directory time stamps - how 'unique' are they? To what resolution are the time stamps reported by 'stat' as described here (second post)? if the resolution is small enough, is it possible for more than one folder to share the exact same time stamp on a Linux system?

If anyone has other methods/techniques they'd like to share, I'd be happy to listen :)

Edit 1 To clarify my use case in response to the answers posted so far: I am working on Android platforms, so the filesystem is not linked to any other (except of course for removeable media such as Micro SD cards).

I am inserting each path into a database but trying to avoid string comparisons when querying the table. The use of maps/hashmaps is not an option here. Yes, the path itself is unique, but ideally I need a numerical identifier that can be used to query the table as opposed to the path itself. The identifier must also be unique per path. I have experimented with std::collate but found there were many collides in the hashes (a dataset of 20, 000 paths yeilds approximatley 100 collides). What was even more surprising is that the hashes appeared to be largely different each time my application is run. I wonder if it's seeded somehow?

Many thanks, P

回答1:

On any UNIX-based system, you can use the inode number as a unique identifier within that file system. Combining it with the device number will make it unique within the machine. If you wanted it to be globally unique, you could throw in the system's primary MAC address.

Keep in mind, however, that:

  1. The inode number will "follow" the directory if it is moved or renamed. It will change if the directory is deleted and replaced.

  2. The inode number will not be stable across systems, beyond one or two really special directories. (For instance, / is usually inode 2.)



回答2:

+1 duskwuff, good one!

Another way is to simply treat the dir's path as a number ("BigInt").

Take this dir for example: /opt/www/log
It's 12 chars long.
12 * 8bits = 96bits
Thus you have a 96 bits long number,which you can represent in hex/base64/anything (in case you need to pass it as an HTML link).

I'd personally go for duskwuff's approach though.



回答3:

I think it depends very much on the purpose of why you want a unique numerical identifier. Timestamps can change, inodes can change, disknumbers can change, MAC adresses can change. (Still, +1 for duskwuff)

In some scenarios you can simply make a table, where each path you add gets a new, unique number, just like a numerical key column in a database.

Although hashes can collide, in every actual environment this is absolutely unlikely (if you don't use the lousiest algorithm around...) It is much more likely that you get errors due to flaws in your implementation, for example you treat "/tmp" differently than "/tmp/", because you don't normalize the paths before hashing them. Or, you want to distinguish physical folders, but forget to check for hardlinks and symlinks to the same folder, thus you get multiple hashes / id's for the same dir.

Again, depending on your usecase, a collision is not necessarily fatal: if you find that a new path results in the same hash as an existing one (will not happen!) you can still react on that case. (*)

Just to help your imagination: If you use a 64 bit hash, you could fill 150 000 000 of 1TB hard disk drives with empty folders (nothing on them but short folder names...) and then you will have a collision for sure. If you think, that's too risky (blink, blink), use a 128 bit hash which makes it 18 446 744 073 710 000 times less likely.

Hashes are designed to make collisions unlikely, and even good old MD5 will do its job well if nobody willingly tries to produce a collision.

(*) Edit: Your pigeonhole article already points that out: a collision just mean that lookup is no longer O(1), but slightly slower. As it rarely ever happens, you can easily live with that. If you use a std::map (no hash) or std::hashmap, you'll not have to worry about collisions. See what the difference between map and hashmap in STL