Why randomize your file names for cloud storage/CD

Published 2020-06-04 09:04

Question:

Social networking sites like Twitter store profile pictures at URLs such as:

http://a1.twimg.com/profile_images/1082228637/a-smile_twitter_100.jpg

or even with a date somewhere in the path like 20110912. The only immediate benefit I can think of is preventing a bot from going through and downloading all files in your storage in a linear fashion. Am I missing any other benefits? What is the best way to go about randomizing it?

I am using Amazon S3, so I will have one subdomain serving all my static content. My plan was to store an integer ID in my database and then concatenate the base URL with the ID to form the location.

Answer 1:

One reason I cryptographically scramble identifiers in public URLs is so that the business' rate of growth is not always public.

If the current ids can be deduced simply by creating a new user account or uploading an image, then an outside person can calculate the growth rate (or an upper limit) by doing this on a regular basis and seeing how many ids were used during the elapsed time.

Whether it's stagnating or whether it's exploding exponentially, I want to be able to control the release of this information instead of letting competitors or business analysts be able to deduce it for themselves.

Offline examples of this are invoice and check numbers. If you get billed by or paid by a company on a regular basis, then you can see how many invoices or checks they write in that time period.

Here's a CPAN (Perl) module I maintain that scrambles 32-bit ids using two-way encryption based on Skipjack:

http://metacpan.org/pod/Crypt::Skip32

It's a direct translation of the Skip32 algorithm written in C by Greg Rose:

http://www.qualcomm.com.au/PublicationsDocs/skip32.c

This approach maps each 32-bit id to an effectively random 32-bit number, which can be reversed back into the original id. You don't have to save anything extra in your database.

I convert the scrambled id into 8 hex digits for displaying in URLs.
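To make the idea concrete, here is a minimal Python sketch of a reversible 32-bit scrambler. It is not Skip32 and not the Crypt::Skip32 API: it assumes a small Feistel network with HMAC-SHA256 as the round function, and the names `scramble`, `unscramble`, `to_token`, `from_token` and the round count are all hypothetical. It only illustrates the property described above: every id maps to exactly one other 32-bit value, and the mapping can be undone with the key alone.

```python
import hashlib
import hmac

ROUNDS = 4  # assumption: illustrative round count, not taken from Skip32


def _round(key: bytes, round_no: int, half: int) -> int:
    """Keyed round function: maps a 16-bit half to a 16-bit value."""
    msg = bytes([round_no]) + half.to_bytes(2, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "big")


def scramble(key: bytes, ident: int) -> int:
    """Map a 32-bit id to another 32-bit value (a keyed permutation)."""
    if not 0 <= ident < 2**32:
        raise ValueError("id must fit in 32 bits")
    left, right = ident >> 16, ident & 0xFFFF
    for i in range(ROUNDS):
        left, right = right, left ^ _round(key, i, right)
    return (left << 16) | right


def unscramble(key: bytes, scrambled: int) -> int:
    """Invert scramble() by running the rounds in reverse order."""
    if not 0 <= scrambled < 2**32:
        raise ValueError("value must fit in 32 bits")
    left, right = scrambled >> 16, scrambled & 0xFFFF
    for i in reversed(range(ROUNDS)):
        left, right = right ^ _round(key, i, left), left
    return (left << 16) | right


def to_token(key: bytes, ident: int) -> str:
    """Scrambled id as 8 hex digits, suitable for a URL path segment."""
    return format(scramble(key, ident), "08x")


def from_token(key: bytes, token: str) -> int:
    """Recover the original id from an 8-hex-digit token."""
    return unscramble(key, int(token, 16))
```

With a secret key kept on the server, `to_token(key, 1082228637)` gives the 8-character path segment and `from_token(key, segment)` returns the database id, so nothing extra has to be stored. For production use, a reviewed cipher such as the Skip32 implementations linked above is the safer choice.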

Once your ids approach 4.29 billion (the 32-bit limit), you'll need to plan for extending the URL structure to support more, but I like having shorter URLs for as long as possible.



Answer 2:

Changing URLs is a safe way to invalidate outdated assets.

It is also a necessity if you want to allow users to store private images. Using a path deducible from the user's account name/id/path would render privacy settings useless as soon as you store assets on a CDN.
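Both points can be sketched in a few lines of Python. This is only one possible approach, and the function names are hypothetical: `versioned_name` assumes you derive the file name from a hash of the content, so a changed image automatically gets a new URL, and `private_name` assumes an unguessable random token is acceptable as the name of a private image.

```python
import hashlib
import secrets


def versioned_name(content: bytes, extension: str) -> str:
    """Name derived from the content: new content yields a new URL."""
    digest = hashlib.sha256(content).hexdigest()[:16]
    return f"{digest}.{extension}"


def private_name(extension: str) -> str:
    """Random name that cannot be deduced from the account name or id."""
    token = secrets.token_urlsafe(16)  # 16 random bytes, URL-safe text
    return f"{token}.{extension}"
```

The random name has to be stored alongside the user record, since it cannot be recomputed. Note that an unguessable URL only hides the file; anyone who obtains the link can still fetch it from the CDN.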



Answer 3:

Mainly, it prevents name collisions. More than one person might upload "IMG_0001.JPG", for example. You also avoid limits on the number of files in one directory, and you can shard images across multiple servers - there's no way a huge site like Twitter or Facebook could store all photos on one server, no matter how large.
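As a rough illustration of these points, the sketch below replaces the uploaded file name with a random one and uses its leading characters as directory prefixes. The function name `storage_key` and the two-level `ab/cd/` layout are assumptions made for the example, not something any particular site is known to use.

```python
import uuid
from pathlib import PurePosixPath


def storage_key(original_filename: str) -> str:
    """Build a collision-resistant key, keeping only the file extension."""
    suffix = PurePosixPath(original_filename).suffix.lower()
    name = uuid.uuid4().hex  # 32 random hex characters
    # The first four characters become two directory levels, which
    # spreads files evenly across up to 65,536 prefixes.
    return f"{name[:2]}/{name[2:4]}/{name}{suffix}"
```

Two users uploading "IMG_0001.JPG" receive different keys, no single directory grows without bound, and the prefix can double as a shard selector when the files are split across servers.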