I'm looking for some insight into how Instagram's engineering handles uploading files to Amazon S3. I'm just starting with S3 and I think Instagram is a good model to follow because they upload thousands of images each day. My app is somewhat similar: users upload images, can delete their own images, and all images are public.
In my project I'm creating objects with a folder prefix to organize uploads for each user. e.g. username/filename
My object URLs look like this:
https://s3.amazonaws.com/my_bucket/username/28c3d2c6ec098bd077d6b9cb5f13869d.jpg
but Instagram's look like this:
http://distilleryimage7.s3.amazonaws.com/f4947c1004ca11e2a0c81231380ff428_7.jpg
I'm guessing that distilleryimage7 is the bucket name. I'm not sure what advantage this type of URL has. I'm also guessing that Instagram doesn't use "folders" (key prefixes) inside the bucket and stores all images at the top level of one bucket.
Please share any best practices in S3.
This URL format is actually supported by S3 by default. It is usually called a virtual-hosted-style URL: the bucket name is part of the hostname instead of the path. For US Standard and most other buckets, DNS resolution can be set up so that you can use either:
http://my_bucket.my_domain.com
with some changes to your own DNS records (a CNAME pointing at S3, with the bucket named after the full hostname), or:
http://my_bucket.s3.amazonaws.com
if you don't want to change any of your DNS records (a small primer: http://docs.amazonwebservices.com/AmazonS3/latest/dev/VirtualHosting.html#VirtualHostingCustomURLs).
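To make the difference between the two URL shapes concrete, here is a minimal Python sketch that builds both forms from a bucket name and an object key. The function names and the S3_ENDPOINT constant are my own for illustration, not part of any AWS SDK, and the endpoint is simply the one that appears in the URLs in the question.

```python
from urllib.parse import quote

# Assumption: the classic global endpoint seen in the URLs in the question.
S3_ENDPOINT = "s3.amazonaws.com"


def path_style_url(bucket, key, scheme="https"):
    """Bucket name appears as the first path segment."""
    # quote() keeps "/" unescaped by default, so key prefixes survive.
    return f"{scheme}://{S3_ENDPOINT}/{bucket}/{quote(key)}"


def virtual_hosted_url(bucket, key, scheme="https"):
    """Bucket name appears as a subdomain of the S3 endpoint."""
    return f"{scheme}://{bucket}.{S3_ENDPOINT}/{quote(key)}"


print(path_style_url("my_bucket", "username/28c3d2c6ec098bd077d6b9cb5f13869d.jpg"))
print(virtual_hosted_url("distilleryimage7", "f4947c1004ca11e2a0c81231380ff428_7.jpg", scheme="http"))
```

Note that a bucket used in the subdomain form needs a DNS-compatible name, so a name containing an underscore like my_bucket is better avoided there.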
The advantage of this type of URL is the common technique of serving certain assets from subdomains, so the browser can fetch them in parallel and the page loads faster.
Of course this is only a workaround. Another approach, used by sites such as Facebook, Twitter and YouTube, is to use a completely separate domain for this kind of content. This helps because it is a stripped-down domain dedicated to these assets (no cookies should exist on these domains either, so none are sent with each request).
So this isn't really an S3 best practice but more a general web development one, and it touches the much wider topic of how to build and lay out a site for a production environment.
Yes, Instagram would house all files in a single huge bucket. This is most likely the sanest way of doing it; when you get big you would replicate parts of the bucket and split them across regions depending on demand, or push them to CloudFront like Vimeo does.
Edit
After reading this further I realised that Instagram does not house everything in one bucket. That's a bit weird really, especially since a bucket name must be unique across the whole of S3, including other people's accounts. As such they probably don't use the username directly as the bucket name, unless that name hasn't already been taken.
There are huge benefits to doing this though, like replication per user and CloudFront per user. However, there are also downsides:
A lot of separate HTTP requests to different hosts when many users' images are shown. Fair enough, they all go to the S3 domain, but I am unsure how many subdomains you can use before SEO and browsers stop benefiting (I think browsers cap parallel connections at around 6 per hostname, fewer in old versions of IE).
Backup and replication can be harder, since you would need to do them per user rather than for a single bucket.
Moving buckets to a CDN etc. can be problematic, since you again have to do it per user.
I think I remember seeing a maximum number of buckets per account in S3, so I am unsure how this will scale effectively, to be honest.
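If you did want a small, fixed number of buckets rather than one per user, one way to do it is to pick the bucket deterministically from the object key. This is only a sketch of that idea, not a description of what Instagram actually does. The prefix "myimages", the shard count of 8 and the function name are all hypothetical.

```python
import hashlib


def bucket_for_key(key, prefix="myimages", shards=8):
    """Map an object key to one of a fixed set of numbered buckets.

    Hypothetical naming scheme: myimages0 .. myimages7.
    """
    # md5 is used only as a stable, evenly spread hash, not for security.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{prefix}{int(digest, 16) % shards}"


key = "username/28c3d2c6ec098bd077d6b9cb5f13869d.jpg"
bucket = bucket_for_key(key)
print(f"http://{bucket}.s3.amazonaws.com/{key}")
```

The same key always maps to the same bucket, so the URL can be recomputed without storing the bucket name. The catch is that changing the shard count later remaps existing keys, so the count has to be chosen up front or the bucket name stored alongside each image.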