Amazon S3 architecture [closed]

Posted 2019-03-15 19:15

Question:

While the post @ http://highscalability.com/amazon-architecture explains Amazon's architecture in general, I am interested in knowing how Amazon S3 is implemented.

Some of my guesses are

  1. A distributed file system like HDFS http://hadoop.apache.org/core/docs/current/hdfs_design.html
  2. A non relational persistent DB like CouchDB http://couchdb.apache.org/

Is it possible to implement something similar on a much smaller scale using scripting languages like Python or PHP?

Answer 1:

Amazon S3 is implemented using the architecture described in the Dynamo Paper:

http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

The paper explains consistent hashing, and how and why the guarantee is "eventual consistency".
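To make the consistent hashing idea concrete, here is a minimal Python sketch of a hash ring with virtual nodes. It is only an illustration of the general technique, not Amazon's code: the `HashRing` class, the `preference_list` method name, the choice of MD5, and the default of 64 virtual nodes per server are all assumptions made for this example.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # assumed number of ring positions per node
        self._keys = []        # sorted ring positions
        self._owners = {}      # ring position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # Map a string to a large integer position on the ring.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if pos in self._owners:
                continue  # extremely unlikely collision; keep the first owner
            self._owners[pos] = node
            bisect.insort(self._keys, pos)

    def remove(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if self._owners.get(pos) == node:
                del self._owners[pos]
                self._keys.remove(pos)

    def preference_list(self, key, n=3):
        """Return up to n distinct nodes, walking clockwise from the key."""
        if not self._keys:
            return []
        start = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        result = []
        for offset in range(len(self._keys)):
            pos = self._keys[(start + offset) % len(self._keys)]
            node = self._owners[pos]
            if node not in result:
                result.append(node)
                if len(result) == n:
                    break
        return result
```

Calling `HashRing(["node-a", "node-b", "node-c"]).preference_list("photos/cat.jpg", n=2)` returns the two nodes that would hold copies of that key. The point of the ring is that adding or removing one node only moves the keys adjacent to that node's positions, rather than reshuffling everything as `hash(key) % number_of_nodes` would.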

The conflict resolution they talk about for Dynamo is not exposed to users of S3. It is used internally in Amazon's applications, but for S3, the only conflict resolution is last write wins.
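As a rough illustration of what "last write wins" means, the sketch below picks one survivor from a set of conflicting copies by timestamp. The `Version` type, its fields, and the node-id tie-breaker are assumptions for this example; how S3 actually orders writes internally is not public.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: bytes
    timestamp: float  # assumed: time assigned when the write was accepted
    node_id: str      # assumed: tie-breaker when timestamps are equal


def last_write_wins(versions):
    """Pick a single survivor from conflicting replica versions."""
    return max(versions, key=lambda v: (v.timestamp, v.node_id))
```

Note what this throws away: if two clients overwrite the same key at nearly the same time, one write silently disappears. Dynamo's vector clocks exist to surface such conflicts to the application instead, which is the part the answer says S3 does not expose.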

Edit: Werner Vogels has said "Dynamo is not directly exposed externally as a web service; however, Dynamo and similar Amazon technologies are used to power parts of our Amazon Web Services, such as S3." http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

I would emphasize that he isn't saying S3 and Dynamo share components; he explicitly says that Dynamo itself is one of the technologies that power S3. Everything I've seen from S3, including the caveats, is accounted for by assuming S3 is a fancy web services wrapper around Dynamo with authentication, accounting, and a last-write-wins conflict resolution that is invisible to the user.

The original question was about the underlying storage mechanism for S3. It is explicitly not a distributed file system like HDFS or a non-relational database like CouchDB. Dynamo fills this role.



Answer 2:

Neither Amazon S3's architecture nor its implementation has been made public. As such, it cannot be extended to build private clouds of any size.

There are a few papers on cloud storage architecture topics. You might find them useful. Here is one: CACSS: Towards a Generic Cloud Storage Service

The method by which different technologies can be combined to provide a single excellent performance, highly scalable and reliable cloud storage system is also detailed. This research serves as a knowledge source for inexperienced cloud providers, giving them the capability of swiftly setting up their own cloud storage services



Answer 3:

It's closer to option 2, although the content is stored as opaque "BLOBs" whose contents the system does not care about, whereas CouchDB does care about document contents. Each node in the storage clusters, which keep multiple copies of every object, uses a local DB (BDB?) as its backend store. Reads can go to any node that has a copy, as can writes, but conflicting writes then need to be resolved. As Kevin mentions, this gives "eventual consistency", with no strict guarantee from an external point of view of when the copies converge or which write wins (internally, that is defined).
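For the "much smaller scale" part of the question, the behaviour described above can be sketched in a few dozen lines of Python. This is a toy model under stated assumptions, not how S3 works: `Replica` and `TinyBlobStore` are made-up names, a dict stands in for the local DB on each node, and a single in-process counter stands in for whatever ordering the real system uses.

```python
import itertools


class Replica:
    """Stands in for one storage node; a dict replaces the local database."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (version, blob)

    def put(self, key, version, blob):
        # Keep only the newest version seen so far.
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, blob)

    def get(self, key):
        return self.store.get(key)


class TinyBlobStore:
    def __init__(self, replicas):
        self.replicas = list(replicas)
        self._clock = itertools.count(1)  # assumed: one logical clock

    def write(self, key, blob, targets=None):
        """Write to the given replicas (all by default); others stay stale."""
        version = next(self._clock)
        for replica in (targets if targets is not None else self.replicas):
            replica.put(key, version, blob)
        return version

    def read(self, key, sources=None):
        """Read from the given replicas, return the newest blob, and
        push it back to any stale replica that was consulted."""
        sources = list(sources if sources is not None else self.replicas)
        found = [r.get(key) for r in sources]
        found = [f for f in found if f is not None]
        if not found:
            return None
        version, blob = max(found, key=lambda f: f[0])
        for replica in sources:
            replica.put(key, version, blob)  # read repair
        return blob
```

If a write reaches only one replica, a read that happens to consult a different replica still returns the old blob, which is the "eventual" in eventual consistency. A read that consults both the fresh and the stale replica returns the new blob and repairs the stale copy as a side effect.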

Reading the Dynamo docs is useful for understanding many of the concepts, but AFAIK the implementation is different. Dynamo is used internally by Amazon for other purposes. There are also open source implementations of both; one interesting one is Project Voldemort. CouchDB is obviously very interesting as well.