I have an S3 bucket containing quite a few S3 objects that multiple EC2 instances can pull from (when scaling horizontally). Each EC2 will pull an object one at a time, process it, and move it to another bucket.
Currently, to make sure the same object isn't processed by multiple EC2 instances, my Java app renames it with a "locked" extension added to its S3 object key. The problem is that "renaming" is actually doing a "move". So the large files in the S3 bucket can take up to several minutes to complete its "rename", resulting in the locking process being ineffective.
Does anyone have a best practice for accomplishing what I'm trying to do?
I considered using SQS, but that "solution" has its own set of problems (order not guaranteed, possibility of messages delivered more than once, and more than one EC2 getting the same message)
I'm wondering if setting a "locked" header would be a quicker "locking" process.
order not guaranteed, possibility of messages delivered more than once, and more than one EC2 getting the same message
The odds of actually getting the same message more than once is low. It's merely "possible," but not very likely. If it's essentially only an annoyance if, on isolated occasions, you should happen to process a file more than once, then SQS seems like an entirely reasonable option.
Otherwise, you'll need an external mechanism.
Setting a "locked" header on the object has a problem of its own -- when you overwrite an object with a copy of itself (that's what happens when you change the metadata -- a new copy of the object is created, with the same key) then you are subject to the slings and arrows of eventual consistency.
Q: What data consistency model does Amazon S3 employ?
Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.
https://aws.amazon.com/s3/faqs/
Updating metadata is an "overwrite PUT
." Your new header may not immediately be visible, and if two or more workers set their own unique header (e.g. x-amz-meta-locked: i-12345678) it's entirely possible for a scenario like the following to play out (W1, W2 = Worker #1 and #2):
W1: HEAD object (no lock header seen)
W2: HEAD object (no lock header seen)
W1: set header
W2: set header
W1: HEAD object (sees its own lock header)
W2: HEAD object (sees its own lock header)
The same or a similar failure can occur with several different permutations of timing.
Objects can't be effectively locked in an eventual consistency environment like this.
Object tag can assist here, as changing a tag doesn't create a new copy. Tag is kind of key/value pair associated to object. i.e. you need to use object level tagging.