Duplicate entries in High Replication Datastore

Posted 2019-04-15 22:51

We still see rare duplicate entries when this POST method is called. I previously asked for advice on Stack Overflow and was given a solution: use the parent/child (ancestor) approach so that queries are strongly consistent.

I migrated all data to that form and let it run for another 3 months. However, the problem was not solved.

The problem is right here, in this conditional: if recordsdb.count() == 1: It should be true so that the existing entry gets updated, but the HRD does not always find the latest entry, and a new entry is created instead.

As you can see, we are writing/reading from the Record via Parent/Child methodology as recommended:

new_record = FeelTrackerRecord(parent=user.key,...)

And yet, upon retrieval, the HRD still doesn't always fetch the latest entry:

recordsdb = FeelTrackerRecord.query(ancestor = user.key).filter(FeelTrackerRecord.record_date == ... )

So we are quite stuck on this and don't know how to solve it.

    @requires_auth
    def post(self, ios_sync_timestamp):
        # .get() returns None when no user matches, so the 401 branch below is reachable
        user = User.query(User.email == request.authorization.username).get()
        if user:
            json_records = request.json['records']
            for json_record in json_records:
                recordsdb = FeelTrackerRecord.query(ancestor = user.key).filter(FeelTrackerRecord.record_date == date_parser.parse(json_record['record_date']))
                if recordsdb.count() == 1:
                    rec = recordsdb.fetch(1)[0]
                    if 'timestamp' in json_record:
                        if rec.timestamp < json_record['timestamp']:
                            rec.rating = json_record['rating']
                            rec.notes = json_record['notes']
                            rec.timestamp = json_record['timestamp']
                            rec.is_deleted = json_record['is_deleted']
                            rec.put()
                elif recordsdb.count() == 0:
                    new_record = FeelTrackerRecord(parent=user.key,
                                        user=user.key, 
                                        record_date = date_parser.parse(json_record['record_date']), 
                                        rating = json_record['rating'], 
                                        notes = json_record['notes'], 
                                        timestamp = json_record['timestamp'])
                    new_record.put()
                else:
                    raise Exception('Got more than one record for the same record date - among REST post')
            user.last_sync_timestamp = create_timestamp(datetime.datetime.today())
            user.put()
            return '', 201
        else:
            return '', 401

Possible Solution:

The very last idea I have to solve this is to step away from the parent/child strategy and use user.key plus the date string as part of the key.

Saving:

new_record = FeelTrackerRecord(id=str(user.key) + json_record['record_date'], ...)
new_record.put()

Loading:

key = ndb.Key(FeelTrackerRecord, str(user.key) +  json_record['record_date'])
record = key.get()

Now I can check whether record is None: if it is, I create a new entry; otherwise I update the existing one. Hopefully the HRD then has no reason not to find the record. What do you think: is this a guaranteed solution?
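A minimal sketch of the key-building step described above, in plain Python so it runs without App Engine. The helper name record_id and the YYYY-MM-DD normalization are assumptions, not part of the original code; the sketch also assumes that records are unique per user and calendar day and that dates arrive as ISO-8601 strings. The point is only that the same user and day must always produce the same id string, whatever the incoming date format.

```python
from datetime import datetime


def record_id(user_id, record_date):
    """Build a deterministic string id from a user id and a record date.

    Hypothetical helper: the date is normalized to YYYY-MM-DD so that
    differently formatted inputs for the same day map to the same id.
    """
    # Accept either a datetime or an ISO-8601 string such as '2014-03-01T00:00:00'.
    if isinstance(record_date, str):
        record_date = datetime.fromisoformat(record_date)
    return "{}:{}".format(user_id, record_date.strftime("%Y-%m-%d"))
```

If the raw date string were concatenated without normalization, two spellings of the same day would yield two different keys, and the duplicate would come back in another form.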

1 answer
Summer. ? 凉城
#2 · 2019-04-15 23:08

The Possible Solution appears to have the same problem as the original code. Imagine the race condition if two servers execute the same instructions practically simultaneously. With Google's overprovisioning, that is sure to happen once in a while.
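The race can be made concrete with a deterministic toy simulation of the query-then-insert pattern from the original handler. This is not Datastore code: the list named store, the helper find, and the two "requests" A and B are hypothetical stand-ins. Both requests run their existence check before either one writes, which is the interleaving two servers can produce.

```python
def find(store, user, date):
    """Return all records in the toy store matching user and date."""
    return [r for r in store if r["user"] == user and r["date"] == date]


def interleaved_posts():
    """Replay two requests whose checks both run before either write."""
    store = []
    # Step 1: both requests check for an existing record; both see none.
    a_sees = find(store, "user42", "2014-03-01")
    b_sees = find(store, "user42", "2014-03-01")
    # Step 2: each request acts on its own (now stale) view and inserts.
    if len(a_sees) == 0:
        store.append({"user": "user42", "date": "2014-03-01", "rating": 3})
    if len(b_sees) == 0:
        store.append({"user": "user42", "date": "2014-03-01", "rating": 4})
    return store
```

Calling interleaved_posts() leaves two records for the same user and date, even though each request on its own followed the "insert only if missing" rule.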

A more robust solution should use transactions, with a rollback when concurrency causes a consistency violation. The User entity should be the root of its own entity group. Within a transaction, increment a record counter field in the User entity. Create the new FeelTrackerRecord only if the transaction completes successfully. For this to work, the FeelTrackerRecord entities must have the User as their parent.

Edit: In the case of your code the following lines would go before user = User.query(... :

Transaction txn = datastore.beginTransaction();
try {

and the following lines would go after user.put() :

    txn.commit();
} finally {
    if (txn.isActive()) {
        txn.rollback();
    }
}

That may overlook some flow-control nesting details; it is the concept that this answer is trying to describe.

With an active transaction, if multiple processes run concurrently (for example on multiple servers executing the same POST because of overprovisioning), the first process will succeed with its put and commit, while the second process will throw the documented ConcurrentModificationException.

Edit 2: The transaction that increments the counter (and may throw an exception) must also create the new record. That way if the exception is thrown, the new record is not created.
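Since the question's code is Python, here is a sketch of the same idea in that language. It is an in-memory stand-in, not the NDB API: the class InMemoryStore and the function simulate_concurrent_posts are hypothetical, and a single lock plays the role of the transaction, whereas the real Datastore uses optimistic concurrency and fails or retries the losing commit. In NDB itself, the equivalent shape would typically be built with the ndb.transactional decorator or Model.get_or_insert. What the sketch shows is that the existence check and the write must happen as one atomic step.

```python
import threading


class InMemoryStore:
    """Toy stand-in for a datastore; NOT the NDB API.

    A single lock plays the role of the transaction: the read and the
    write inside upsert() happen as one atomic step.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self.records = {}

    def upsert(self, key, rating, timestamp):
        with self._lock:
            existing = self.records.get(key)
            if existing is None:
                # No record for this key yet: create it.
                self.records[key] = {"rating": rating, "timestamp": timestamp}
            elif existing["timestamp"] < timestamp:
                # Only overwrite with newer data, as in the original handler.
                existing["rating"] = rating
                existing["timestamp"] = timestamp


def simulate_concurrent_posts(n_threads=8):
    """Run n_threads upserts for the same key and return the store."""
    store = InMemoryStore()
    threads = [
        threading.Thread(target=store.upsert, args=("user42:2014-03-01", i, i))
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store
```

However the threads interleave, the store ends up with exactly one record for the key, holding the data with the newest timestamp.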
