This is the table definition for GpsPosition
:
CREATE TABLE GpsPosition
(
altitudeInMeters SMALLINT NOT NULL,
dateCreated BIGINT NOT NULL,
dateRegistered BIGINT NOT NULL,
deviceId BINARY(16) NOT NULL,
emergencyId BINARY(16) NULL,
gpsFix SMALLINT NOT NULL,
heading SMALLINT NOT NULL,
horizontalUncertaintyInMeters SMALLINT NOT NULL,
id BINARY(16) NOT NULL,
latestForDevice BOOLEAN NOT NULL,
latestForUser BOOLEAN NOT NULL,
latitude DOUBLE PRECISION NOT NULL,
longitude DOUBLE PRECISION NOT NULL,
numSatellites SMALLINT NOT NULL,
speedInKmph SMALLINT NOT NULL,
stale BOOLEAN NOT NULL,
userId BINARY(16) NULL,
verticalUncertaintyInMeters SMALLINT NOT NULL,
PRIMARY KEY (id)
);
ALTER TABLE GpsPosition
ADD CONSTRAINT GpsPosition_deviceId_fkey
FOREIGN KEY (deviceId) REFERENCES Device(id)
ON UPDATE CASCADE ON DELETE CASCADE;
ALTER TABLE GpsPosition
ADD CONSTRAINT GpsPosition_emergencyId_fkey
FOREIGN KEY (emergencyId) REFERENCES Emergency(id)
ON UPDATE CASCADE ON DELETE SET NULL;
ALTER TABLE GpsPosition
ADD CONSTRAINT GpsPosition_userId_fkey
FOREIGN KEY (userId) REFERENCES User(id)
ON UPDATE CASCADE ON DELETE SET NULL;
ALTER TABLE GpsPosition
ADD CONSTRAINT deviceId_dateCreated_must_be_unique
UNIQUE (deviceId, dateCreated);
CREATE INDEX i2915035553 ON GpsPosition (deviceId);
CREATE INDEX deviceId_latestForDevice_is_non_unique ON GpsPosition (deviceId, latestForDevice);
CREATE INDEX i3210815937 ON GpsPosition (emergencyId);
CREATE INDEX i1689669068 ON GpsPosition (userId);
CREATE INDEX userId_latestForUser_is_non_unique ON GpsPosition (userId, latestForUser);
Note that userId
in GpsPosition
is a UUID that is stored as a binary(16)
.
This SQL code is executing on AWS AuroraDB engine version 5.7.12.
I would expect the queries below to return the same results, but the first one returns many results and the second returns no results. Any idea as to why?
select *
from GpsPosition
where exists (select *
from User
where id = GpsPosition.userId and
id = UNHEX( '3f4163aab2ac46d6ad15164222aca89e' )
);
select *
from GpsPosition
where userId = UNHEX( '3f4163aab2ac46d6ad15164222aca89e' );
Note that the following SQL statement returns a single row, as you would expect:
select *
from User
where id = UNHEX( '3f4163aab2ac46d6ad15164222aca89e' );
I see no semantic equivalence at all.
The one with exists
is checking to see if a row exists in another table. If no such matching row exists, then the outer query does not return anything.
That is very different from just returning a matching row in a single table.
The observation that two queries return the same results on a particular set of data does not make them semantically equivalent. They would have to be guaranteed to return the same results on any appropriate data for the query. For instance, 2 + 2 = 2 * 2
, but that doesn't make addition and multiplication "semantically equivalent."
I should also add that it is not hard to fool database optimizers, even when two expressions are guaranteed to be equivalent.
So my team has spent literally a couple of months trying to understand this issue and many other inconsistencies (like this one in this posting) we were able to reproduce on AWS Aurora DB 5.7 but unable to reproduce on MySQL 5.7 or anything else for that matter.
As a part of this effort, we engaged AWS support, which was remarkably unhelpful. They confirmed they could reproduce the inconsistencies by executing the same queries we did on the same database we did, but then said they couldn't copy that data to another database and still reproduce the issue, and this seemed to satisfy them to mark the support case as resolved. Now granted, this is a very insidious defect since it is so difficult to reproduce and so intermittent and rare, but when it is hit, it becomes reliably reproducible within the affected data set. And once you do hit this defect, well, your applications depending on the database can no longer operate correctly in those affected areas ;)
While we do not believe the defect is limited to cascade deletes, it appears that a way to "more reliably" produce this defect is to delete rows in tables that have cascade deletes. Again, this appears to produce the defect "more reliably", but even then, it is incredibly rare and difficult to produce. We could produce it by running a huge automated test suite in a tight loop however. Again, once you actually do hit this defect, the affected data will reliably reproduce inconsistencies - it's just VERY hard to hit this defect.
So what conclusions did we draw at the end of all of our analysis?
1) First and foremost, Thorsten Kettner (see his posted comment above) is correct - this is a defect in the RDBMS server itself. We don't have access to the AWS AuroraDB source code or underlying infrastructure, and so we cannot root cause this defect to something much more specific, but it is a defect possibly in the RDBMS server, possibly in the data persistence layer, and possibly somewhere else.
2) Based upon (1) above, we decided that AWS Amazon 5.7.x is not mature enough for us to use for a production application. Even though it works correctly 99.9999% of the time, that 0.0001% was causing development and production database servers to do the wrong things and return incorrect results, which is absolutely unacceptable to us. We also detected cases where integrity constraints on the tables were not reliably honored, resulting in very strange orphaned rows that should have been deleted as a part of cascade deletes in the schema definition, which again, is absolutely unacceptable to us.
3) We were unable to reproduce any of these inconsistencies on AWS MySQL 5.6, AWS MySQL 5.7, AWS AuroraDB with MySQL 5.6 compatibility, non-AWS Windows MySQL 5.6, or non-AWS MySQL 5.7. In short, we believe that whatever is going wrong is specific to AWS AuroraDB with MySQL 5.7 compatibility. We did extensive testing on AWS AuroraDB with MySQL 5.6 compatibility in particular and could not reproduce any of these inconsistency defects, so we believe at this time that AuroraDB with MySQL 5.6 compatibility is mature and suitable for production use.