Trouble optimizing MySQL query

2019-08-28 23:23发布

问题:

I'm working on the following query but not sure how to proceed with further optimizations:

SELECT u.id AS userId, firstName, profilePhotoId, preferredActivityId, preferredSubActivityId, availabilityType,
       3959 * ACOS(COS(radians(requestingUserLat)) * COS(radians(u.latitude)) * COS(radians(u.longitude) - radians(requestingUserLon)) + SIN(radians(requestingUserLat)) * SIN(radians(u.latitude))) AS distanceInMiles
  FROM users u
 WHERE u.id IN (
        SELECT uu.id
          FROM users uu
         WHERE uu.latitude      between lat1    and lat2 -- MySQL 5.7 supports Point data type, but it is not indexed in innoDB. We store latitude and longitude as DOUBLE for now
           AND uu.longitude     between lon1    and lon2
           AND uu.dateOfBirth   between maxAge  and minAge -- dates are in millis, therefore maxAge will have a smaller value than minAge and so it needs to go first
     )
   AND IF(gender       is null, TRUE, u.gender = gender)
   AND IF(activityType is null, TRUE, u.preferredActivityType = activityType)
   AND u.accountState = 'A'
   AND u.id != userId
HAVING distanceInMiles < searchRadius ORDER BY distanceInMiles LIMIT pagingStart, pagingLength;


CREATE INDEX `findMatches` ON `users` (`latitude` ASC, `longitude` ASC, `dateOfBirth` ASC) USING BTREE;


The idea here is to have an inner query that identifies qualifying rows based on user location and age, using the covering index specified above. In a table with a few million rows, that narrows them down to a few thousand without requiring a full table scan. Then the resulting rows are tested against more fine-grain conditions, such as gender, availability, etc. - this time a full scan of the reduced resultset is inevitable.

This runs "almost" as expected, EXPLAIN reveals that the inner query does indeed make use of the full key length (3 columns) of the covering index, and the outer query then looks up the rows returned, by PK.


The problem:
The performance is satisfactory while the search range is within a few hundred miles, but when I get to a thousand miles, it starts degrading because the number of users within the specified boundary increases. The problem would also become apparent if the search range remained the same, but the number of users increased by a few orders. Here are the problems that I have identified so far:

  1. MySQL does not currently support LIMIT in inner queries, therefore the inner query will return ALL qualifying userIDs (i.e., thousands), even if the outer query will then limit them to just a dozen or so.
  2. Enabling optimizer_trace and looking at the work behind the scenes suggests that only column latitude of my covering index is used as a range. I'm not sure why that is, especially since EXPLAIN suggests that the full index key length is used.


The question:
How do I solve (1) and (2) above? Before somebody suggests using spatial data types for lat and long, please note that the latest InnoDB engine (MySQL v5.7) does not support spatial indexes, just spatial data types.

回答1:

You can simplify your query to:

SELECT u.id AS userId, firstName, profilePhotoId, preferredActivityId, preferredSubActivityId, availabilityType,
       3959 * ACOS(COS(radians(requestingUserLat)) * COS(radians(u.latitude)) * COS(radians(u.longitude) - radians(requestingUserLon)) + SIN(radians(requestingUserLat)) * SIN(radians(u.latitude))) AS distanceInMiles
  FROM users u
   WHERE u.latitude between lat1 and lat2
    AND u.longitude between lon1 and lon2
    AND u.dateOfBirth between maxAge and minAge
    AND IF(gender is null, TRUE, u.gender = gender)
    AND IF(activityType is null, TRUE, u.preferredActivityType = activityType)
    AND u.accountState = 'A'
    AND u.id != userId
HAVING distanceInMiles < searchRadius
ORDER BY distanceInMiles
LIMIT pagingStart, pagingLength;

and then create index for all columns in where clause, you can play with order of columns in index, start with columns which has less distinct values (like gender, state)



回答2:

I think this has already been covered in other answers. There is a difference between using the data in the index for a query, and using the data in the index to seek to the right solution. The latter is the most efficient use of an index. The former is helpful, but the efficiency is simply in not reading in the data pages.

I think you can improve your query by using exists instead of in. This should allow filtering at the outer level to improve performance of the query:

SELECT u.id AS userId, firstName, profilePhotoId, preferredActivityId, preferredSubActivityId, availabilityType,
       3959 * ACOS(COS(radians(requestingUserLat)) * COS(radians(u.latitude)) * COS(radians(u.longitude) - radians(requestingUserLon)) + SIN(radians(requestingUserLat)) * SIN(radians(u.latitude))) AS distanceInMiles
FROM users u
WHERE EXISTS (SELECT 1
              FROM users uu
              WHERE uu.latitude      between lat1    and lat2  AND
                    uu.longitude     between lon1    and lon2 AND
                    uu.dateOfBirth   between maxAge  and minAge  AND
                    uu.id = u.id
             ) AND
     IF(gender       is null, TRUE, u.gender = gender) AND
     IF(activityType is null, TRUE, u.preferredActivityType = activityType) AND
     u.accountState = 'A' AND
     u.id <> userId
HAVING distanceInMiles < searchRadius
ORDER BY distanceInMiles LIMIT pagingStart, pagingLength;

As a note, the expression IF(gender is null, TRUE, u.gender = gender) is rather non-sensical because it always evaluates to true. If you have a variable called gender, it will not be used in this expression. The gender will be interpreted according to MySQL scoping rules and be a column in a table. You should always use a prefix like var_ or p_ or something to distinguish parameters from columns in the table.

EDIT:

I should have mentioned that the index needs to include id as the first column to use with exists.