-->

MySQL Query to find most similar numerical row

2020-07-26 11:03发布

问题:

In a MySQL database, I am attempting to find the most similar row across a number of numerical attributes. This problem is similar to this question but includes a flexible number of comparisons and a join table.

Database

The database consists of two tables. The first table, users, is what I'm trying to compare.

id | self_ranking
----------------------------------
1       | 9
2       | 3
3       | 2

The second table is a series of scores which the user gave to particular items.

id | user_id | item_id | score
----------------------------------
1  | 1       | 1       | 4
2  | 1       | 2       | 5
3  | 1       | 3       | 8
4  | 1       | 4       | 3

Task

I want to find the "most similar" user to a given one, valuing all the ranked items equally (along with the self score). Thus, a perfect match would be the user who has ranked all the same items in exactly the same manner & has rated himself the same, while the next most optimal choice would be one whose ranking of one item differs slightly.

I'm running into difficulty with:

  • Joining the two tables in an efficient manner
  • Dealing with the fact that not all users are ranking the same items. We only want to compare rankings of identical items.

Could someone help me construct a reasonable query? I'm not terribly strong with MySQL, so sorry if this answer should be obvious.

Output

If user 4 has ranked himself 8 and items 1=>4 and 2=>5, then I'd like to have the query for user 4's closest user to return 1, the user_id of the closest user.

回答1:

SELECT   u2.user_id

-- join our user to their scores
FROM     (users u1 JOIN scores s1 USING (user_id))

-- and then join other users and their scores
    JOIN (users u2 JOIN scores s2 USING (user_id))
      ON s1.item_id  = s2.item_id
     AND u1.user_id != u2.user_id

-- filter for our user of interest
WHERE    u1.user_id = ?

-- group other users' scores together
GROUP BY u2.user_id

-- and here's the magic: order in descending order of "distance" between
-- our selected user and all of the others: you may wish to weight
-- self_ranking differently to item scores, in which case just multiply
-- appropriately
ORDER BY SUM(ABS(s2.score - s1.score))
       + ABS(u2.self_ranking - u1.self_ranking) DESC


回答2:

In a slight refinement on @eggyal's method, I incorporated the number of items we are able to match against.

SELECT   u2.user_id

-- join our user to their scores
FROM     (users u1 JOIN scores s1 USING (user_id))

-- and then join other users and their scores
    JOIN (users u2 JOIN scores s2 USING (user_id))
      ON s1.item_id  = s2.item_id
     AND u1.user_id != u2.user_id

-- filter for our user of interest
WHERE    u1.user_id = ?

-- group other users' scores together
GROUP BY u2.user_id

-- subtract the degree of difference in correlating scores from the number of correlating scores
ORDER BY (SUM(s1.item_id = s2.item_id) - 
  ( SUM(ABS(s2.score - s1.score) + ABS(u2.self - u1.self) ) ) ) DESC