Retrieving the last record in each group - MySQL

Posted 2018-12-30 22:26

There is a table messages that contains data as shown below:

Id   Name   Other_Columns
-------------------------
1    A       A_data_1
2    A       A_data_2
3    A       A_data_3
4    B       B_data_1
5    B       B_data_2
6    C       C_data_1

If I run the query select * from messages group by name, I get this result:

1    A       A_data_1
4    B       B_data_1
6    C       C_data_1

What query will return the following result?

3    A       A_data_3
5    B       B_data_2
6    C       C_data_1

That is, the last record in each group should be returned.

At present, this is the query that I use:

SELECT
  *
FROM (SELECT
  *
FROM messages
ORDER BY id DESC) AS x
GROUP BY name

But this looks highly inefficient. Any other ways to achieve the same result?

21 answers
浅入江南
#2 · 2018-12-30 22:43
select * from messages group by name desc
临风纵饮
#3 · 2018-12-30 22:46

I haven't tested this on a large database yet, but I think it could be faster than joining tables:

SELECT *, Max(Id) FROM messages GROUP BY Name
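
One caveat with the query above: MySQL does not guarantee that the non-aggregated columns come from the row that holds MAX(Id), and with ONLY_FULL_GROUP_BY enabled the query is rejected. A minimal sketch of the usual workaround, joining back on the per-group maximum, is below. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the rows are the ones from the question, and the helper name last_per_group is made up for illustration.

```python
import sqlite3


def last_per_group(conn):
    """Return the row with the highest id for each name (hypothetical helper)."""
    return conn.execute(
        """
        SELECT m.id, m.name, m.other_columns
        FROM messages AS m
        INNER JOIN (
            SELECT MAX(id) AS id      -- one id per group
            FROM messages
            GROUP BY name
        ) AS latest ON latest.id = m.id
        ORDER BY m.id
        """
    ).fetchall()


# In-memory database standing in for the MySQL table from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, other_columns TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        (1, "A", "A_data_1"),
        (2, "A", "A_data_2"),
        (3, "A", "A_data_3"),
        (4, "B", "B_data_1"),
        (5, "B", "B_data_2"),
        (6, "C", "C_data_1"),
    ],
)

rows = last_per_group(conn)
print(rows)
```

Because every selected column is read from the joined row rather than from the grouped one, the result does not depend on which row the engine happens to pick for each group.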
与君花间醉酒
#4 · 2018-12-30 22:48

There are clearly many ways to get the same result; your question seems to be about an efficient way to get the last row in each group in MySQL. If you are working with huge amounts of data and using InnoDB, then even with the latest versions of MySQL (such as 5.7.21 and 8.0.4-rc) there might not be an efficient way of doing this.

We sometimes need to do this with tables with even more than 60 million rows.

For these examples I will use data with only about 1.5 million rows, where the queries need to find results for all groups in the data. In our actual cases we would often need to return data from about 2,000 groups (which hypothetically would not require examining very much of the data).

I will use the following tables:

CREATE TABLE temperature(
  id INT UNSIGNED NOT NULL AUTO_INCREMENT, 
  groupID INT UNSIGNED NOT NULL, 
  recordedTimestamp TIMESTAMP NOT NULL, 
  recordedValue INT NOT NULL,
  INDEX groupIndex(groupID, recordedTimestamp), 
  PRIMARY KEY (id)
);

CREATE TEMPORARY TABLE selected_group(id INT UNSIGNED NOT NULL, PRIMARY KEY(id)); 

The temperature table is populated with about 1.5 million random records across 100 different groups. The selected_group table is populated with those 100 groups (in our cases this would normally be less than 20% of all the groups).

Because this data is random, multiple rows can have the same recordedTimestamp. What we want is a list of all the selected groups, in order of groupID, with the last recordedTimestamp for each group; if a group has more than one row with that timestamp, we want the one with the highest id.

If hypothetically MySQL had a last() function which returned values from the last row in a special ORDER BY clause then we could simply do:

SELECT 
  last(t1.id) AS id, 
  t1.groupID, 
  last(t1.recordedTimestamp) AS recordedTimestamp, 
  last(t1.recordedValue) AS recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.groupID = g.id
ORDER BY t1.recordedTimestamp, t1.id
GROUP BY t1.groupID;

This would only need to examine a few hundred rows in this case, as it doesn't use any of the normal GROUP BY functions. It would execute in 0 seconds and hence be highly efficient. Note that in MySQL an ORDER BY clause normally follows the GROUP BY clause; here, however, the ORDER BY clause determines the order for the last() function. If it came after the GROUP BY, it would be ordering the groups. If no GROUP BY clause is present, the last values will be the same in all of the returned rows.

However, MySQL does not have this, so let's look at what it does have and show that none of these options is efficient.

Example 1

SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.id = (
  SELECT t2.id
  FROM temperature t2 
  WHERE t2.groupID = g.id
  ORDER BY t2.recordedTimestamp DESC, t2.id DESC
  LIMIT 1
);

This examined 3,009,254 rows and took ~0.859 seconds on 5.7.21 and slightly longer on 8.0.4-rc
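
To check that the tie-breaking in Example 1 behaves as described (latest recordedTimestamp first, then highest id), here is a small sketch that runs the same query shape on a handful of rows. It uses Python's built-in sqlite3 module rather than MySQL, so the column types are simplified, and the six sample rows are invented for illustration. It says nothing about performance on 1.5 million rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE temperature (
        id INTEGER PRIMARY KEY,
        groupID INTEGER NOT NULL,
        recordedTimestamp TEXT NOT NULL,
        recordedValue INTEGER NOT NULL
    );
    CREATE INDEX groupIndex ON temperature (groupID, recordedTimestamp);
    CREATE TABLE selected_group (id INTEGER PRIMARY KEY);
    """
)

# Invented sample data, not the benchmark data from the answer.
conn.executemany(
    "INSERT INTO temperature VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2018-01-01 00:00:00", 5),
        (2, 10, "2018-01-02 00:00:00", 7),
        (3, 10, "2018-01-02 00:00:00", 9),  # same timestamp as id 2: higher id wins
        (4, 20, "2018-01-03 00:00:00", 1),
        (5, 20, "2018-01-01 00:00:00", 2),  # higher id, but older timestamp
        (6, 30, "2018-01-05 00:00:00", 4),  # group 30 is not selected
    ],
)
conn.executemany("INSERT INTO selected_group VALUES (?)", [(10,), (20,)])

# Same shape as Example 1, with an ORDER BY added so the output is stable.
latest = conn.execute(
    """
    SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
    FROM selected_group g
    INNER JOIN temperature t1 ON t1.id = (
        SELECT t2.id
        FROM temperature t2
        WHERE t2.groupID = g.id
        ORDER BY t2.recordedTimestamp DESC, t2.id DESC
        LIMIT 1
    )
    ORDER BY t1.groupID
    """
).fetchall()

for row in latest:
    print(row)
```

Group 10 resolves to id 3 (the tie on timestamp is broken by id), and group 20 resolves to id 4 even though id 5 is higher, because the timestamp is compared first.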

Example 2

SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue 
FROM temperature t1
INNER JOIN ( 
  SELECT max(t2.id) AS id   
  FROM temperature t2
  INNER JOIN (
    SELECT t3.groupID, max(t3.recordedTimestamp) AS recordedTimestamp
    FROM selected_group g
    INNER JOIN temperature t3 ON t3.groupID = g.id
    GROUP BY t3.groupID
  ) t4 ON t4.groupID = t2.groupID AND t4.recordedTimestamp = t2.recordedTimestamp
  GROUP BY t2.groupID
) t5 ON t5.id = t1.id;

This examined 1,505,331 rows and took ~1.25 seconds on 5.7.21 and slightly longer on 8.0.4-rc

Example 3

SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue 
FROM temperature t1
WHERE t1.id IN ( 
  SELECT max(t2.id) AS id   
  FROM temperature t2
  INNER JOIN (
    SELECT t3.groupID, max(t3.recordedTimestamp) AS recordedTimestamp
    FROM selected_group g
    INNER JOIN temperature t3 ON t3.groupID = g.id
    GROUP BY t3.groupID
  ) t4 ON t4.groupID = t2.groupID AND t4.recordedTimestamp = t2.recordedTimestamp
  GROUP BY t2.groupID
)
ORDER BY t1.groupID;

This examined 3,009,685 rows and took ~1.95 seconds on 5.7.21 and slightly longer on 8.0.4-rc

Example 4

SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.id = (
  SELECT max(t2.id)
  FROM temperature t2 
  WHERE t2.groupID = g.id AND t2.recordedTimestamp = (
      SELECT max(t3.recordedTimestamp)
      FROM temperature t3 
      WHERE t3.groupID = g.id
    )
);

This examined 6,137,810 rows and took ~2.2 seconds on 5.7.21 and slightly longer on 8.0.4-rc

Example 5

SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM (
  SELECT 
    t2.id, 
    t2.groupID, 
    t2.recordedTimestamp, 
    t2.recordedValue, 
    row_number() OVER (
      PARTITION BY t2.groupID ORDER BY t2.recordedTimestamp DESC, t2.id DESC
    ) AS rowNumber
  FROM selected_group g 
  INNER JOIN temperature t2 ON t2.groupID = g.id
) t1 WHERE t1.rowNumber = 1;

This examined 6,017,808 rows and took ~4.2 seconds on 8.0.4-rc

Example 6

SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue 
FROM (
  SELECT 
    last_value(t2.id) OVER w AS id, 
    t2.groupID, 
    last_value(t2.recordedTimestamp) OVER w AS recordedTimestamp, 
    last_value(t2.recordedValue) OVER w AS recordedValue
  FROM selected_group g
  INNER JOIN temperature t2 ON t2.groupID = g.id
  WINDOW w AS (
    PARTITION BY t2.groupID 
    ORDER BY t2.recordedTimestamp, t2.id 
    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  )
) t1
GROUP BY t1.groupID;

This examined 6,017,908 rows and took ~17.5 seconds on 8.0.4-rc

Example 7

SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue 
FROM selected_group g
INNER JOIN temperature t1 ON t1.groupID = g.id
LEFT JOIN temperature t2 
  ON t2.groupID = g.id 
  AND (
    t2.recordedTimestamp > t1.recordedTimestamp 
    OR (t2.recordedTimestamp = t1.recordedTimestamp AND t2.id > t1.id)
  )
WHERE t2.id IS NULL
ORDER BY t1.groupID;

This one was taking forever so I had to kill it.

只靠听说
#5 · 2018-12-30 22:49

You can also view it here:

http://sqlfiddle.com/#!9/ef42b/9

FIRST SOLUTION

SELECT d1.ID,Name,City FROM Demo_User d1
INNER JOIN
(SELECT MAX(ID) AS ID FROM Demo_User GROUP By NAME) AS P ON (d1.ID=P.ID);

SECOND SOLUTION

SELECT * FROM (SELECT * FROM Demo_User ORDER BY ID DESC) AS T GROUP BY NAME ;
大哥的爱人
#6 · 2018-12-30 22:51

I arrived at a different solution, which is to get the IDs for the last post within each group, then select from the messages table using the result from the first query as the argument for a WHERE x IN construct:

SELECT id, name, other_columns
FROM messages
WHERE id IN (
    SELECT MAX(id)
    FROM messages
    GROUP BY name
);

I don't know how this performs compared to some of the other solutions, but it worked spectacularly for my table with 3+ million rows. (4 second execution with 1200+ results)

This should work both on MySQL and SQL Server.

看淡一切
#7 · 2018-12-30 22:52

If performance is really your concern you can introduce a new column on the table called IsLastInGroup of type BIT.

Set it to true on the rows that are last in their group, and maintain it on every row insert, update, and delete. Writes will be slower, but reads will benefit. It depends on your use case, and I recommend it only if your workload is read-heavy.

So your query will look like:

SELECT * FROM Messages WHERE IsLastInGroup = 1
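
As a rough sketch of what maintaining such a flag on insert could look like, the snippet below clears the old flag and sets the new one inside a single transaction. It uses Python's built-in sqlite3 module as a stand-in for MySQL; the column name is_last_in_group and the helper insert_message are assumptions made for illustration, and updates and deletes would need analogous handling that is not shown here.

```python
import sqlite3


def insert_message(conn, name, data):
    """Insert a row and mark it as the last one in its group (hypothetical helper)."""
    with conn:  # both statements commit together, or roll back together on error
        # Unmark the row that was previously last in this group, if any.
        conn.execute(
            "UPDATE messages SET is_last_in_group = 0 "
            "WHERE name = ? AND is_last_in_group = 1",
            (name,),
        )
        # The newly inserted row becomes the last one.
        conn.execute(
            "INSERT INTO messages (name, other_columns, is_last_in_group) "
            "VALUES (?, ?, 1)",
            (name, data),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        other_columns TEXT,
        is_last_in_group INTEGER NOT NULL DEFAULT 0
    )
    """
)

# Same rows as in the question, inserted in id order.
for name, data in [
    ("A", "A_data_1"), ("A", "A_data_2"), ("A", "A_data_3"),
    ("B", "B_data_1"), ("B", "B_data_2"), ("C", "C_data_1"),
]:
    insert_message(conn, name, data)

flagged = conn.execute(
    "SELECT id, name, other_columns FROM messages "
    "WHERE is_last_in_group = 1 ORDER BY id"
).fetchall()
print(flagged)
```

The read side is then a plain filter on the flag, which is where the approach pays back the extra write on every insert.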