There are plenty of similar questions to be found on here but I don't think that any answer the question adequately.
I'll continue from the current most popular question and use their example if that's alright.
The task in this instance is to get the latest post for each author in the database.
The example query produces unusable results as its not always the latest post that is returned.
SELECT wp_posts.* FROM wp_posts
WHERE wp_posts.post_status='publish'
AND wp_posts.post_type='post'
GROUP BY wp_posts.post_author
ORDER BY wp_posts.post_date DESC
The current accepted answer is
SELECT
wp_posts.*
FROM wp_posts
WHERE
wp_posts.post_status='publish'
AND wp_posts.post_type='post'
GROUP BY wp_posts.post_author
HAVING wp_posts.post_date = MAX(wp_posts.post_date) <- ONLY THE LAST POST FOR EACH AUTHOR
ORDER BY wp_posts.post_date DESC
Unfortunately this answer is plain and simple wrong and in many cases produces less stable results than the orginal query.
My best solution is to use a subquery of the form
SELECT wp_posts.* FROM
(
SELECT *
FROM wp_posts
ORDER BY wp_posts.post_date DESC
) AS wp_posts
WHERE wp_posts.post_status='publish'
AND wp_posts.post_type='post'
GROUP BY wp_posts.post_author
My question is a simple one then: Is there anyway to order rows before grouping without resorting to a subquery?
Edit: This question was a continuation from another question and the specifics of my situation are slightly different. You can (and should) assume that there is also a wp_posts.id that is a unique identifier for that particular post.
Using an
ORDER BY
in a subquery is not the best solution to this problem.The best solution to get the
max(post_date)
by author is to use a subquery to return the max date and then join that to your table on both thepost_author
and the max date.The solution should be:
If you have the following sample data:
The subquery is going to return the max date and author of:
Then since you are joining that back to the table, on both values you will return the full details of that post.
See SQL Fiddle with Demo.
To expand on my comments about using a subquery to accurate return this data.
MySQL does not force you to
GROUP BY
every column that you include in theSELECT
list. As a result, if you onlyGROUP BY
one column but return 10 columns in total, there is no guarantee that the other column values which belong to thepost_author
that is returned. If the column is not in aGROUP BY
MySQL chooses what value should be returned.Using the subquery with the aggregate function will guarantee that the correct author and post is returned every time.
As a side note, while MySQL allows you to use an
ORDER BY
in a subquery and allows you to apply aGROUP BY
to not every column in theSELECT
list this behavior is not allowed in other databases including SQL Server.What you are going to read is rather hacky, so don't try this at home!
In SQL in general the answer to your question is NO, but because of the relaxed mode of the
GROUP BY
(mentioned by @bluefeet), the answer is YES in MySQL.Suppose, you have a BTREE index on (post_status, post_type, post_author, post_date). How does the index look like under the hood?
(post_status='publish', post_type='post', post_author='user A', post_date='2012-12-01') (post_status='publish', post_type='post', post_author='user A', post_date='2012-12-31') (post_status='publish', post_type='post', post_author='user B', post_date='2012-10-01') (post_status='publish', post_type='post', post_author='user B', post_date='2012-12-01')
That is data is sorted by all those fields in ascending order.
When you are doing a
GROUP BY
by default it sorts data by the grouping field (post_author
, in our case; post_status, post_type are required by theWHERE
clause) and if there is a matching index, it takes data for each first record in ascending order. That is the query will fetch the following (the first post for each user):(post_status='publish', post_type='post', post_author='user A', post_date='2012-12-01') (post_status='publish', post_type='post', post_author='user B', post_date='2012-10-01')
But
GROUP BY
in MySQL allows you to specify the order explicitly. And when you requestpost_user
in descending order, it will walk through our index in the opposite order, still taking the first record for each group which is actually last.That is
will give us
(post_status='publish', post_type='post', post_author='user B', post_date='2012-12-01') (post_status='publish', post_type='post', post_author='user A', post_date='2012-12-31')
Now, when you order the results of the grouping by post_date, you get the data you wanted.
NB:
This is not what I would recommend for this particular query. In this case, I would use a slightly modified version of what @bluefeet suggests. But this technique might be very useful. Take a look at my answer here: Retrieving the last record in each group
Pitfalls: The disadvantages of the approach is that
The advantage is performance in hard cases. In this case, the performance of the query should be the same as in @bluefeet's query, because of amount of data involved in sorting (all data is loaded into a temporary table and then sorted; btw, his query requires the
(post_status, post_type, post_author, post_date)
index as well).What I would suggest:
As I said, those queries make MySQL waste time sorting potentially huge amounts of data in a temporary table. In case you need paging (that is LIMIT is involved) most of the data is even thrown off. What I would do is minimize the amount of sorted data: that is sort and limit a minimum of data in the subquery and then join back to the whole table.
The same query using the approach described above:
All those queries with their execution plans on SQLFiddle.
No. It makes no sense to order the records before grouping, since grouping is going to mutate the result set. The subquery way is the preferred way. If this is going too slow you would have to change your table design, for example by storing the id of of the last post for each author in a seperate table, or introduce a boolean column indicating for each author which of his post is the last one.
First, don't use * in select, affects their performance and hinder the use of the group by and order by. Try this query:
When you don't specifies the table in ORDER BY, just the alias, they will order the result of the select.
** Sub queries may have a bad impact on performance when used with large datasets **
Original query
Modified query
becasue i'm using
max
in theselect clause
==>max(p.post_date)
it is possible to avoid sub select queries and order by the max column after the group by.Your solution makes use of an extension to GROUP BY clause that permits to group by some fields (in this case, just
post_author
):and select nonaggregated columns:
that are not listed in the group by clause, or that are not used in an aggregate function (MIN, MAX, COUNT, etc.).
Correct use of extension to GROUP BY clause
This is useful when all values of non-aggregated columns are equal for every row.
For example, suppose you have a table
GardensFlowers
(name
of the garden,flower
that grows in the garden):and you want to extract all the flowers that grows in a garden, where multiple flowers grow. Then you have to use a subquery, for example you could use this:
If you need to extract all the flowers that are the only flowers in the garder instead, you could just change the HAVING condition to
HAVING COUNT(DISTINCT flower)=1
, but MySql also allows you to use this:no subquery, not standard SQL, but simpler.
Incorrect use of extension to GROUP BY clause
But what happens if you SELECT non-aggregated columns that are non equal for every row? Which is the value that MySql chooses for that column?
It looks like MySql always chooses the FIRST value it encounters.
To make sure that the first value it encounters is exactly the value you want, you need to apply a
GROUP BY
to an ordered query, hence the need to use a subquery. You can't do it otherwise.Given the assumption that MySql always chooses the first row it encounters, you are correcly sorting the rows before the GROUP BY. But unfortunately, if you read the documentation carefully, you'll notice that this assumption is not true.
When selecting non-aggregated columns that are not always the same, MySql is free to choose any value, so the resulting value that it actually shows is indeterminate.
I see that this trick to get the first value of a non-aggregated column is used a lot, and it usually/almost always works, I use it as well sometimes (at my own risk). But since it's not documented, you can't rely on this behaviour.
This link (thanks ypercube!) GROUP BY trick has been optimized away shows a situation in which the same query returns different results between MySql and MariaDB, probably because of a different optimization engine.
So, if this trick works, it's just a matter of luck.
The accepted answer on the other question looks wrong to me:
wp_posts.post_date
is a non-aggregated column, and its value will be officially undetermined, but it will likely be the firstpost_date
encountered. But since the GROUP BY trick is applied to an unordered table, it is not sure which is the firstpost_date
encountered.It will probably returns posts that are the only posts of a single author, but even this is not always certain.
A possible solution
I think that this could be a possible solution:
On the inner query I'm returning the maximum post date for every author. I'm then taking into consideration the fact that the same author could theorically have two posts at the same time, so I'm getting only the maximum ID. And then I'm returning all rows that have those maximum IDs. It could be made faster using joins instead of IN clause.
(If you're sure that
ID
is only increasing, and ifID1 > ID2
also means thatpost_date1 > post_date2
, then the query could be made much more simple, but I'm not sure if this is the case).