I'm trying to add features to a preexisting application and I came across a MySQL view something like this:
SELECT
AVG(table_name.col1),
AVG(table_name.col2),
AVG(table_name.col3),
table_name.personID,
table_name.col4
FROM table_name
GROUP BY table_name.personID;
OK so there's a few aggregate functions. You can select personID because you're grouping by it. But it also is selecting a column that is not in an aggregate function and is not a part of the GROUP BY clause. How is this possible??? Does it just pick a random value because the values definitely aren't unique per group?
Where I come from (MSSQL Server), that's an error. Can someone explain this behavior to me and why it's allowed in MySQL?
Let's say you have a query like this:
In this case, for each possible value for
g
, mysql picks one of the corresponding values ofv
.However, which one is chosen, depends on some circumstances.
I read somewhere that for each group of g, the first value of
v
is kept, in the order how the records were inserted into the tablet
.This is quite ugly because the records in a table should be treated as a set where the order of the elements should not matter. This is so "mysql-ish"...
If you want to determine which value for
v
to keep, you need to apply a subselect fort
like this:This way you define which order the records of the subquery are processed by the external query, thus you can trust which value of
v
it will pick for the individual values ofg
.However, if you need some WHERE conditions then be very careful. If you add the WHERE condition to the subquery then it will keep the behaviour, it will always return the value you expect:
This is what you expect, the subselect filters and orders the table. It keeps the records where
g
has the given value and the external query returns thatg
and the first value forv
.However, if you add the same WHERE condition to the outer query then you get a non-deterministic result:
Surprisingly, you may get different values for
v
when executing the same query again and again which is... strange. The expected behaviour is to get all the records in the appropriate order from the subquery, filtering them in the outer query and then picking the same as it picked in the previous example. But it does not.It picks a value for
v
seemingly randomly. The same query returned different values forv
if I executed more (~20) times but the distribution was not uniform.If instead of adding an outer WHERE, you specify a HAVING condition like this:
Then you get a consistent behaviour again.
CONCLUSION: I would suggest not to rely on this technique at all. If you really want/need to then avoid WHERE conditions in the outer query. Use it in the inner query if you can or a HAVING clause in the outer query.
I tested it with this data:
in mysql 5.6.41.
Maybe it is just a bug that gets/got fixed in newer versions, please give feedback if you have experience with newer versions.
I should have Googled for just a bit longer... It seems I found my answer.
Still, that just seems... wrong.
It's true that this feature permits some ambiguous queries, and silently returns a result set with an arbitrary value picked from that column. In practice, it tends to be the value from the row within the group that is physically stored first.
These queries aren't ambiguous if you only choose columns that are functionally dependent on the column(s) in the GROUP BY criteria. In other words, if there can be only one distinct value of the "ambiguous" column per value that defines the group, there's no problem. This query would be illegal in Microsoft SQL Server (and ANSI SQL), even though it cannot logically result in ambiguity:
Also, MySQL has an SQL mode to make it behave per the standard:
ONLY_FULL_GROUP_BY
FWIW, SQLite also permits these ambiguous GROUP BY clauses, but it chooses the value from the last row in the group.†
† At least in the version I tested. What it means to be arbitrary is that either MySQL or SQLite could change their implementation in the future, and have some different behavior. You should therefore not rely on the behavior staying they way it is currently in ambiguous cases like this. It's better to rewrite your queries to be deterministic and not ambiguous. That's why MySQL 5.7 now enables ONLY_FULL_GROUP_BY by default.