Aggregation of an annotation in GROUP BY in Django

2020-05-13 15:42发布

UPDATE

Thanks to the posted answer, I found a much simpler way to formulate the problem. The original question can be seen in the revision history.

The problem

I am trying to translate an SQL query into Django, but am getting an error that I don't understand.

Here is the Django model I have:

class Title(models.Model):
  title_id = models.CharField(primary_key=True, max_length=12)
  title = models.CharField(max_length=80)
  publisher = models.CharField(max_length=100)
  price = models.DecimalField(decimal_places=2, blank=True, null=True)

I have the following data:

publisher                    title_id      price  title
---------------------------  ----------  -------  -----------------------------------
New Age Books                PS2106         7     Life Without Fear
New Age Books                PS2091        10.95  Is Anger the Enemy?
New Age Books                BU2075         2.99  You Can Combat    Computer Stress!
New Age Books                TC7777        14.99  Sushi, Anyone?
Binnet & Hardley             MC3021         2.99  The Gourmet Microwave
Binnet & Hardley             MC2222        19.99  Silicon Valley   Gastronomic Treats
Algodata Infosystems         PC1035        22.95  But Is It User Friendly?
Algodata Infosystems         BU1032        19.99  The Busy Executive's   Database Guide
Algodata Infosystems         PC8888        20     Secrets of Silicon Valley

Here is what I want to do: introduce an annotated field dbl_price which is twice the price, then group the resulting queryset by publisher, and for each publisher, compute the total of all dbl_price values for all titles published by that publisher.

The SQL query that does this is as follows:

SELECT SUM(dbl_price) AS total_dbl_price, publisher
FROM (
  SELECT price * 2 AS dbl_price, publisher
  FROM title
) AS A 
GROUP BY publisher

The desired output would be:

publisher                    tot_dbl_prices
---------------------------  --------------
Algodata Infosystems                 125.88
Binnet & Hardley                      45.96
New Age Books                         71.86 

Django query

The query would look like:

Title.objects
 .annotate(dbl_price=2*F('price'))
 .values('publisher')
 .annotate(tot_dbl_prices=Sum('dbl_price'))

but gives an error:

KeyError: 'dbl_price'. 

which indicates that it can't find the field dbl_price in the queryset.

The reason for the error

Here is why this error happens: the documentation says

You should also note that average_rating has been explicitly included in the list of values to be returned. This is required because of the ordering of the values() and annotate() clause.

If the values() clause precedes the annotate() clause, any annotations will be automatically added to the result set. However, if the values() clause is applied after the annotate() clause, you need to explicitly include the aggregate column.

So, the dbl_price could not be found in aggregation, because it was created by a prior annotate, but wasn't included in values().

However, I can't include it in values either, because I want to use values (followed by another annotate) as a grouping device, since

If the values() clause precedes the annotate(), the annotation will be computed using the grouping described by the values() clause.

which is the basis of how Django implements SQL GROUP BY. This means that I can't include dbl_price inside values(), because then the grouping will be based on unique combinations of both fields publisher and dbl_price, whereas I need to group by publisher only.

So, the following query, which only differs from the above in that I aggregate over model's price field rather than annotated dbl_price field, actually works:

Title.objects
 .annotate(dbl_price=2*F('price'))
 .values('publisher')
 .annotate(sum_of_prices=Count('price'))

because the price field is in the model rather than being an annotated field, and so we don't need to include it in values to keep it in the queryset.

The question

So, here we have it: I need to include annotated property into values to keep it in the queryset, but I can't do that because values is also used for grouping (which will be wrong with an extra field). The problem essentially is due to the two very different ways that values is used in Django, depending on the context (whether or not values is followed by annotate) - which is (1) value extraction (SQL plain SELECT list) and (2) grouping + aggregation over the groups (SQL GROUP BY) - and in this case these two ways seem to conflict.

My question is: is there any way to solve this problem (without things like falling back to raw sql)?

Please note: the specific example in question can be solved by moving all annotate statements after values, which was noted by several answers. However, I am more interested in solutions (or discussion) which would keep the annotate statement(s) before values(), for three reasons: 1. There are also more complex examples, where the suggested workaround would not work. 2. I can imagine situations, where the annotated queryset has been passed to another function, which actually does GROUP BY, so that the only thing we know is the set of names of annotated fields, and their types. 3. The situation seems to be pretty straightforward, and it would surprise me if this clash of two distinct uses of values() has not been noticed and discussed before.

4条回答
做自己的国王
2楼-- · 2020-05-13 16:13

This is maybe a bit too late, but I have found the solution (tested with Django 1.11.1).

The problem is, call to .values('publisher'), which is required to provide grouping, removes all annotations, that are not included in .values() fields param.

And we can't include dbl_price to fields param, because it will add another GROUP BY statement.

The solution in to make all aggregation, which require annotated fields firstly, then call .values() and include that aggregations to fields param(this won't add GROUP BY, because they are aggregations). Then we should call .annotate() with ANY expression - this will make django add GROUP BY statement to SQL query using the only non-aggregation field in query - publisher.

Title.objects
    .annotate(dbl_price=2*F('price'))
    .annotate(sum_of_prices=Sum('dbl_price'))
    .values('publisher', 'sum_of_prices')
    .annotate(titles_count=Count('id'))

The only minus with this approach - if you don't need any other aggregations except that one with annotated field - you would have to include some anyway. Without last call to .annotate() (and it should include at least one expression!), Django will not add GROUP BY to SQL query. One approach to deal with this is just to create a copy of your field:

Title.objects
    .annotate(dbl_price=2*F('price'))
    .annotate(_sum_of_prices=Sum('dbl_price')) # note the underscore!
    .values('publisher', '_sum_of_prices')
    .annotate(sum_of_prices=F('_sum_of_prices')

Also, mention, that you should be careful with QuerySet ordering. You'd better call .order_by() either without parametrs to clear ordering, or with you GROUP BY field. If the resulting query will contain ordering by any other field, the grouping will be wrong. https://docs.djangoproject.com/en/1.11/topics/db/aggregation/#interaction-with-default-ordering-or-order-by

Also, you might want to remove that fake annotation from your output, so call .values() again. So, final code looks like:

Title.objects
    .annotate(dbl_price=2*F('price'))
    .annotate(_sum_of_prices=Sum('dbl_price'))
    .values('publisher', '_sum_of_prices')
    .annotate(sum_of_prices=F('_sum_of_prices')
    .values('publisher', 'sum_of_prices')
    .order_by('publisher')
查看更多
老娘就宠你
3楼-- · 2020-05-13 16:13

Your problem comes from values() follow by annotate(). Order are important. This is explain in documentation about [order of annotate and values clauses]( https://docs.djangoproject.com/en/1.10/topics/db/aggregation/#order-of-annotate-and-values-clauses)

.values('pub_id') limit the queryset field with pub_id. So you can't annotate on income

The values() method takes optional positional arguments, *fields, which specify field names to which the SELECT should be limited.

查看更多
该账号已被封号
4楼-- · 2020-05-13 16:31

This solution by @alexandr addresses it properly.

https://stackoverflow.com/a/44915227/6323666

What you require is this:

from django.db.models import Sum

Title.objects.values('publisher').annotate(tot_dbl_prices=2*Sum('price'))

Ideally I reversed the scenario here by summing them up first and then doubling it up. You were trying to double it up then sum up. Hope this is fine.

查看更多
Emotional °昔
5楼-- · 2020-05-13 16:39

This is expected from the way group_by works in Django. All annotated fields are added in GROUP BY clause. However, I am unable to comment on why it was written this way.

You can get your query to work like this:

Title.objects
  .values('publisher')
  .annotate(total_dbl_price=Sum(2*F('price'))

which produces following SQL:

SELECT publisher, SUM((2 * price)) AS total_dbl_price
FROM title
GROUP BY publisher

which just happens to work in your case.

I understand this might not be the complete solution you were looking for, but some even complex annotations can also be accommodated in this solution by using CombinedExpressions(I hope!).

查看更多
登录 后发表回答