Rails - Distinct ON after a join

Posted 2020-02-26 13:02

Question:

I am using Rails 4.2 with PostgreSQL. I have a Product model and a Purchase model, where Product has many Purchases. I want to find the distinct recently purchased products. Initially I tried:

Product.joins(:purchases)
.select("DISTINCT products.*, purchases.updated_at") # PostgreSQL requires the ORDER BY column to be in the select list when using DISTINCT
.order("purchases.updated_at DESC")

This, however, returns duplicates, because DISTINCT applies to the whole row: every row with a distinct (products.id, purchases.updated_at) pair is kept. I just want the products with distinct ids after the join; if a product id appears multiple times in the join, only the first row should be selected. So I also tried:

Product.joins(:purchases)
.select("DISTINCT ON (products.id) purchases.updated_at, products.*")
.order("products.id, purchases.updated_at") # PostgreSQL requires the DISTINCT ON expressions to match the leftmost ORDER BY expressions

This doesn't work either: the constraint forces me to put products.id first in the ORDER BY clause, so the results come back ordered by id rather than by purchase date.

What is the Rails way to achieve this?
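To make the two failed attempts concrete, here is a plain-Ruby sketch that mimics them on a handful of made-up joined rows. The `Row` struct, the sample data and both helper methods are assumptions for illustration only; nothing here touches ActiveRecord or PostgreSQL.

```ruby
# Hypothetical result of products JOIN purchases (illustrative data only).
Row = Struct.new(:product_id, :name, :purchased_at)

JOINED = [
  Row.new(1, "pen", "2020-02-01"),
  Row.new(2, "ink", "2020-02-20"),
  Row.new(1, "pen", "2020-02-25"),
  Row.new(3, "pad", "2020-02-27"),
  Row.new(2, "ink", "2020-02-05")
].freeze

# Attempt 1: plain DISTINCT compares whole rows, so two purchases of the
# same product on different dates both survive.
def plain_distinct(rows)
  rows.uniq.sort_by(&:purchased_at).reverse
end

# Attempt 2: DISTINCT ON (product_id) with ORDER BY product_id, purchased_at.
# One row per product, but the result is ordered by id, not by recency.
def distinct_on_product(rows)
  rows.sort_by { |r| [r.product_id, r.purchased_at] }.uniq(&:product_id)
end

p plain_distinct(JOINED).map(&:product_id)      # => [3, 1, 2, 2, 1]
p distinct_on_product(JOINED).map(&:product_id) # => [1, 2, 3]
```

The order actually wanted for this data would be 3, 1, 2: one row per product, most recent purchase first.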

Answer 1:

Use a subquery and add a different ORDER BY clause in the outer SELECT:

SELECT *
FROM  (
   SELECT DISTINCT ON (pr.id)
          pu.updated_at, pr.*
   FROM   Product pr
   JOIN   Purchases pu ON pu.product_id = pr.id  -- guessing
   ORDER  BY pr.id, pu.updated_at DESC NULLS LAST
   ) sub
ORDER  BY updated_at DESC NULLS LAST;
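As a rough illustration of why the subquery needs two different orderings, the same two steps can be mimicked in plain Ruby. The `PurchaseRow` struct, the sample rows and the method names are hypothetical, and NULL handling is not modelled.

```ruby
# Hypothetical purchase rows (illustrative data only).
PurchaseRow = Struct.new(:product_id, :updated_at)

PURCHASE_ROWS = [
  PurchaseRow.new(1, "2020-02-01"),
  PurchaseRow.new(2, "2020-02-20"),
  PurchaseRow.new(1, "2020-02-25"),
  PurchaseRow.new(3, "2020-02-27"),
  PurchaseRow.new(2, "2020-02-05")
].freeze

# Inner query: ORDER BY product_id, updated_at DESC, then DISTINCT ON
# (product_id), which keeps the first (latest) row of each product.
def latest_per_product(rows)
  rows
    .sort { |a, b| [a.product_id, b.updated_at] <=> [b.product_id, a.updated_at] }
    .uniq(&:product_id)
end

# Outer query: re-sort the surviving rows by recency.
def recently_purchased_ids(rows)
  latest_per_product(rows).sort_by(&:updated_at).reverse.map(&:product_id)
end

p recently_purchased_ids(PURCHASE_ROWS) # => [3, 1, 2]
```

The inner sort decides which row survives per product; the outer sort decides the final presentation order.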

Details for DISTINCT ON:

  • Select first row in each GROUP BY group?

Or some other query technique:

  • Optimize GROUP BY query to retrieve latest record per user

But if all you need from Purchases is updated_at, you can get this cheaper with a simple aggregate in a subquery before you join:

SELECT *
FROM   Product pr
JOIN  (
   SELECT product_id, max(updated_at) AS updated_at
   FROM   Purchases 
   GROUP  BY 1
   ) pu ON pu.product_id = pr.id  -- guessing
ORDER  BY pu.updated_at DESC NULLS LAST;
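The aggregate-then-join idea can be sketched in plain Ruby as follows. `PRODUCT_NAMES`, `PURCHASE_LOG` and `recently_purchased` are made-up names for illustration, and the hash lookup stands in for the join on `product_id`.

```ruby
# Hypothetical data: product id => name, and [product_id, updated_at] pairs.
PRODUCT_NAMES = { 1 => "pen", 2 => "ink", 3 => "pad", 4 => "cap" }.freeze
PURCHASE_LOG = [
  [1, "2020-02-01"],
  [2, "2020-02-20"],
  [1, "2020-02-25"],
  [3, "2020-02-27"],
  [2, "2020-02-05"]
].freeze

def recently_purchased(products, purchases)
  # Subquery: SELECT product_id, max(updated_at) ... GROUP BY 1
  latest = purchases
    .group_by { |product_id, _| product_id }
    .transform_values { |rows| rows.map(&:last).max }

  # JOIN on product_id, then ORDER BY updated_at DESC
  latest
    .select { |product_id, _| products.key?(product_id) }
    .sort_by { |_, updated_at| updated_at }
    .reverse
    .map { |product_id, updated_at| [products[product_id], updated_at] }
end

p recently_purchased(PRODUCT_NAMES, PURCHASE_LOG)
# => [["pad", "2020-02-27"], ["pen", "2020-02-25"], ["ink", "2020-02-20"]]
```

Product 4 never appears in the result because, as with the inner join in the SQL, products without purchases are dropped.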

About NULLS LAST:

  • PostgreSQL sort by datetime asc, null first?

Or even simpler, but not as fast while retrieving all rows:

SELECT pr.*, max(pu.updated_at) AS updated_at  -- qualified, in case both tables have updated_at
FROM   Product pr
JOIN   Purchases pu ON pu.product_id = pr.id
GROUP  BY pr.id  -- must be primary key
ORDER  BY max(pu.updated_at) DESC NULLS LAST;

Product.id needs to be defined as primary key for this to work. Details:

  • PostgreSQL - GROUP BY clause
  • Return a grouped list with occurrences using Rails and PostgreSQL

If you fetch only a small selection (with a WHERE clause restricting to just one or a few pr.id for instance), this will be faster.



Answer 2:

So building on @ErwinBrandstetter's answer, I finally found the right way of doing this. The query to find distinct recent purchases is:

SELECT *
FROM  (
   SELECT DISTINCT ON (pr.id)
          pu.updated_at, pr.*
   FROM   Product pr
   JOIN   Purchases pu ON pu.product_id = pr.id
   ) sub
ORDER  BY updated_at DESC NULLS LAST;

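Note that this drops the ORDER BY inside the subquery. The outer ORDER BY only sorts the rows that survive; without an inner ORDER BY, DISTINCT ON keeps an arbitrary purchase for each product, so updated_at is not guaranteed to be the most recent one. Keep ORDER BY pr.id, pu.updated_at DESC in the subquery if you need the latest purchase per product.

The Rails way of doing this is:

inner_query = Product.joins(:purchases)
  .select("DISTINCT ON (products.id) products.*, purchases.updated_at as date") # One row per purchased product
  .order("products.id, purchases.updated_at DESC") # Keeps the latest purchase for each product

result = Product.from("(#{inner_query.to_sql}) as unique_purchases")
  .select("unique_purchases.*").order("unique_purchases.date DESC")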
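What an ORDER BY inside the DISTINCT ON subquery controls can be seen in a small plain-Ruby sketch. It models "DISTINCT ON without ORDER BY" as "keep whichever row is read first", which is an assumption for illustration; PostgreSQL only documents that the chosen row is unpredictable in that case. `JoinedRow`, `SCAN_A` and `SCAN_B` are made-up names.

```ruby
JoinedRow = Struct.new(:product_id, :updated_at)

# DISTINCT ON (product_id) with no ORDER BY, modelled as "first row read wins".
def distinct_on_without_order(rows)
  rows.uniq(&:product_id)
end

# The same two purchases of product 1, read in two different orders.
SCAN_A = [JoinedRow.new(1, "2020-02-01"), JoinedRow.new(1, "2020-02-25")].freeze
SCAN_B = SCAN_A.reverse.freeze

p distinct_on_without_order(SCAN_A).map(&:updated_at) # => ["2020-02-01"]
p distinct_on_without_order(SCAN_B).map(&:updated_at) # => ["2020-02-25"]
```

The surviving date depends on read order, so the outer sort can end up ranking a product by an old purchase.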

The second (and better) way to do this, as suggested by @ErwinBrandstetter, is:

SELECT *
FROM   Product pr
JOIN  (
   SELECT product_id, max(updated_at) AS updated_at
   FROM   Purchases 
   GROUP  BY 1
   ) pu ON pu.product_id = pr.id
ORDER  BY pu.updated_at DESC NULLS LAST;

which can be written in Rails as:

join_query = Purchase.select("product_id, max(updated_at) as date")
  .group(1) # Selects the most recent purchase date for each purchased product

result = Product.joins("INNER JOIN (#{join_query.to_sql}) as unique_purchases ON products.id = unique_purchases.product_id")
  .order("unique_purchases.date DESC") # DESC, to match the SQL above


Answer 3:

To build on erwin-brandstetter's answer, this is how you could do this with ActiveRecord (should be close at least):

Product
  .select('*')
  .joins('INNER JOIN (SELECT product_id, max(updated_at) AS updated_at FROM purchases GROUP BY 1) pu ON pu.product_id = products.id')
  .order('pu.updated_at DESC NULLS LAST')


Answer 4:

I ended up with this:

Product.joins(:purchases)
.select("DISTINCT ON (products.id) products.*, purchases.updated_at as date")
.sort_by(&:date)
.reverse

Still looking for a better way to do this.



Answer 5:

Try this:

Product.joins(:purchases)
.select("DISTINCT ON (purchases.product_id) purchases.product_id, purchases.updated_at, products.*")
.order("purchases.product_id, purchases.updated_at") # PostgreSQL requires the DISTINCT ON expressions to match the leftmost ORDER BY expressions