Select a random entry from a group after grouping

Posted 2019-07-02 12:12

I want to write a query using Postgres and PostGIS. I'm also using Rails with rgeo, rgeo-activerecord and activerecord-postgis-adapter, but the Rails stuff is rather unimportant.

The table structure:

measurement
 - int id
 - int anchor_id
 - Point groundtruth
 - data (not important for the query)

Example data:

id | anchor_id | groundtruth | data
-----------------------------------
1  | 1         | POINT(1 4)  | ...
2  | 3         | POINT(1 4)  | ...
3  | 2         | POINT(1 4)  | ...
4  | 3         | POINT(1 4)  | ...
-----------------------------------
5  | 2         | POINT(3 2)  | ...
6  | 4         | POINT(3 2)  | ...
-----------------------------------
7  | 1         | POINT(4 3)  | ...
8  | 1         | POINT(4 3)  | ...
9  | 1         | POINT(4 3)  | ...
10 | 5         | POINT(4 3)  | ...
11 | 3         | POINT(4 3)  | ...

This table is a kind of manually created view for faster lookups (with millions of rows). Otherwise we'd have to join 8 tables and it would get even slower. But that's not part of the problem.


Simple version:

Parameters:

  • Point p
  • int d

What the query should do:

1. The query looks for all groundtruth Points which have a distance < d from Point p

SQL for that is pretty easy: WHERE st_distance(groundtruth, p) < d

2. Now we have a list of groundtruth points with their anchor_ids. As you can see in the table above, it is possible to have multiple identical groundtruth-anchor_id tuples. For example: anchor_id=3 and groundtruth=POINT(1 4).

3. Next I'd like to eliminate the identical tuples by choosing one of them randomly(!). Why not simply take the first? Because the data column is different.

Choosing a random row in SQL: SELECT ... ORDER BY RANDOM() LIMIT 1

My problem with all of this: I can imagine a solution using SQL LOOPs and lots of subqueries, but there's surely a solution using GROUP BY or some other method that will make it faster.
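
For illustration, a minimal sketch of the kind of query this is after (the answers below arrive at essentially this form); the literal point and the distance value are made-up placeholders, not values from the question:

-- distance filter (step 1) plus one random pick per (anchor_id, groundtruth)
-- tuple (steps 2 and 3); POINT(1 4) and 1.0 are example values only
SELECT DISTINCT ON (anchor_id, groundtruth)
       *
FROM   measurement
WHERE  st_distance(groundtruth, 'POINT(1 4)'::geometry) < 1.0
ORDER  BY anchor_id, groundtruth, random();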

Full version:

Basically the same as above with one difference: The input parameters change:

  • lots of Points p1 ... p312456345
  • still one d

If the simple query is working, this could be done using a LOOP in SQL. But maybe there is a better (and faster) solution, because the database is really huge!


Solution

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT DISTINCT ON (anchor_id, groundtruth)
    *
FROM measurement m
WHERE EXISTS (
    SELECT 1
    FROM ps
    WHERE st_distance(m.groundtruth, ps.p) < d
)
ORDER BY anchor_id, groundtruth, random();

Thanks to Erwin Brandstetter!
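
For reference, a possible way to invoke this with a literal array in place of the p_array parameter (the WKT points and the distance 0.5 are purely illustrative):

-- the array elements are cast from WKT; in practice p_array and d would be
-- bound parameters
WITH ps AS (
    SELECT unnest(ARRAY[
        'POINT(1 0)'::geometry,
        'POINT(7 3)'::geometry
    ]) AS p
)
SELECT DISTINCT ON (anchor_id, groundtruth)
       *
FROM   measurement m
WHERE  EXISTS (
    SELECT 1
    FROM   ps
    WHERE  st_distance(m.groundtruth, ps.p) < 0.5
)
ORDER  BY anchor_id, groundtruth, random();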

2 Answers
手持菜刀,她持情操
Answered 2019-07-02 12:30

I now cracked it, but the query is pretty slow...

WITH
  ps AS (
    SELECT unnest(p_array) AS p
  ),

  gtps AS (
    SELECT DISTINCT ON(ps.p)
      ps.p, m.groundtruth
    FROM measurement m, ps
    WHERE st_distance(m.groundtruth, ps.p) < d
    ORDER BY ps.p, RANDOM()
  )

SELECT DISTINCT ON(gtps.p, gtps.groundtruth, m.anchor_id)
  m.id, m.anchor_id, gtps.groundtruth, gtps.p
FROM measurement m, gtps
ORDER BY gtps.p, gtps.groundtruth, m.anchor_id, RANDOM()

My test database contains 22,000 rows; with two input values the query takes about 700 ms. In the end there can be hundreds of input values :-/


The result now looks like this:

id  | anchor_id | groundtruth | p
-----------------------------------------
20  | 1         | POINT(0 2)  | POINT(1 0)
14  | 3         | POINT(0 2)  | POINT(1 0)
5   | 8         | POINT(0 2)  | POINT(1 0)
42  | 2         | POINT(4 1)  | POINT(2 2)
11  | 3         | POINT(4 8)  | POINT(4 8)
4   | 6         | POINT(4 8)  | POINT(4 8)
1   | 1         | POINT(6 2)  | POINT(7 3)
9   | 5         | POINT(6 2)  | POINT(7 3)
25  | 3         | POINT(6 2)  | POINT(9 1)
13  | 6         | POINT(6 2)  | POINT(9 1)
18  | 7         | POINT(6 2)  | POINT(9 1)

NEW:

SELECT
  m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
  measurement m
JOIN
  (SELECT unnest(point_array) AS p) AS ps
  ON ST_DWithin(ps.p, m.groundtruth, 0.5)
GROUP BY groundtruth, ps.p

Actual result:

p           | groundtruth | anchor_arr | id_arr
--------------------------------------------------
P1          | G1          | {1,3,2,..} | {9,8,11,..}
P1          | G2          | {4,3,5,..} | {1,8,23,..}
P1          | G3          | {6,8,9,..} | {12,7,6,..}
P2          | G1          | {6,6,2,..} | {15,2,10,..}
P2          | G4          | {7,9,1,..} | {5,4,3,..}
...         | ...         | ...        | ...

So at the moment I get:

  • every distinct inputValue-groundtruth tuple
  • for every tuple I get an array with all anchor_ids corresponding to the groundtruth part of the tuple
  • and an array of all ids corresponding to the groundtruth-anchor_id relation

Remember:

  • two input values can 'select' the same groundtruth
  • a single groundtruth value can have multiple identical anchor_ids
  • each groundtruth-anchor_id-tuple has a distinct id

So what's missing for completion?

  • I just need a random row for each ps.p
  • The two arrays belong together, i.e. the order of the items inside is important!
  • Those two arrays need to get filtered (randomly), as sketched after this list:
    • For each anchor_id in the array that appears more than once: keep a random one and delete all others. This also means removing the corresponding id from the id-array for every deleted anchor_id
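
One possible way to finish this (a sketch, not tested against the real data): deduplicate with DISTINCT ON before aggregating, so that only one random row per (p, groundtruth, anchor_id) survives. Since both ARRAY_AGG calls aggregate the same rows in the same order, the two arrays stay aligned. Column names and the distance 0.5 are taken from the query above:

SELECT p, groundtruth,
       ARRAY_AGG(anchor_id) AS anchor_arr,
       ARRAY_AGG(id)        AS id_arr
FROM (
    -- keep one random row per (p, groundtruth, anchor_id) before aggregating
    SELECT DISTINCT ON (ps.p, m.groundtruth, m.anchor_id)
           ps.p, m.groundtruth, m.anchor_id, m.id
    FROM   measurement m
    JOIN  (SELECT unnest(point_array) AS p) AS ps
           ON ST_DWithin(ps.p, m.groundtruth, 0.5)
    ORDER  BY ps.p, m.groundtruth, m.anchor_id, random()
) dedup
GROUP BY p, groundtruth;
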
Luminary・发光体
Answered 2019-07-02 12:36

To eliminate duplicates, this might be the most efficient query in PostgreSQL:

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement
WHERE  st_distance(p, groundtruth) < d

More about this query style:

As mentioned in the comments this gives you an arbitrary pick. If you need random, somewhat more expensive:

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement
WHERE  st_distance(p, groundtruth) < d
ORDER  BY anchor_id, groundtruth, random()

The second part is harder to optimize. An EXISTS semi-join will probably be the fastest choice. For a given table ps (with a point column p):

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement m
WHERE  EXISTS (
   SELECT 1
   FROM   ps
   WHERE  st_distance(ps.p, m.groundtruth) < d
   )
ORDER  BY anchor_id, groundtruth, random();

This can stop evaluating as soon as one p is close enough and it keeps the rest of the query simple.

Be sure to back that up with a matching GiST index.
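
Such an index might look like this (the index name is just a placeholder):

-- GiST index on the geometry column used in the distance condition
CREATE INDEX measurement_groundtruth_gix
    ON measurement USING gist (groundtruth);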

If you have an array as input, create a CTE with unnest() on the fly:

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT ...

Update according to comment

If you only need a single row as answer, you can simplify:

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT *
FROM   measurement m
WHERE  EXISTS (
   SELECT 1
   FROM   ps
   WHERE  st_distance(ps.p, m.groundtruth) < d
   )
LIMIT  1;

Faster with ST_DWithin()

Probably more efficient with the function ST_DWithin() (and a matching GiST index!).
To get one row (using a sub-select instead of a CTE here):

SELECT *
FROM   measurement m
JOIN  (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)
LIMIT  1;

To get one row for every point p within distance d:

SELECT DISTINCT ON (ps.p) *
FROM   measurement m
JOIN  (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)

Adding ORDER BY random() will make this query more expensive. Without random(), Postgres can just pick the first matching row from the GiST index. Else all possible matches have to be retrieved and ordered randomly.
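
For comparison, the random variant of the previous query (still a sketch using the placeholder p_array and d) would read:

-- all candidate rows per p are fetched and shuffled, hence the extra cost
SELECT DISTINCT ON (ps.p) *
FROM   measurement m
JOIN  (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)
ORDER  BY ps.p, random();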


BTW, LIMIT 1 inside EXISTS is pointless. Read the manual at the link I provided or this related question.
