Selecting all records created less than 1 second a

2019-04-13 19:06发布

问题:

I have a table:

create table purchase(
   transaction_id integer,
   account_id bigint,
   created timestamp with time zone,
   price numeric(5,2)
)

I think I have a problem where a system is sending me duplicate records, but I don't know how widespread the issue is.

I need a query to select all records created within 1 second (not necessary the same second) that has the same account_id and same price. So, for example, I would want to be able to find these two records:

+----------------+----------------+-------------------------------+-------+
| transaction_id |   account_id   |            created            | price |
+----------------+----------------+-------------------------------+-------+
|          85239 | 80012340116730 | 2014-05-07 15:46:03.361959+00 |  8.47 |
|          85240 | 80012340116730 | 2014-05-07 15:46:04.118911+00 |  8.47 |
+----------------+----------------+-------------------------------+-------+

How can I do this in a single query?

I'm using PostgreSQL 9.3.

回答1:

You need to check for existence of a row within a second in both directions.
And you need to exclude the row itself from the test:

SELECT *
FROM   purchase p
WHERE  EXISTS (
   SELECT 1
   FROM  purchase p1
   WHERE p1.created > p.created - interval '1 sec' -- "less than a second"
   AND   p1.created < p.created + interval '1 sec'
   AND   p1.account_id = p.account_id
   AND   p1.price      = p.price
   AND   p1.transaction_id <> p.transaction_id   -- assuming that's the pk
   )
ORDER BY account_id, price, created;         -- optional, for handy output

These WHERE conditions are sargable, which allows the use of an index on created:

WHERE p1.created > p.created - interval '1 sec'
AND   p1.created < p.created + interval '1 sec'

As opposed to:

p1.created - p.created < interval '1 sec'

The latter cannot use an index on created which may slow down the query with big tables. Postgres is forced to test all remaining combinations (after applying other conditions). Depending on the selectivity of the other conditions and the size of the table, this may be irrelevant or a moderate to huge performance drain.
With small to medium tables, tests have shown two sequence scans and a hash semi join for either query.

Index

The perfect index for the case would be a multicolumn index of the form:

CREATE INDEX purchase_foo_idx ON purchase (account_id, price, created)

But combinations of indexes on individual columns work rather well, too (and might have more use cases).



回答2:

I think you're looking for something like this:

select *
from purchase p1
where exists (
    select transaction_id 
    from purchase p2 
    where p2.created > p1.created
    and p2.created - p1.created < interval '1 second'
    and p2.account_id = p1.account_id
    and p2.price = p1.price)

EDIT: The query could be very heavy on big table. Consider limit it e.g. to one day:

select *
from purchase p1
where 
    p1.created::date = '2014-05-08'
    and exists (
        select transaction_id 
        from purchase p2 
        where p2.created::date = '2014-05-08'
        and p2.created > p1.created
        and p2.created - p1.created < interval '1 second'
        and p2.account_id = p1.account_id
        and p2.price = p1.price)

SQL Fiddle