I have a table:
create table purchase(
transaction_id integer,
account_id bigint,
created timestamp with time zone,
price numeric(5,2)
)
I think I have a problem where a system is sending me duplicate records, but I don't know how widespread the issue is.
I need a query to select all records created within 1 second (not necessary the same second) that has the same account_id and same price. So, for example, I would want to be able to find these two records:
+----------------+----------------+-------------------------------+-------+
| transaction_id | account_id | created | price |
+----------------+----------------+-------------------------------+-------+
| 85239 | 80012340116730 | 2014-05-07 15:46:03.361959+00 | 8.47 |
| 85240 | 80012340116730 | 2014-05-07 15:46:04.118911+00 | 8.47 |
+----------------+----------------+-------------------------------+-------+
How can I do this in a single query?
I'm using PostgreSQL 9.3.
You need to check for existence of a row within a second in both directions.
And you need to exclude the row itself from the test:
SELECT *
FROM purchase p
WHERE EXISTS (
SELECT 1
FROM purchase p1
WHERE p1.created > p.created - interval '1 sec' -- "less than a second"
AND p1.created < p.created + interval '1 sec'
AND p1.account_id = p.account_id
AND p1.price = p.price
AND p1.transaction_id <> p.transaction_id -- assuming that's the pk
)
ORDER BY account_id, price, created; -- optional, for handy output
These WHERE
conditions are sargable, which allows the use of an index on created
:
WHERE p1.created > p.created - interval '1 sec'
AND p1.created < p.created + interval '1 sec'
As opposed to:
p1.created - p.created < interval '1 sec'
The latter cannot use an index on created
which may slow down the query with big tables.
Postgres is forced to test all remaining combinations (after applying other conditions). Depending on the selectivity of the other conditions and the size of the table, this may be irrelevant or a moderate to huge performance drain.
With small to medium tables, tests have shown two sequence scans and a hash semi join for either query.
Index
The perfect index for the case would be a multicolumn index of the form:
CREATE INDEX purchase_foo_idx ON purchase (account_id, price, created)
But combinations of indexes on individual columns work rather well, too (and might have more use cases).
I think you're looking for something like this:
select *
from purchase p1
where exists (
select transaction_id
from purchase p2
where p2.created > p1.created
and p2.created - p1.created < interval '1 second'
and p2.account_id = p1.account_id
and p2.price = p1.price)
EDIT:
The query could be very heavy on big table. Consider limit it e.g. to one day:
select *
from purchase p1
where
p1.created::date = '2014-05-08'
and exists (
select transaction_id
from purchase p2
where p2.created::date = '2014-05-08'
and p2.created > p1.created
and p2.created - p1.created < interval '1 second'
and p2.account_id = p1.account_id
and p2.price = p1.price)
SQL Fiddle