I have a lot of SQL queries like this:
SELECT o.Id, o.attrib1, o.attrib2
FROM table1 o
WHERE o.Id IN (
SELECT DISTINCT Id
FROM table1
, table2
, table3
WHERE ...
)
These queries have to run on different database engines (MySql, Oracle, DB2, MS-Sql, Hypersonic), so I can only use common SQL syntax.
Here I read, that with MySql the IN
statement isn't optimized and it's really slow, so I want to switch this into a JOIN
.
I tried:
SELECT o.Id, o.attrib1, o.attrib2
FROM table1 o, table2, table3
WHERE ...
But this does not take into account the DISTINCT
keyword.
Question: How do I get rid of the duplicate rows using the JOIN
approach?
You do not need the
distinct
in the sub-query. Thein
will return one row in the outer query regardless of whether it matches one row or one hundred rows in the sub-query. So, if you want to improve the performance of the query, junking thatdistinct
would be a good start.One way of tuning
in
clauses is to rewrite them usingexists
instead. Depending on the distribution of data this may be a lot more efficient, or it may be slower. With tuning, the benchmark is king.Not knowing your business logic the precise formulation of that additional filter may be wrong.
Incidentally I notice that you have
table1
in both the outer query and the sub-query. If that is not a mistake in transcribing your actual SQL to here you may want to consider whether that makes sense. It would be better to avoid querying that table twice; usingexists
make make it easier to avoid the double hit.Though if you need to support a number of different database back ends you probably want to give each its own set of repository classes in your data layer, so you can optimize your queries for each. This also gives you the power to persist in other types of databases, or xml, or web services, or whatever should the need arise down the road.
I'm not sure to really understand what is your problem. Why don't you try this :
To write this with a JOIN you can use an inner select and join with that:
I'm not sure this will be much faster, but maybe... you can try it for yourself.
In general restricting yourself only to SQL that will work on multiple databases is not going to result in the best performance.