I have a simple question about the most efficient way to perform a particular join.
Take these three tables, real names have been changed to protect the innocent:
Table: animal
animal_id name ... ====================== 1 bunny 2 bear 3 cat 4 mouse
Table: tags
tag_id tag ================== 1 fluffy 2 brown 3 cute 4 small
Mapping Table: animal_tag
animal_id tag_id ================== 1 1 1 2 1 3 2 2 3 4 4 2
I want to find all animals that are tagged as 'fluffy', 'brown', and 'cute'. That is to say that the animal must be tagged with all three. In reality, the number of required tags can vary, but should be irrelevant for this discussion. This is the query I came up with:
SELECT * FROM animal
JOIN (
SELECT at.animal_id FROM animal_tag at
WHERE at.tag_id IN (
SELECT tg.tag_id FROM tag tg
WHERE tg.tag='fluffy' OR tg.tag='brown' OR tg.tag='cute'
)
GROUP BY at.animal_id HAVING COUNT(at.tag_id)=3
) AS jt
ON animal.animal_id=jt.animal_id
On a table with thousands 'animals' and and hundreds of 'tags', this query performs respectably ... 10s of milliseconds. However, when i look at the query plan (Apache Derby is the DB), the optimizer's estimated cost is pretty high (9945.12) and the plan pretty extensive. For a query this "simple" I usually try to get query plans with an estimated cost of single or double digits.
So my question is, is there a better way to perform this query? Seems like a simple query, but I've been stumped coming up with anything better.
First of all, a huge thanks to everyone who jumped in on this. Ultimately the answer is, as referenced by several commenters, relational division.
While I did take a course in Codd's relational data model many moons ago, the course like many, did not really cover relational division. Unwittingly, my original query is actually an application of Relational Division.
Referring to a slide 26-27 in this presentation on relational division, my query applies the technique of comparing set cardinalities. I tried some of the other methods mentioned for applying relational division but, at least in my case, the counting method provides the fastest run-time. I encourage anyone interested in this problem to read the aforementioned slide stack, as well as the article referenced on this page by Mikael Eriksson. Again, thanks to everyone.
I was wondering how bad it would be to use a relational division there. Can you please give it a run? I know this will take more, but I'm intrigued by how much :) If you can provide both the estimated cost and the time, it would be great.
Now looking for a fast query, I can't think in any faster than john's or yours. Actually john's might be a little slower than yours because he's performing unnencesary operations (remove distinct and remove count(*) from select):
This should be as fast as yours.
PS: Is there any way to remove that damned 3 without duplicating the where clause? My brain is boiling :)
You could create a temp table using DECLARE GLOBAL TEMPORARY TABLE And then do an INNER JOIN to eliminate the "WHERE IN". Working with Joins which are set based is usually far more efficient than Where statements that have to be evaluated for each row.
try this:
UPDATE
Give this a spin:
Here is another option, just for the fun of it:
I don't really expect this last option to do well... the other options avoid needing to go back to the tag table multiple times to resolve a tag name from the id... but you never know what the query optimizer will do until you try it.