I can't seem to find a relevant example out there.
I'm trying to return a subset of a table, and for each row in that subset I want to check how many children it has and return that number as part of the result set.
Parent Table Columns: PK_ID, Column1, Column2, FK1
For each FK1 in result set, select count(*) from child_table.
Final result set:
3, col1text, col2text, 1 (child)
5, col1texta, col2texta, 2 (child)
6, col1textb, col2textb, 0 (child)
9, col1textc, col2textc, 4 (child)
I'm struggling with the best way to reference a column in the result set from another query and then join them back together again. I'm using T-SQL.
I believe this is what you are trying to do:
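Something along these lines in T-SQL (a sketch only — it assumes, per the question, that child_table links back to the parent via a matching FK1 column):

```sql
SELECT p.PK_ID,
       p.Column1,
       p.Column2,
       (SELECT COUNT(*)
        FROM child_table c
        WHERE c.FK1 = p.FK1) AS child_count  -- one count per parent row
FROM parent_table p;
```

The correlated subquery runs once per parent row, so there's no need to join and re-aggregate.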
An explanation of why @cletus is wrong.
First, props on doing the research.
Second, you are doing it wrong.
Explanation:
Original query:
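That's the correlated-subquery version, roughly as follows (table and column names are assumptions — a parent table `a` and a child table `b` with a `parent_id` column):

```sql
SELECT a.id,
       (SELECT COUNT(*)
        FROM b
        WHERE b.parent_id = a.id) AS child_count
FROM a;
```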
Result:
What happens when you wrap in "select count(1)" :
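I.e., roughly this, again assuming a parent table `a` and a child table `b` with a `parent_id` column:

```sql
SELECT COUNT(1)
FROM (
    SELECT a.id,
           (SELECT COUNT(*)
            FROM b
            WHERE b.parent_id = a.id) AS child_count
    FROM a
) t;  -- the outer count doesn't depend on child_count at all
```

Since the outer `COUNT(1)` never looks at `child_count`, the optimizer is free to skip the per-row subquery entirely.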
Notice the difference?
The optimizer is smart enough to see that it doesn't need to do the subquery. So it's not that correlated subqueries are fast; it's that NOT DOING THEM is fast :-).
Unfortunately it can't do the same for a left outer join, since the number of results is not pre-determined by the first scan.
Lesson #1: The query plans tell you a hell of a lot. Poor experiment design gets you into trouble.
Lesson #1.1: If you don't need to do a join, by all means, don't.
I created a test dataset of roughly 2.7 million rows.
The left outer join -- without the wrapper -- ran in 171,757 ms on my laptop.
The correlated subquery... I'll update when it finishes; I'm at 700K ms and it's still running.
Lesson #2: When someone tells you to look at the query plan, and claims it's showing an algorithmic order of difference... look at the query plan.
Did you ever try adding an index on the parent id for MySQL? I'm pretty sure the execution times would improve vastly. I haven't tested it, but I'd say MySQL goes through all rows to determine the count, meaning it does 10-40 billion (number of rows in the table * 10,000) lookups in those 59 seconds.
Assume that SQL Server and Oracle create an index on the fly. If they do, it would be only 1 to 4 million.
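The index the comment is suggesting would be something like this (assuming a child table `b` with a `parent_id` column):

```sql
CREATE INDEX idx_b_parent_id ON b (parent_id);  -- lets each per-parent count hit the index
```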
Your queries all assume that parent and child rows are entered in sequential order. If a child of one of the first parents is entered at the end and its ID/PK is higher, the query doesn't work.
Ok, apparently, based on the upvotes for the other answer, this needs further explanation. Example (done with MySQL because I have it handy but the principle is universal to any SQL dialect):
Left join version:
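Roughly this, with assumed names (`a` for the parent table, `b` for the child table with a `parent_id` column):

```sql
SELECT a.id, COUNT(b.id) AS child_count
FROM a
JOIN b ON b.parent_id = a.id  -- a plain (inner) join: parents with no children drop out
GROUP BY a.id;
```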
Wrong. Ignores the case with no children.
Left outer join:
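Roughly this, with the same kind of assumed names (`a` for the parent table, `b` for the child table with a `parent_id` column):

```sql
SELECT a.id, COUNT(1) AS child_count
FROM a
LEFT OUTER JOIN b ON b.parent_id = a.id  -- zero-child parents still produce one NULL row
GROUP BY a.id;
```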
Wrong and the reason why is somewhat subtle.
`COUNT(1)` counts the NULL rows whereas `COUNT(b.ID)` doesn't, so a parent with no children is reported as having one. The above is therefore wrong, but swapping in `COUNT(b.ID)` makes it correct.

Correlated subquery:
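That is, roughly (assuming a parent table `a` and a child table `b` with a `parent_id` column):

```sql
SELECT a.id,
       (SELECT COUNT(*)
        FROM b
        WHERE b.parent_id = a.id) AS child_count  -- 0 for childless parents, no join needed
FROM a;
```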
Also correct.
Ok, so which to use? Plans only tell you so much. The issue of subqueries vs left-joins is an old one and there's no clear answer without benchmarking it. So we need some data:
I ran out of memory (32M) during this run, so I only ended up with 876,109 records, but hey, it will do. Later, when testing Oracle and SQL Server, I took the exact same set of data and imported it into Oracle XE and SQL Server Express 2005.
Now another poster raised the issue of my using a count wrapper around the queries. He correctly pointed out that the optimizer may not execute the subqueries in that case. MySQL doesn't seem to be that smart. Oracle is. SQL Server seems to be as well.
So I'll quote two figures for each database-query combination: the first is wrapped in `SELECT COUNT(1) FROM ( ... )`, the second is raw.

Setup (with `LIMIT 10000` in the query):

Left outer join:
(1) Virtually instantaneous (confirming the different execution path)
(2) Impressive considering it is returning all the rows, not 10,000
Just goes to show the value of a real database. Also, removing the SomeText field had a significant impact on MySQL's performance, improving it by a factor of 4-5; there wasn't much difference between using the limit of 10,000 and not having it with MySQL. Oracle had the limit only because PL/SQL Developer barfed when it hit 100M memory usage.
Correlated Subquery:
So with MySQL the correlated subquery is better by a factor of 4-5, with Oracle it's about twice as fast, and with SQL Server it's arguably only a little faster.
The point remains: the correlated subquery version is faster in all cases.
The other advantage of correlated subqueries is that they are syntactically cleaner and easier to extend. By this I mean that if you want to do a count in a bunch of other tables, each can be included as another select item cleanly and easily. For example: imagine a record of customers to invoices where those invoices were either unpaid, overdue or paid. With a subquery that is easy:
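A sketch of that, with invented names for illustration (a `customer` table and an `invoice` table with a `status` column):

```sql
SELECT c.id,
       c.name,
       (SELECT COUNT(*) FROM invoice i
        WHERE i.customer_id = c.id AND i.status = 'UNPAID')  AS unpaid,
       (SELECT COUNT(*) FROM invoice i
        WHERE i.customer_id = c.id AND i.status = 'OVERDUE') AS overdue,
       (SELECT COUNT(*) FROM invoice i
        WHERE i.customer_id = c.id AND i.status = 'PAID')    AS paid
FROM customer c;  -- each extra count is just one more select item
```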
The aggregate version is a lot uglier.
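For comparison, the join-and-aggregate equivalent, with the same invented `customer`/`invoice` names, might look like:

```sql
SELECT c.id,
       c.name,
       SUM(CASE WHEN i.status = 'UNPAID'  THEN 1 ELSE 0 END) AS unpaid,
       SUM(CASE WHEN i.status = 'OVERDUE' THEN 1 ELSE 0 END) AS overdue,
       SUM(CASE WHEN i.status = 'PAID'    THEN 1 ELSE 0 END) AS paid
FROM customer c
LEFT OUTER JOIN invoice i ON i.customer_id = c.id  -- one wide join, then pick apart with CASE
GROUP BY c.id, c.name;
```

Every new count means another `CASE` expression and more pressure on the single `GROUP BY`, which is the ugliness being referred to.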
Now I'm not saying that subqueries are always superior to aggregate joins but often enough they are that you have to test it. Depending on your data, the size of that data and your RDBMS vendor the difference can be hugely significant.