I can't seem to find a relevant example out there.
I'm trying to return a sub-set of a table, and for each row in that table, I want to check how many children it has, and return that number as part of the result set.
Parent Table Columns:
PK_ID, Column1, Column2, FK1
For each FK1 in result set, select count(*) from child_table.
Final result set
3, col1text, col2text, 1(child)
5, col1texta, col2texta, 2(child)
6, col1textb, col2textb, 0(child)
9, col1textc, col2textc, 4(child)
I'm struggling with the best way to reference a column in the result set in another query, and then join them together again. Using T-sql
Ok, apparently, based on the upvotes for the other answer, this needs further explanation. Example (done with MySQL because I have it handy but the principle is universal to any SQL dialect):
CREATE TABLE Blah (
ID INT PRIMARY KEY,
SomeText VARCHAR(30),
ParentID INT
)
INSERT INTO Blah VALUES (1, 'One', 0);
INSERT INTO Blah VALUES (2, 'Two', 0);
INSERT INTO Blah VALUES (3, 'Three', 1);
INSERT INTO Blah VALUES (4, 'Four', 1);
INSERT INTO Blah VALUES (5, 'Five', 4);
Left join version:
SELECT a.ID, a.SomeText, COUNT(1)
FROM Blah a
JOIN Blah b ON a.ID= b.ParentID
GROUP BY a.ID, a.SomeText
Wrong. Ignores the case with no children.
Left outer join:
SELECT a.ID, a.SomeText, COUNT(1)
FROM Blah a
LEFT OUTER JOIN Blah b ON a.ID= b.ParentID
GROUP BY a.ID, a.SomeText
Wrong and the reason why is somewhat subtle. COUNT(1)
counts NULL
rows whereas COUNT(b.ID)
doesn't. So the above is wrong but this is correct:
SELECT a.ID, a.SomeText, COUNT(b.ID)
FROM Blah a
LEFT OUTER JOIN Blah b ON a.ID= b.ParentID
GROUP BY a.ID, a.SomeText
Correlated subquery:
SELECT ID, SomeText, (SELECT COUNT(1) FROM Blah WHERE ParentID= a.ID) ChildCount
FROM Blah a
Also correct.
Ok, so which to use? Plans only tell you so much. The issue of subqueries vs left-joins is an old one and there's no clear answer without benchmarking it. So we need some data:
<?php
ini_set('max_execution_time', 180);
$start = microtime(true);
echo "<pre>\n";
mysql_connect('localhost', 'scratch', 'scratch');
if (mysql_error()) {
echo mysql_error();
exit();
}
mysql_select_db('scratch');
if (mysql_error()) {
echo mysql_error();
exit();
}
$count = 0;
$limit = 1000000;
$this_level = array(0);
$next_level = array();
while ($count < $limit) {
foreach ($this_level as $parent) {
$child_count = rand(0, 3);
for ($i=0; $i<$child_count; $i++) {
$count++;
query("INSERT INTO Blah (ID, SomeText, ParentID) VALUES ($count, 'Text $count', $parent)");
$next_level[] = $count;
}
}
$this_level = $next_level;
$next_level = array();
}
$stop = microtime(true);
$duration = $stop - $start;
$inserttime = $duration / $count;
echo "$count users added.\n";
echo "Program ran for $duration seconds.\n";
echo "Insert time $inserttime seconds.\n";
echo "</pre>\n";
function query($query) {
mysql_query($query);
if (mysql_error()) {
echo mysql_error();
exit();
}
}
?>
I ran out of memory (32M) during this run so only ended up with 876,109 records but hey it will do. Later, when I test Oracle and SQL Server I take the exact same set of data and import it into Oracle XE and SQL Server Express 2005.
Now another poster raised the issue of my using a count wrapper around the queries. He correctly pointed out that the optimizer may not execute the subqueries in that case. MySQL doesn't seem to be that smart. Oracle is. SQL Server seems to be as well.
So I'll quote two figures for each database-query combination: the first is wrapped in SELECT COUNT(1) FROM ( ... )
, the second is raw.
Setup:
- MySQL 5.0 using PremiumSoft Navicat (
LIMIT 10000
in query);
- SQL Server Express 2005 using Microsoft SQL Server Management Studio Express;
- Oracle XE using PL/SQL Developer 7 (limited to 10,000 rows).
Left outer join:
SELECT a.ID, a.SomeText, COUNT(b.ID)
FROM Blah a
LEFT OUTER JOIN Blah b ON a.ID= b.ParentID
GROUP BY a.ID, a.SomeText
- MySQL: 5.0: 51.469s / 49.907s
- SQL Server: 0(1) / 9s(2)
- Oracle XE: 1.297s / 2.656s
(1) Virtually instantaneous (confirming the different execution path)
(2) Impressive considering it is returning all the rows, not 10,000
Just goes to show the value of a real database. Also, removing the SomeText field had a significant impact on MySQL's performance. Also there wasn't much difference between the limit of 10000 and not having it with MySQL (improving performance by a factor of 4-5). Oracle had it just because PL/SQL Developer barfed when it hit 100M memory usage.
Correlated Subquery:
SELECT ID, SomeText, (SELECT COUNT(1) FROM Blah WHERE ParentID= a.ID) ChildCount
FROM Blah a
- MySQL: 8.844s / 11.10s
- SQL Server: 0s / 6s
- Oracle: 0.046s / 1.563s
So MySQL is better by a factor of 4-5, Oracle is about twice as fast and SQL Server is arguably only a little faster.
The point remains: the correlated subquery version is faster in all cases.
The other advantage of correlated subqueries is that they are syntactically cleaner and easier to extend. By this I mean that if you want to do a count in a bunch of other tables, each can be included as another select item cleanly and easily. For example: imagine a record of customers to invoices where those invoices were either unpaid, overdue or paid. With a subquery that is easy:
SELECT id,
(SELECT COUNT(1) FROM invoices WHERE customer_id = c.id AND status = 'UNPAID') unpaid_invoices,
(SELECT COUNT(1) FROM invoices WHERE customer_id = c.id AND status = 'OVERDUE') overdue_invoices,
(SELECT COUNT(1) FROM invoices WHERE customer_id = c.id AND status = 'PAID') paid_invoices
FROM customers c
The aggregate version is a lot uglier.
Now I'm not saying that subqueries are always superior to aggregate joins but often enough they are that you have to test it. Depending on your data, the size of that data and your RDBMS vendor the difference can be hugely significant.
I believe this is what you are trying to do:
SELECT P.PK_ID, P.Column1, P.Column2, COUNT(C.PK_ID)
FROM
Parent P
LEFT JOIN Child C ON C.PK_ID = P.FK1
GROUP BY
P.PK_ID, P.Column1, P.Column2
An explanation of why @cletus is wrong.
First, props on doing the research.
Second, you are doing it wrong.
Explanation:
Original query:
EXPLAIN
SELECT ID, (SELECT COUNT(1) FROM Blah WHERE ParentID= a.ID) as ChildCount
FROM Blah a
Result:
"Seq Scan on blah a (cost=0.00..145180063607.45 rows=2773807 width=4)"
" SubPlan"
" -> Aggregate (cost=52339.61..52339.63 rows=1 width=0)"
" -> Seq Scan on blah (cost=0.00..52339.59 rows=10 width=0)"
" Filter: (parentid = $0)"
What happens when you wrap in "select count(1)" :
EXPLAIN SELECT count(1) FROM (
SELECT ID, (SELECT COUNT(1) FROM Blah WHERE ParentID= a.ID) as ChildCount
FROM Blah a) as bar
"Aggregate (cost=52339.59..52339.60 rows=1 width=0)"
" -> Seq Scan on blah a (cost=0.00..45405.07 rows=2773807 width=0)"
Notice the difference?
The optimizer is smart enough to see that it doesn't need to do the subquery. So it's not that correlated subqueries are fast; it's that NOT DOING THEM is fast :-).
Unfortunately it can't do the same for a left outer join, since the number of results is not pre-determined by the first scan.
Lesson #1: The query plans tell you a hell of a lot. Poor experiment design gets you into trouble.
Lesson #1.1: If you don't need to do a join, by all means, don't.
I created a test dataset of roughly 2.7 million queries.
The left outer join -- without the wrapper -- ran 171,757 ms on my laptop.
The correlated subquery... I'll update when it finishes, I am at 700K ms and it's still running.
Lesson #2: When someone tells you to look at the query plan, and claims it's showing an algorithmic order of difference... look at the query plan.
Did you ever try to add an index to parent id for MySQL. I'm pretty sure the exection times will improve vastly. Haven't tested but I would say that MySQL goes through all rows to determine the count. Meaning that it does 10 - 40 billion (number of rows in the table * 10000) lookups in those 59 seconds.
Assume that SQL Server and Oracle create an index on the fly. If they do, it would be only 1 to 4 million.
Your queries all assume that the order that the parent child nodes are entered is sequential. If a child from one of the first nodes is entered at the end and its ID or PK is higher, then the query doesn't work.