Number of times a nested query is executed

2020-04-30 01:20发布

How many times will this nested sub-query be executed?

SELECT CID, CNAME 
  FROM CUSTOMER
 WHERE EXISTS ( SELECT CID 
                  FROM RENTALS 
                 WHERE CUSTOMER.CID = RENTALS.CID 
                   AND PICKUP = 'CARY' )

This is a theoretical question i.e. it is found in my book. The provided answer was 6, but I don't understand why this is so.


Ok, I think there is some problem with the book itself. I will go through the book and maybe, later ask the question.

3条回答
仙女界的扛把子
2楼-- · 2020-04-30 01:46

The query won't be executed that way. The query planner will turn that into something where it will run one query against the result of the other, not running the subquery over and over.

Most likely the query plan for your query will look the same as for a query where you make a left join and check for existing records:

select c.CID, c.CNAME
from CUSTOMER c
left join RENTALS r on c.CID = r.CID and r.PICKUP = 'CARY'
where r.CID is not null
查看更多
萌系小妹纸
3楼-- · 2020-04-30 01:53

As others point out, this correlated subquery can be rewritten as a join, but that is not entirely the complete story because the execution plan for an untransformed EXISTS is going to look pretty much like a join anyway. So this is not really a syntax question but a query optimisation issue.

EXISTS is really just syntactic shorthand for "join to this data set but only to a single row in it even if there are 1,000,000 matches", or what is also known as a semijoin.

So the semijoin required by an EXISTS predicate against a correlated or uncorrelated subquery can be implemented in a number of ways, which depend to a large extent on the numbers or records in the two tables.

If you imagine that CUSTOMER is estimated to have a single row, and the optimiser estimates that there are many thousands of rows in RENTALS for which PICKUP = 'CARY', then the optimiser's is pretty likely to read the row from the CUSTOMER TABLE and perform a single lookup against the RENTALS table.

If there are an estimated one million CUSTOMERS and only one row in the RENTALS table then that execution plan would be crazy -- the optimiser could instead invert the join by leading with the RENTALS table and looking up against the CUSTOMER table the single row to be returned. In that case the subquery has arguably not been executed at all.

In between these extremes there are various other optimisation. For example, building the unique values of the RENTAL.CID column into an in-memory hash table for rows where PICKUP='CARY' and full-scanning the CUSTOMER TABLE to probe that hash table for every row, which would be a HASH SEMIJOIN. Again, no execution of a recognisable subquery. (And the query rewrite that Barmer suggests is likely to lead to that kind of plan, but might also constrain the optimiser from seeing other plans suitable for other data distributions and cardinalities).

So, as other answers say, the question is really moot, because I think that there are two important lessons here:

  1. Many different SQL statements can lead to the same execution plan, and a single SQL statement can also lead to multiple execution plans.
  2. You should write queries that express syntactically the result that you want, and not in general to preempt and constrain the query optimiser's choices.

The second point is important, as it argues against some developer's instinct to avoid writing correlated subqueries (and EXISTS or NOT EXISTS in particular it seems) in the hope of providing their own optimisation. In particular, replacing an EXISTS with an outer join can be very suboptimal given the right/wrong data distributions.

To get to the direct answer to the question, I'd say either:

  • 0
  • 1
  • However many rows are in CUSTOMERS
  • Possibly something else.
查看更多
三岁会撩人
4楼-- · 2020-04-30 01:55

There's no correct theoretical answer to this question. A smart query optimizer can transform the query into a JOIN:

SELECT CID, CNAME
FROM Customer
LEFT JOIN (SELECT DISTINCT CID FROM Rentals WHERE PICKUP = 'CARY') as Rentals
ON Customer.CID = Rentals.CID
WHERE Rentals.CID IS NOT NULL

Now the sub-query is just executed once.

查看更多
登录 后发表回答