Monday, April 17, 2006

What is a Correlated Subquery?

What is a Correlated Subquery?
A correlated subquery is a SELECT statement nested inside another T-SQL statement, which contains a reference to one or more columns in the outer query. Therefore, the correlated subquery can be said to be dependent on the outer query. This is the main difference between a correlated subquery and just a plain subquery. A plain subquery is not dependent on the outer query, can be run independently of the outer query, and will return a result set. A correlated subquery, since it is dependent on the outer query will return a syntax errors if it is run by itself.

A correlated subquery will be executed many times while processing the T-SQL statement that contains the correlated subquery. The correlated subquery will be run once for each candidate row selected by the outer query. The outer query columns, referenced in the correlated subquery, are replaced with values from the candidate row prior to each execution. Depending on the results of the execution of the correlated subquery, it will determine if the row of the outer query is returned in the final result set.

Using a Correlated Subquery in a WHERE Clause
Suppose you want a report of all "OrderID's" where the customer did not purchase more than 10% of the average quantity sold for a given product. This way you could review these orders, and possibly contact the customers, to help determine if there was a reason for the low quantity order. A correlated subquery in a WHERE clause can help you produce this report. Here is a SELECT statement that produces the desired list of "OrderID's":

select distinct OrderId from Northwind.dbo.[Order Details] OD
where Quantity
from Northwind.dbo.[Order Details]
where OD.ProductID = ProductID)

The correlated subquery in the above command is contained within the parenthesis following the greater than sign in the WHERE clause above. Here you can see this correlated subquery contains a reference to "OD.ProductID". This reference compares the outer query's "ProductID" with the inner query's "ProductID". When this query is executed, the SQL engine will execute the inner query, the correlated subquery, for each "[Order Details]" record. This inner query will calculate the average "Quantity" for the particular "ProductID" for the candidate row being processed in the outer query. This correlated subquery determines if the inner query returns a value that meets the condition of the WHERE clause. If it does, the row identified by the outer query is placed in the record set that will be returned from the complete T-SQL SELECT statement.

The code below is another example that uses a correlated subquery in the WHERE clause to display the top two customers, based on the dollar amount associated with their orders, per region. You might want to perform a query like this so you can reward these customers, since they buy the most per region.

select CompanyName, ContactName, Address, City, Country, PostalCode
from Northwind.dbo.Customers OuterCwhere CustomerID in
( select top 2 InnerC.CustomerId from Northwind.dbo.[Order Details] OD join Northwind.dbo.Orders O on OD.OrderId = O.OrderID join Northwind.dbo.Customers InnerC on O.CustomerID = InnerC.CustomerId Where Region = OuterC.Region
group by Region, InnerC.CustomerId
order by sum(UnitPrice * Quantity * (1-Discount)) desc ) order by Region

Here you can see the inner query is a correlated subquery because it references "OuterC", which is the table alias for the "Northwind.DBO.Customer" table in the outer query. This inner query uses the "Region" value to calculate the top two customers for the region associated with the row being processed from the outer query. If the "CustomerID" of the outer query is one of the top two customers, then the record is placed in the record set to be returned.

Correlated Subquery in the HAVING Clause
Say your organizations wants to run a yearlong incentive program to increase revenue. Therefore, they advertise to your customers that if each order they place, during the year, is over $750 you will provide them a rebate at the end of the year at the rate of $75 per order they place. Below is an example of how to calculate the rebate amount. This example uses a correlated subquery in the HAVING clause to identify the customers that qualify to receive the rebate. Here is my code for this query:

select C.CustomerID, Count(*)*75 Rebate from Northwind.DBO.Customers C join Northwind.DBO.Orders O on c.CustomerID = O.CustomerID where Datepart(yy,OrderDate) = '1998' group by C.CustomerId having 750 < orderid =" OD.OrderID" customerid =" C.CustomerId">

By reviewing this query, you can see I am using a correlated query in the HAVING clause to calculate the total order amount for each customer order. I use the "CustomerID" from the outer query and the year of the order "Datepart(yy,OrderDate)", to help identify the Order records associated with each customer, that were placed the year '1998'. For these associated records I am calculating the total order amount, for each order, by summing up all the "[Order Details]" records, using the following formula:
sum(UnitPrice * Quantity * (1-Discount)). If each and every order for a customer, for year 1998 has a total dollar amount greater than 750, I then calculate the Rebate amount in the outer query using this formula "Count(*)*75 ".
SQL Server's query engine will only execute the inner correlated subquery in the HAVING clause for those customer records identified in the outer query, or basically only those customer that placed orders in "1998".

Performing an Update Statement Using a Correlated Subquery
A correlated subquery can even be used in an update statement. Here is an example:

create table A(A int, S int)
create table B(A int, B int)
-----------
set nocount on
insert into A(A) values(1)
insert into A(A) values(2)
insert into A(A) values(3)
insert into B values(1,1)
insert into B values(2,1)
insert into B values(2,1)
insert into B values(3,1)
insert into B values(3,1)
insert into B values(3,1)
update A set S = (select sum(B) from B where A.A = A group by A)
-----------
select * from A
-----------
drop table A,B

Here is the result set I get when I run this query on my machine:
A S
----------- -----------
1 1
2 2
3 3

In my query above, I used the correlated subquery to update column A in table A with the sum of column B in table B for rows that have the same value in column A as the row being updated.

Correlated in Condition:

AND EXISTS (SELECT TOP 1 1 FROM Table1 where {expression})


(Taken From Databasejournal article by Gregory A. Larsen)

No comments: