CTE : simplify those nested sub queries

This article examines how sub queries can be substituted for the far more readable common table expressions, or CTEs available in many RDBMS systems. I was motivated to write this article when a friend who is fairly new to SQL expressed difficulty in grasping queries containing nested sub queries.

If you’ve never heard of CTEs before and you want to get the most out of this article, I recommend you get AdventureWorks2014 sample database and experiment a little with the queries below. Adventureworks is a nice sample database designed to demonstrate SQL Server features, and many examples online makes use of it. If on the other hand CTEs are old hat to you a quick skim reading of the below will suffice, but pay attention since there is something interesting below which you might never have known, or maybe forgotten.

Is it just me, or is this strange?

No-no, this thing came from long, long ago, was what I was thinking the first time I was confronted with SQL SELECT syntax.

It’s all the wrong way round. Take the most basic form of a SELECT statement for example…

  1. You start out with what you want in the column list while starting the SELECT statement.
  2. Next you proceed to say from where you want the data to come from, in the FROM clause.
  3. Then you stipulate how you want to filter what is returned when you write the WHERE clause.
  4. Finally you end off with how stuff should be ordered in the ORDER BY clause.

Now, I do realise that there are (were) probably very good reasons why this particular order was chosen, but would it not make more sense to rather start out with saying where you want to draw info from (FROM clause), proceed to filter that with the WHERE clause, followed by your preferred ordering in the ORDER BY, and finishing up with the list of which columns you are interested in, e.g:

  1. FROM
  2. WHERE
  3. ORDER BY
  4. SELECT

Well, it is really of no consequence now - we have come too far along this road, however, note that the fluent way of constructing queries with .NET C# Linq offer exactly this; a more logical flow of steps.

This post is about simplifying subqueries, so getting back on track, it turns out there is a powerful feature of SQL99 called Common Table Expressions, or CTEs which offers a great help when you are dealing with a monster query with little minions of subqueries, nested deeper in there than your TV remote that sunk into the couch folds.

Whereas sub query syntax quickly gets confusing and convoluted, using the CTE approach breaks down complexity and deals with each step of the workload in separate parts before moving on to constructing a more intertwined picture.

To me, it somewhat moves towards my “more ideal” ordering of statements, all be it no real alteration to the standard SELECT syntax.

Let’s have a look at an example.

Adventureworks customers and their addresses

In Adventureworks, customers are modeled in the person.BusinessEntity table, and their addresses in the person.Address table. The relation between customers and their addresses is maintained in the person.BusinessEntityAddress table, which adds a type for each address (Billing, Home, Main Office etc).

We want to know how many customers have one, two and three or more addresses.

Since the purpose of this article is to discuss subquery vs CTEs, excuse the somewhat contrived query.
At the end of this article I'll show  a simpler way to accomplish the tast at hand.

Sub query building block

We can establish that no BusinessEntity has more than two addresses by the empty result set returned by:

select ba.BusinessEntityId, count(ba.AddressId)
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) > 2;

We’ll assume for the remainder of this article that this query is our best attempt, and that it is a “black box” with which we cannot tamper, save for adjusting the count in the HAVING clause.

Note that technically we do not need the additional join:

join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId

however this ensures we do not consider addresses in isolation - we ensure all addresses actually have an existing business tied to them.

Sub query with count of zero instead of nothing

Now we would like to improve this result a bit, by rather returning a count of zero instead of nothing. We do this by wrapping the above query in a super query:

select 
count(businessEntitiesWithMoreThanTwoAddresses.BusinessEntityId) as NumberOfBusinessesWithMoreThanTwoAddresses
from
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) > 2
) as businessEntitiesWithMoreThanTwoAddresses;

Not too bad in terms of readability, and since we alias the sub query we always know where the result for the outer select column list comes from. But what if we would like to augment this query to also return the number of businesses with only one, and two addresses respectively? Or to cover all possabilities, the number of businesses with one, two and three or more addresses respectively. Let’s try with the sub query approach:

Combined sub query approach

select 
count(businessEntitiesWithOneAddress.BusinessEntityId) as NumberOfBusinessesWithOneAddress,
count(businessEntitiesWithTwoAddresses.BusinessEntityId) as NumberOfBusinessesWithTwoAddresses,
count(businessEntitiesWithMoreThanTwoAddresses.BusinessEntityId) as NumberOfBusinessesWithMoreThanTwoAddresses
from
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) = 1
) as businessEntitiesWithOneAddress,
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) = 2
) as businessEntitiesWithTwoAddresses,
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) > 2
) as businessEntitiesWithMoreThanTwoAddresses;

Hmm… We get zero for everything. Something has gone wrong. Running each super/sub query pair separately we obtain a number for the one and two address scenario, but nothing for the three or more one. Somehow, the zero result from the three or more address scenario is forcing the other two to become zero as well. Well, where else would two hole numbers and zero uniformly combined yeald zero? Multiplication of course, and observing how we unwittingly join the three subqueries with a comma (“,”) which is nothing but a cartesian or cross join, everything start to make sense! The cross join is causing our result set to first be the number of rows returned by the first subquery, but then these rows are “multiplied” in the first cross join with the two address subquery, at which point we are already off track, and finally, to make matters even worse, the last subquery returns zero rows, and anything multiplied by zero is of course again zero.

Fixup by throwing more sub queries at the problem

Now, let’s first remedy this situation by cross joining on only one row each time. Hold on, it is going to get a bit messy since in order to do this, we have to introduce yet another subquery.

select 
A.NumberOfBusinessesWithOneAddress,
B.NumberOfBusinessesWithTwoAddresses,
C.NumberOfBusinessesWithMoreThanTwoAddresses
from
(select 
count(businessEntitiesWithOneAddress.BusinessEntityId) as NumberOfBusinessesWithOneAddress
from
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) = 1
) as businessEntitiesWithOneAddress) as A,
(select
count(businessEntitiesWithTwoAddresses.BusinessEntityId) as NumberOfBusinessesWithTwoAddresses
from
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) = 2
) as businessEntitiesWithTwoAddresses) as B,
(select
count(businessEntitiesWithMoreThanTwoAddresses.BusinessEntityId) as NumberOfBusinessesWithMoreThanTwoAddresses
from
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) > 2
) as businessEntitiesWithMoreThanTwoAddresses) as C;

Success, but at quite a complexity and readability cost.

First CTE attempt

Let’s see if we can simplify this into something more readable by using a few CTEs instead of so many subqueries.

Our query will consist of 3 parts:

  1. First, the three innermost sub queries are stated as CTEs named businessEntitiesWithOneAddress, businessEntitiesWithTwoAddresses and businessEntitiesWithMoreThanTwoAddresses.
  2. Next, the encapsulating super queries that reduce the sub queries to single values are also stated as CTEs named A, B and C.
  3. Finally we simply select from the aggregation CTEs A, B and C.
-- First, the three innermost sub queries are stated as CTEs named businessEntitiesWithOneAddress, businessEntitiesWithTwoAddresses and businessEntitiesWithMoreThanTwoAddresses.
with  businessEntitiesWithOneAddress as
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) = 1
), businessEntitiesWithTwoAddresses as
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) = 2
), businessEntitiesWithMoreThanTwoAddresses  as
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) > 2
), 

-- Next, the encapsulating super queries that reduce the sub queries to  single values are also stated as CTEs named A, B and C.
A as
(select count(businessEntitiesWithOneAddress.BusinessEntityId) as NumberOfBusinessesWithOneAddress
from businessEntitiesWithOneAddress
), B as
(select count(businessEntitiesWithTwoAddresses.BusinessEntityId) as NumberOfBusinessesWithTwoAddresses
from businessEntitiesWithTwoAddresses
), C as
(select count(businessEntitiesWithMoreThanTwoAddresses.BusinessEntityId) as NumberOfBusinessesWithMoreThanTwoAddresses
from businessEntitiesWithMoreThanTwoAddresses
)

-- Finally we simply select from the aggregation CTEs A, B and C.
select
A.NumberOfBusinessesWithOneAddress,
B.NumberOfBusinessesWithTwoAddresses,
C.NumberOfBusinessesWithMoreThanTwoAddresses
from A, B, C;

Is this not so much more readable?

This last query (with CTEs only) is functionally equivalent to the full query with only sub queries, yet, personally I find the CTE-based one significnatly more readable and understandable. For me there are many reasons for this, but I list a few:

  1. Instead of a nested style, the CTEs flow in a top to bottom manner.
  2. Each similar set of queries are grouped together.
  3. The CTEs are named upfront, so you have an idea of what is being done right from the start.

CTE support

CTE is well supported among database systems, and I was surprised to learn that even SqLite supports it!

Wikipedia says:

Recursive CTEs are also supported by

  • Microsoft SQL Server,
  • Firebird 2.1,
  • PostgreSQL 8.4+,
  • SQLite 3.8.3+,
  • Oracle 11g Release 2,
  • IBM Informix version 11.50+ and
  • CUBRID.

CTE support has been lacking from MySQL for some time now, however, derived tables can be used successfully to achieve standard or non recursive CTE functionality. The good news is that full featured CTE support, including recursiveness is not far off for MySQL.

But, can we not just join better?

Now that I’ve shown how the use of CTEs can eliminate sub queries, lets briefly return to the problem of the cartesian join. I really did not like the additional sub query, and it’s additional CTE counterpart, for reducing the result sets into something that could cross join without problems (those A, B and C). Is there not perhaps a way to join without throwing anything away?

Let’s consider the possible join types:

  1. CROSS JOIN : We already saw this does not work.
  2. INNER JOIN : This will also not work since per definition our three sets are disjoint / contain no similar items.
  3. LEFT, RIGHT JOIN : We have the same problem; our sets are disjoint.
  4. FULL JOIN : Bingo! Full joins attempt to join on the join condition, but thereafter behaves like a LEFT and RIGHT JOIN combined; nothing is lost, and unmatched rows are still returned.

Did you know about the full join? I will admit that although I knew there had to be something like it, I did not know / forgotten all about it.

Sub query approach simplified with a FULL JOIN

select 
count(businessEntitiesWithOneAddress.BusinessEntityId) as NumberOfBusinessesWithOneAddress,
count(businessEntitiesWithTwoAddresses.BusinessEntityId) as NumberOfBusinessesWithTwoAddresses,
count(businessEntitiesWithMoreThanTwoAddresses.BusinessEntityId) as NumberOfBusinessesWithMoreThanTwoAddresses
from
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) = 1
) as businessEntitiesWithOneAddress 
full join
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) = 2
) as businessEntitiesWithTwoAddresses 
on
businessEntitiesWithOneAddress.BusinessEntityId = 
businessEntitiesWithTwoAddresses.BusinessEntityId  
full join
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) > 2
) as businessEntitiesWithMoreThanTwoAddresses
on
businessEntitiesWithTwoAddresses.BusinessEntityId   = 
businessEntitiesWithMoreThanTwoAddresses.BusinessEntityId;

Yes, we do away with one level of sub queries, but it’s still not great.

CTE approach simplified with FULL JOIN

with  businessEntitiesWithOneAddress as
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) = 1
), businessEntitiesWithTwoAddresses as
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) = 2
), businessEntitiesWithMoreThanTwoAddresses  as
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
having count(ba.AddressId) > 2
)
select 
count(businessEntitiesWithOneAddress.BusinessEntityId) as NumberOfBusinessesWithOneAddress,
count(businessEntitiesWithTwoAddresses.BusinessEntityId) as NumberOfBusinessesWithTwoAddresses,
count(businessEntitiesWithMoreThanTwoAddresses.BusinessEntityId) as NumberOfBusinessesWithMoreThanTwoAddresses
from businessEntitiesWithOneAddress
full join businessEntitiesWithTwoAddresses
on businessEntitiesWithOneAddress.BusinessEntityId = businessEntitiesWithTwoAddresses.BusinessEntityId
full join businessEntitiesWithMoreThanTwoAddresses
on businessEntitiesWithTwoAddresses.BusinessEntityId = businessEntitiesWithMoreThanTwoAddresses.BusinessEntityId

Simplest solution

Finally, if we relax the requirement (which was self imposed in the first place) to have the results in one row, as well as not worrying about returning the count for the three or more scenario, we can simplify everything very nice like this:

with  businessEntitiesAddressCount as
(select ba.BusinessEntityId, 
count(ba.AddressId) as NumberOfAddresses
from person.BusinessEntity b
join person.BusinessEntityAddress ba on ba.BusinessEntityId = b.BusinessEntityId
group by ba.BusinessEntityId
)
select 
count(BusinessEntityId) as NumberOfBusinesses,
NumberOfAddresses 
from businessEntitiesAddressCount 
group by NumberOfAddresses;

Best of all; we still get to use a wonderful CTE!

Final Thoughts

The first time I realised that it is possible to nest queries arbitrarily it seemed like the sky would be the limit to what wonderfully complex and deep stuff one could do, all be it with a heavy performance penalty if you were not careful. As time went on and I wrote larger and larger queries, often times with many levels of nested sub queries, it became apparent that the readability of the code drastically reduced. In general purpose programming languages one could avoid deep nested structures quite successfully, but in SQL there seemed to be no way around it save for making use of temporary tables, or breaking the work up into smaller parts.

When I found out about CTEs I was very impressed - what an elegant solution! Yes, a carefully thought out join can often times remove the need for a sub query or CTE, but in cases where sub query is required I now opt for the CTE approach instead.

Some prominant SQL experts even equate the use of CTEs / the WITH clause to literate programming, and I concur as far as readability and logical flow is concerned. There is much more to CTEs, especially when considering the recursive variaty, but that is a story for another time.

It’s been a bit of a ramble, but if you made it this far, you’ll do well to give CTEs a try, and maybe even find it as useful as I do.