Sometimes we need to show that the exact same data is produced by two different processes.
We might be replacing one implementation of an algorithm with a more efficient one, or we might be running the same algorithm on a different technology.
We want to show that the data in table_a
and table_b
is identical, or if it is not identical, we want to identify all differences.
In this article, I want to share with you a method that helped me do this.
SQL, being set-based, help us find differing rows easily and elegantly, but finding the differing columns requires a bit of careful and systematic manual work on our part.
Comparing on two dimensions
Assuming that the “shape” or schema of the two tables is equivalent (see notes below on how to determine this), we can consider how similar the contained data is by looking at the vertical (rows), and horizontal (columns) dimensions.
Vertical, or row difference
these possibilities exist:
- full disjunction :
table_a
andtable_b
differ completely and do not overlap at all - non empty union :
table_a
andtable_b
partially overlap, but differ in all other rows table_a
is entirely contained intable_b
:table_a
is a subset oftable_b
table_b
is entirely contained intable_a
:table_b
is a subset oftable_a
- set equivalence :
table_a
andtable_b
overlap completely and do not differ at all
It is worth noting that:
- We have set equivalence if we have both
table_a
is a subset oftable_b
, ANDtable_b
is a subset oftable_a
. - If row counts for the tables differ we can never have set equivalence.
- Even if row counts for the tables differ, we can still potentially have non empty union or a subset scenario.
- If row counts for the two tables are both
N
, , and the row count for the set intersection between the two tables is alsoN
, then we have set equivalence.
To find differing/similar rows, you might be tempted to construct elaborate joins involving all the columns, perhaps using FULL JOIN, but there exist a more elegant approach which involves some of SQL’s set operators.
SQL set operators INTERSECT and EXCEPT
Most people are familiar with the UNION
and UNION ALL
set operators, but INTERSECT
and EXCEPT
is perhaps a little less known.
As a reminder, UNION
between two queries will produce the set union, with all duplicates removed, while UNION ALL
simply unifies the two result sets into one, where there might exist duplicates.
UNION
is more expensive than UNION ALL
because UNION
performs a sort in order to do the implicit DISTINCT
.
INTERSECT
For obtaining a possible set intersection between two tables, we can use the INTERSECT
operator:
select * from table_a
intersect
select * from table_b
;
As pointed out earlier, if the row count for the intersection is the same as the row counts for the two tables, we have set equivalence:
select sum(row_count) = 0 as set_equivalence from (
select count(1) as row_count from (
select * from table_a
intersect
select * from table_b
)
union all
select count(1) as row_count from table_a
union all
select -2 * count(1) as row_count from table_b
);
EXCEPT
When we use the EXCEPT
set operator, for example, query_1 EXCEPT query_2
, we are left with all rows from query_1
that do not occur in query_2
.
Note that query_2
might contain rows that are not present in query_1
, but because EXCEPT
is not commutative, similar to how minus is not commutative, query_1 EXCEPT query_2
is not the same as query_2 EXCEPT
query_1
, and we are only left with non occurring rows in query_1
.
select * from table_a
except
select * from table_b;
…is not the same as:
select * from table_b
except
select * from table_a;
Some points to note about EXCEPT
:
- if the row counts for both tables are the same, then the row count for the result of applying the
EXCEPT
operator will always be equal to, or less than the row count for the tables. - If the row counts for the tables are equal, and the row count for the
EXCEPT
query is equal to that number, then the two tables differ completely. - If the row counts for the tables are equal), then the row count for an
EXCEPT
query between the two tables gives us the number of different rows between the two tables. - If the row counts for the tables are equal, and the row count for the
table_a EXCEPT table_b
query is equal to zero, then it means thattable_a
is completely contained intable_b
, but because the tables have the same number of rows, it must be thattable_b
andtable_a
are set equivalent. Obtaining the query results for bothtable_a EXCEPT table_b
andtable_b EXCEPT table_a
is especially useful when the row counts fortable_a
andtable_b
are equivalent. Armed with those two query results, we can now move on to the task of determining in which columns the differences occur.
Horizontal, or column differences
When we have both table_a EXCEPT table_b
and table_b EXCEPT table_a
result sets, we can do a further EXCEPT
query on them, in a special regime of commenting out columns, in order to determine which columns differ.
Setting things up for horizontal comparisons
Let’s call the result of table_a EXCEPT table_b
view except_ab
, and similarly table_b EXCEPT table_a
we call view except_ba
.
except_ab
and except_ba
could be actual views, or results persisted in new tables.
Persisting the result sets to actual tables is a better choice when the EXCEPT
opperator has to work through large data sets.
Let’s further say that both table_a
and table_b
have columns column_001
through column_100
- not an uncommonly large number of columns when dealing with analytics type, column store data sets.
We want to find out what subset of columns cause except_ab
and except_ba
to differ.
We start out by crafting a query of the form:
select count(1) from (
select
column_001
,column_002
,column_003
...
,column_100
from except_ab
except
select
column_001
,column_002
,column_003
...
,column_100
from except_ba
);
Of course, in practice, we rarely have such clean column names, but rather have all sorts of interesting and creative names to deal with. When we execute this query, we get the number of different rows - no surprises.
Performing a binary search over the column names
It is tempting at this point to start commenting out any column that appears suspicious… “That column_066
looks like a problem, let’s comment it out and run the query again” , you might reason:
select count(1) from (
select
column_001
,column_002
,column_003
...
--,column_066
...
,column_100
from except_ab
except
select
column_001
,column_002
,column_003
...
--,column_066
...
,column_100
from except_ba
);
Resist this urge in favor of a more systematic approach, which will lead you to the offending column(s) in the least amount of time. The basic idea of the systematic approach is not to try and identify individual columns, but rather finding regions of columns that contain differing column values, and then to shrink them one by one. Once we have identified all the column regions that are causing the count to be greater than zero, we can work on each of the regions in turn, to narrow down to the individual column level, which columns differ in values. When we found the regions that cause the two result sets to differ, we reduce those regions by halving them each time, until we end up with the smallest (in number of columns) sized regions that accounts for all the differences.
It is kind of awkward to describe the algorithm here in plain English, but hopefully you get the idea:
- Run the query and take note of the result - let’s call it $D$.
- Comment out half the columns of all currently uncommented columns. At the start, this would be from
column_001
tocolumn_050
. Do this in both branches of the query. - Run the query and take note of the new result - let’s call it $E$.
- If the count is the same as before, if $D = E$, it means that the offending columns definitely exist in the uncommented list of columns (and potentially also exist in the commented out ones).
- You can now try to invert the commented regions, and run the query again - start the process from step 1.
- If instead the count reduces to a number greater than zero, if $E > 0$, it means that you have found some of the differing columns: some of them are in the commented out list of columns.
- Now comment half of the previously uncommented region -
column_051
throughcolumn_075
, and run the query again (go to step 2.).
- Now comment half of the previously uncommented region -
- If the count drops all the way to zero, then you know that all of the differing columns exist in the commented list of columns - you have found all of them. This should be your goal in the first part of the process.
- Never start reducing the commented regions until you get a count of zero differences.
- Once you have a count of zero for the differences, start to work on each region in turn, by uncommenting half of it, and checking if the difference count remains zero.
- At the end of the process you want to be left with one or more regions of commented columns such that, uncommenting any column currently commented will make the query return a number greater than zero.
- If you reach this point, you have found the smallest number of columns that is responsible for the differences between the two result sets.
Most databases provide the multi-line commenting so that you can comment like this:
select count(1) from (
select
/*column_001
,column_002
,column_003
...
,*/column_051
,column_052
,column_053
...
,column_100
from except_ab
except
select
/*column_001
,column_002
,column_003
...
,*/column_051
,column_052
,column_053
...
,column_100
from except_ba
);
Next steps
Knowing which rows and columns differ, we can now more closely inspect the nature of the differences. One approach we can try is to aggregate over various sub groups in the data - we group over columns that do not differ, and we aggregate over values in columns that do differ. The cause of the differences might be due to something simple like a sign reversal, or a incorrect scaling operation, or it might be far more involved, but knowing which columns to inspect in the algorithm will help you a lot.
Establishing shape, or schema equivalence
As a start, we have to establish schema equivalence - the two data sets have to have the same “shape” else they cannot be equivalent.
In the context of a database, this would be the number of columns and their names and data types.
Even for very wide tables, we can fairly easily establish this by looking at the schema definition.
One approach would be to get the DDL for both data sets, the CREATE TABLE
statements, and compare the code side-by-side, or line-by-line.
Such a manual approach is only applicable if the schema is relatively small;
for larger schemas that would take too long to compare manually we can use diff tools.
When we know the schemas are equivalent, the next logical step would be to count rows - something as simple as:
select count(1) from table_a
union all
select count(1) from table_b;
Conclusion
It can be disheartening when, after a lot of hard work crafting a new algorithm or process, only to be greeted with differences of hundreds of thousands of rows. Fortunately though, my experience has been that in a big data analytics context, the pareto principal tends to hold in that, and that a minority of columns tend to be the cause of the majority of differences. In big data settings we often cannot simply eyeball the differences between data sets, mainly because of the ever-widening table schemas, and ever-growing data sizes. What we need is approaches that can scale, and that utilise the excellent features of the storage technology the data resides in, and this is probably some sort of stack that allows for set-based, SQL-like queries.
I hope that the method I shared here can help you find what you are looking for, faster.