
Finding vertical and horizontal (row and column) differences on wide tables, faster

Sometimes we need to show that the exact same data is produced by two different processes. We might be replacing one implementation of an algorithm with a more efficient one, or we might be running the same algorithm on a different technology. We want to show that the data in table_a and table_b is identical, or if it is not identical, we want to identify all differences. In this article, I want to share with you a method that helped me do this. SQL, being set-based, helps us find differing rows easily and elegantly, but finding the differing columns requires a bit of careful and systematic manual work on our part.

Comparing on two dimensions

Assuming that the “shape” or schema of the two tables is equivalent (see notes below on how to determine this), we can consider how similar the contained data is by looking at the vertical (rows) and horizontal (columns) dimensions.

Vertical, or row differences

These possibilities exist:

  1. full disjunction : table_a and table_b differ completely and do not overlap at all
  2. non-empty intersection : table_a and table_b partially overlap, but differ in all other rows
  3. table_a is entirely contained in table_b : table_a is a subset of table_b
  4. table_b is entirely contained in table_a : table_b is a subset of table_a
  5. set equivalence : table_a and table_b overlap completely and do not differ at all

It is worth noting that:

  • We have set equivalence if table_a is a subset of table_b, AND table_b is a subset of table_a.
  • If row counts for the tables differ, we can never have set equivalence.
  • Even if row counts for the tables differ, we can still potentially have a non-empty intersection or a subset scenario.
  • If row counts for the two tables are both N, and the row count for the set intersection between the two tables is also N, then we have set equivalence.

To find differing/similar rows, you might be tempted to construct elaborate joins involving all the columns, perhaps using FULL JOIN, but there exists a more elegant approach that involves some of SQL’s set operators.

SQL set operators INTERSECT and EXCEPT

Most people are familiar with the UNION and UNION ALL set operators, but INTERSECT and EXCEPT are perhaps a little less well known. As a reminder, UNION between two queries will produce the set union, with all duplicates removed, while UNION ALL simply unifies the two result sets into one, where there might exist duplicates. UNION is generally more expensive than UNION ALL because UNION has to perform the implicit DISTINCT, typically via a sort or hash.
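
A minimal illustration of the difference (most databases allow a SELECT without a FROM clause; Oracle, for example, would need FROM dual):

select 1 as n
union
select 1 as n;
-- one row: UNION removes the duplicate

select 1 as n
union all
select 1 as n;
-- two rows: UNION ALL keeps both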

INTERSECT

To obtain the set intersection between two tables, we can use the INTERSECT operator:

select * from table_a
intersect
select * from table_b
;

As pointed out earlier, if the row count for the intersection is the same as the row counts for the two tables, we have set equivalence:

-- set_equivalence is true only when the intersection, table_a and table_b all have the same row count
select count(distinct row_count) = 1 as set_equivalence from (
select count(1) as row_count from (
select * from table_a
intersect
select * from table_b
) intersection_rows
union all
select count(1) as row_count from table_a
union all
select count(1) as row_count from table_b
) row_counts;

EXCEPT

When we use the EXCEPT set operator, for example query_1 EXCEPT query_2, we are left with all rows from query_1 that do not occur in query_2. Note that query_2 might contain rows that are not present in query_1, but because EXCEPT is not commutative - just as subtraction is not commutative - query_1 EXCEPT query_2 is not the same as query_2 EXCEPT query_1: we are only left with the rows of query_1 that do not occur in query_2.

select * from table_a
except
select * from table_b;

…is not the same as:

select * from table_b
except
select * from table_a;
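
If you want to keep an eye on both directional counts at once, here is a minimal sketch that fetches them in a single statement, on databases that support scalar subqueries in the select list and a FROM-less SELECT (the aliases diff_ab and diff_ba are just illustrative):

select
(select count(1) from (select * from table_a except select * from table_b) diff_ab) as rows_in_a_not_in_b
,(select count(1) from (select * from table_b except select * from table_a) diff_ba) as rows_in_b_not_in_a
;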

Some points to note about EXCEPT:

  1. If the row counts for both tables are the same, then the row count for the result of applying the EXCEPT operator will always be equal to, or less than, the row count for the tables.
  2. If the row counts for the tables are equal, and the row count for the EXCEPT query is equal to that number, then the two tables differ completely.
  3. If the row counts for the tables are equal, then the row count for an EXCEPT query between the two tables gives us the number of differing rows between the two tables.
  4. If the row counts for the tables are equal, and the row count for the table_a EXCEPT table_b query is equal to zero, then table_a is completely contained in table_b; but because the tables have the same number of rows, it must be that table_a and table_b are set equivalent.

Obtaining the query results for both table_a EXCEPT table_b and table_b EXCEPT table_a is especially useful when the row counts for table_a and table_b are equivalent. Armed with those two query results, we can now move on to the task of determining in which columns the differences occur.

Horizontal, or column differences

When we have both the table_a EXCEPT table_b and table_b EXCEPT table_a result sets, we can run a further EXCEPT query on them, systematically commenting out columns, in order to determine which columns differ.

Setting things up for horizontal comparisons

Let’s call the result of table_a EXCEPT table_b view except_ab, and similarly, let’s call the result of table_b EXCEPT table_a view except_ba. except_ab and except_ba could be actual views, or results persisted in new tables. Persisting the result sets to actual tables is a better choice when the EXCEPT operator has to work through large data sets.
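
For example, on databases that support CREATE TABLE ... AS, persisting the two result sets could look something like this (use CREATE VIEW instead if you prefer views, and adjust the syntax to your platform):

-- rows that are in table_a but not in table_b
create table except_ab as
select * from table_a
except
select * from table_b;

-- rows that are in table_b but not in table_a
create table except_ba as
select * from table_b
except
select * from table_a;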

Let’s further say that both table_a and table_b have columns column_001 through column_100 - not an uncommonly large number of columns when dealing with analytics-type, column-store data sets.

We want to find out which subset of columns causes except_ab and except_ba to differ. We start out by crafting a query of the form:

select count(1) from (
select
column_001
,column_002
,column_003
...
,column_100
from except_ab
except
select
column_001
,column_002
,column_003
...
,column_100
from except_ba
);

Of course, in practice, we rarely have such clean column names, but rather have all sorts of interesting and creative names to deal with. When we execute this query, we get the number of different rows - no surprises.

Performing a binary search over the column names

It is tempting at this point to start commenting out any column that appears suspicious… “That column_066 looks like a problem, let’s comment it out and run the query again”, you might reason:

select count(1) from (
select
column_001
,column_002
,column_003
...
--,column_066
...
,column_100
from except_ab
except
select
column_001
,column_002
,column_003
...
--,column_066
...
,column_100
from except_ba
);

Resist this urge in favor of a more systematic approach, which will lead you to the offending column(s) in the least amount of time. The basic idea of the systematic approach is not to try to identify individual columns one by one, but to find regions of columns that contain differing column values, and then to shrink those regions. Once we have identified all the column regions that cause the count to be greater than zero, we work on each region in turn, halving it each time, until we end up with the smallest regions (in number of columns) that account for all the differences - and thus know, at the individual column level, which columns differ in values.

It is kind of awkward to describe the algorithm here in plain English, but hopefully you get the idea:

  1. Run the query and take note of the result - let’s call it $D$.
  2. Comment out half of the currently uncommented columns. At the start, this would be column_001 through column_050. Do this in both branches of the query.
  3. Run the query and take note of the new result - let’s call it $E$.
  4. If the count is the same as before, that is $D = E$, it means that the offending columns definitely exist in the uncommented list of columns (and potentially also exist in the commented out ones).
    1. You can now try to invert the commented regions, and run the query again - start the process from step 1.
  5. If instead the count drops, but not all the way to zero, that is $0 < E < D$, it means that you have found some of the differing columns: some of them are in the commented out list of columns (and, since $E > 0$, some remain in the uncommented ones).
    1. Now comment half of the previously uncommented region - column_051 through column_075, and run the query again (go to step 2.).
  6. If the count drops all the way to zero, then you know that all of the differing columns exist in the commented list of columns - you have found all of them. This should be your goal in the first part of the process.
  7. Never start reducing the commented regions until you get a count of zero differences.
  8. Once you have a count of zero for the differences, start to work on each region in turn, by uncommenting half of it, and checking if the difference count remains zero.
  9. At the end of the process you want to be left with one or more regions of commented columns such that uncommenting any currently commented column will make the query return a number greater than zero.
  10. If you reach this point, you have found the smallest number of columns that is responsible for the differences between the two result sets.

Most databases support multi-line comments, so you can comment out a whole block of columns like this:

select count(1) from (
select
/*column_001
,column_002
,column_003
...
,*/column_051
,column_052
,column_053
...
,column_100
from except_ab
except
select
/*column_001
,column_002
,column_003
...
,*/column_051
,column_052
,column_053
...
,column_100
from except_ba
);

Next steps

Knowing which rows and columns differ, we can now more closely inspect the nature of the differences. One approach we can try is to aggregate over various sub-groups in the data - we group over columns that do not differ, and we aggregate over values in columns that do differ. The cause of the differences might be something simple like a sign reversal, or an incorrect scaling operation, or it might be far more involved, but knowing which columns to inspect will help you a lot.
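
As a hypothetical sketch, suppose the process above singled out column_007 as a differing column, and column_001 is a key-like column that does not differ; we could then compare an aggregate of column_007 per group across the two result sets:

select
coalesce(a.column_001, b.column_001) as column_001
,a.sum_007 as sum_007_in_a
,b.sum_007 as sum_007_in_b
from (
-- aggregate the differing column per group on the a-side differences
select column_001, sum(column_007) as sum_007
from except_ab
group by column_001
) a
full join (
-- and the same aggregate on the b-side differences
select column_001, sum(column_007) as sum_007
from except_ba
group by column_001
) b
on a.column_001 = b.column_001
order by 1;

A sign reversal, for example, would show up as sum_007_in_a being the exact negative of sum_007_in_b in every group.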

Establishing shape, or schema equivalence

As a start, we have to establish schema equivalence - the two data sets have to have the same “shape”, else they cannot be equivalent. In the context of a database, this would be the number of columns and their names and data types. Even for very wide tables, we can fairly easily establish this by looking at the schema definition. One approach would be to get the DDL for both data sets, the CREATE TABLE statements, and compare the code side-by-side, or line-by-line. Such a manual approach is only practical if the schema is relatively small; for larger schemas that would take too long to compare manually, we can use diff tools.
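
On databases that expose the standard information_schema, the same EXCEPT trick can be applied to the metadata itself; something along these lines (run it in both directions, and add a schema filter if needed):

select column_name, data_type
from information_schema.columns
where table_name = 'table_a'
except
select column_name, data_type
from information_schema.columns
where table_name = 'table_b';

An empty result in both directions means the two tables expose the same column names and data types.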

When we know the schemas are equivalent, the next logical step would be to count rows - something as simple as:

select count(1) from table_a
union all
select count(1) from table_b;

Conclusion

It can be disheartening when, after a lot of hard work crafting a new algorithm or process, you are greeted with differences of hundreds of thousands of rows. Fortunately, my experience has been that in a big data analytics context the Pareto principle tends to hold, and a minority of columns tends to be the cause of the majority of differences. In big data settings we often cannot simply eyeball the differences between data sets, mainly because of ever-widening table schemas and ever-growing data sizes. What we need are approaches that can scale, and that utilise the excellent features of the storage technology the data resides in - and this is probably some sort of stack that allows for set-based, SQL-like queries.

I hope that the method I shared here can help you find what you are looking for, faster.