Friday, August 23, 2019

Using Hashbytes in ETL

We use Solver's BI360 for our internal data warehousing.  It works fairly well for us and is a nice hybrid solution to DW.  This is not a review.

As background I was trained by Solver tech's back in 2016 to configure the ETL that moves data into the BI360 DW.  They had pre-built SQL Server Integration Services (SSIS) ETL packages for our ERP solution (Dynamics GP) and we modified them to work in our environment.

One key element of the ETL was the use of the SQL BINARY_CHECKSUM (BC) function to detect changes in data.  It is not practical to compare every column in the DW to all the columns in the source data so it makes sense to hash them and compare the hashes instead.

As a Google search will quickly show BC is not wholly reliable in detecting certain types of changes.  For example I received an email from a user asking why an order line was not updated in our DW reports.  The order line was originally entered as quantity 1 for $2,900 each and was changed to quantity 10 at $290 each.  This change did not propagate to the DW and I traced it to the BC function.  BC clearly did not see the difference between these two values.

This was SERIOUSLY not acceptable so I turned again to Google to find alternatives.  While there may be others I settled on the HASHBYTES (HB) function.  It identified the difference above that BC did not.

It is common to use the CONCAT function with HB to build the hash value.  So I listed the columns inside CONCAT and used that as the input parameter to HB.  Further exploration of the use of HB revealed that blank column values might not be detected in HB.  For this reason it was recommended to use a field separator value - I chose | for this purpose.

There was a side-benefit to using this construct.  I suppose I could have tried to used CONCAT inside BC and seen if the results were different but I didn't.  CONCAT has the benefit of being able to easily compare the source and target strings.  This is important in determining if I've built the comparison correctly and was very difficult to track down with how I was using BC.

So now when my updates are not completing I can grab the source and target CONCAT strings and compare them to find where I've messed up.  This has worked great so far and I'm looking forward to using this moving forward.  All my ETL is now converted to HB and BC is no more.


No comments:

Post a Comment