Dimensional modeling is a time-tested approach to building analytics-ready data warehouses. While many organizations are shifting to modern platforms like Databricks, these foundational techniques still apply.
In Part 1, we designed our dimensional schema. In Part 2, we built ETL pipelines for dimension tables. Now in Part 3, we implement the ETL logic for fact tables, emphasizing efficiency and integrity.
Fact tables and delta extracts
In the first blog, we defined the fact table, FactInternetSales, as shown below. Compared to our dimension tables, the fact table is relatively narrow in terms of record length, with only foreign key references to our dimension tables, our fact measures, our degenerate dimension fields and a single metadata field present:
NOTE: In the example below, we've altered the CREATE TABLE statement from our first post to include the foreign key definitions instead of defining these in separate ALTER TABLE statements. We've also included a primary key constraint on the degenerate dimension fields to be more explicit about their role in this fact table.
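What follows is a condensed, illustrative sketch of such a definition, expressed through spark.sql so it sits alongside the Python examples used later in this post. The dwh schema, the abbreviated column list and the constraint names are assumptions standing in for the full statement from the first post.

```python
# Condensed sketch of the revised fact table definition; the dwh schema, abbreviated
# column list and constraint names are illustrative stand-ins for the Part 1 DDL.
spark.sql("""
  CREATE TABLE IF NOT EXISTS dwh.FactInternetSales (
    OrderDateKey INT NOT NULL,
    CustomerKey BIGINT NOT NULL,
    ProductKey BIGINT NOT NULL,
    PromotionKey BIGINT NOT NULL,
    CurrencyKey BIGINT NOT NULL,
    SalesOrderNumber STRING NOT NULL,      -- degenerate dimension
    SalesOrderLineNumber INT NOT NULL,     -- degenerate dimension
    OrderQuantity INT,
    UnitPrice DECIMAL(19,4),
    SalesAmount DECIMAL(19,4),
    -- remaining measures elided for brevity
    LastModifiedDateTime TIMESTAMP,        -- metadata field used for delta extracts
    CONSTRAINT pk_factinternetsales PRIMARY KEY (SalesOrderNumber, SalesOrderLineNumber),
    CONSTRAINT fk_orderdate FOREIGN KEY (OrderDateKey) REFERENCES dwh.DimDate (DateKey),
    CONSTRAINT fk_customer  FOREIGN KEY (CustomerKey)  REFERENCES dwh.DimCustomer (CustomerKey),
    CONSTRAINT fk_product   FOREIGN KEY (ProductKey)   REFERENCES dwh.DimProduct (ProductKey)
    -- remaining foreign keys elided for brevity
  )
""")
```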
The table definition is fairly straightforward, but it's worth taking a moment to discuss the LastModifiedDateTime metadata field. While fact tables are relatively narrow in terms of field count, they tend to be very deep in terms of row count. Fact tables often house millions, if not billions, of records, often derived from high-volume operational activities. Instead of attempting to reload the table with a full extract on each ETL cycle, we will typically limit our efforts to new records and those that have been modified.
Depending on the source system and its underlying infrastructure, there are many ways to identify which operational records need to be extracted in a given ETL cycle. Change data capture (CDC) capabilities implemented on the operational side are the most reliable mechanisms. But when these are unavailable, we often fall back to timestamps recorded on each transaction record as it is created and modified. The approach is not bulletproof for change detection, but as any experienced ETL developer will attest, it's often the best we've got.
NOTE: The introduction of Lakeflow Connect provides an interesting option for performing change data capture on relational databases. This capability is in preview at the time of writing. However, as the capability matures to support more and more RDBMSs, we expect it to provide an effective and efficient mechanism for incremental extracts.
In our fact table, the LastModifiedDateTime field captures such a timestamp value recorded in the operational system. Before extracting data from the operational system, we will review the fact table to identify the latest value for this field that we have recorded. That value will be the starting point for our incremental (aka delta) extract.
The fact ETL workflow
The high-level workflow for our fact ETL will proceed as follows:
- Retrieve the latest LastModifiedDateTime value from our fact table.
- Extract relevant transactional data from the source system with timestamps on or after that latest LastModifiedDateTime value.
- Perform any additional data cleansing steps required on the extracted data.
- Publish any late-arriving member values to the relevant dimensions.
- Look up foreign key values from the relevant dimensions.
- Publish the data to the fact table.
To make this workflow easier to digest, we'll describe its key phases in the following sections. Unlike the post on dimension ETL, we will implement the logic for this workflow using a combination of SQL and Python, based on which language makes each step most straightforward to implement. Again, one of the strengths of the Databricks Platform is its support for multiple languages. Instead of presenting this as an all-or-nothing choice made at the top of an implementation, we will show how data engineers can quickly pivot between the two within a single implementation.
Steps 1-3: Delta extract phase
Our workflow's first two steps focus on extracting new and newly updated information from our operational system. In the first step, we do a simple lookup of the latest recorded value for LastModifiedDateTime. If the fact table is empty, as it would be upon initialization, we define a default value far enough back in time that we believe it will capture all the relevant data in the source system:
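A minimal sketch of this lookup, assuming the dwh.FactInternetSales table name and an arbitrary fallback date, might look like this:

```python
# Step 1 (sketch): find the high-water mark for the incremental extract.
# The table name and fallback date are illustrative assumptions.
from datetime import datetime
from pyspark.sql import functions as F

last_modified = (
    spark.table("dwh.FactInternetSales")
      .agg(F.max("LastModifiedDateTime").alias("last_modified"))
      .collect()[0]["last_modified"]
)

# if the fact table is empty (as on initialization), fall back to a date far enough
# back in time to capture all relevant records in the source system
if last_modified is None:
    last_modified = datetime(1970, 1, 1)
```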
We can now extract the required data from our operational system using that value. While this query includes quite a bit of detail, focus your attention on the WHERE clause, where we employ the last observed timestamp value from the previous step to retrieve the individual line items that are new or modified (or associated with sales orders that are new or modified):
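The following is a simplified sketch of that extract; the sales_source schema (for example, operational tables exposed through Lakehouse Federation), the column list and the staging table name are assumptions for illustration.

```python
# Steps 2-3 (sketch): pull new and modified line items from the operational system and
# persist them to the staging schema. Source/staging names and columns are assumptions.
delta_df = spark.sql(f"""
  SELECT
    d.SalesOrderNumber,
    d.SalesOrderLineNumber,
    h.OrderDate            AS OrderDateBusinessKey,
    h.CustomerBusinessKey,
    d.ProductBusinessKey,
    -- additional business keys elided for brevity
    d.OrderQuantity,
    d.UnitPrice,
    d.SalesAmount,
    -- additional measures elided for brevity
    GREATEST(h.ModifiedDate, d.ModifiedDate) AS LastModifiedDateTime
  FROM sales_source.SalesOrderDetail d
    INNER JOIN sales_source.SalesOrderHeader h
      ON d.SalesOrderNumber = h.SalesOrderNumber
  WHERE d.ModifiedDate >= '{last_modified}'   -- the line item is new or modified
     OR h.ModifiedDate >= '{last_modified}'   -- or its sales order is new or modified
""")

# persist to a table in the engineers-only staging schema for the steps that follow
delta_df.write.mode("overwrite").saveAsTable("stage.FactInternetSales_stage")
```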
As before, the extracted data is persisted to a table in our staging schema, accessible only to our data engineers, before proceeding to subsequent steps in the workflow. If we have any additional data cleansing to perform, we should do so now.
Step 4: Late-arriving members phase
The typical sequence in a data warehouse ETL cycle is to run our dimension ETL workflows and then our fact workflows shortly after. By organizing our processes this way, we can better ensure that all the information required to connect our fact records to dimension data will be in place. However, there is a narrow window within which new, dimension-oriented data arrives and is picked up by a fact-relevant transactional record. That window widens should we have a failure in the overall ETL cycle that delays fact data extraction. And, of course, there can always be referential failures in source systems that allow questionable data to appear in a transactional record.
To insulate ourselves from this problem, we will insert into a given dimension table any business key values found in our staged fact data but not in the set of current (unexpired) records for that dimension. This approach creates a record with a business (natural) key and a surrogate key that our fact table can reference. These records will be flagged as late arriving if the targeted dimension is a Type-2 SCD so that we can update them appropriately on the next ETL cycle.
To get us started, we will compile a list of the business key fields in our staging data. Here, we are exploiting strict naming conventions that allow us to identify these fields dynamically:
NOTE: The following code examples lean more heavily on Python. Databricks supports the use of multiple languages, even within the same workflow, and here Python gives us a bit more flexibility while still aligning with SQL concepts, making the approach accessible to more traditional SQL developers.
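As a sketch, and assuming our staging columns follow a *BusinessKey suffix convention (with date references such as OrderDateBusinessKey), the key lists might be assembled like this:

```python
# Sketch: use naming conventions to find the business key columns in the staging table.
# The *BusinessKey suffix and the Date substring test are illustrative assumptions.
stage_columns = spark.table("stage.FactInternetSales_stage").columns

business_keys = [c for c in stage_columns if c.endswith("BusinessKey")]

# separate the date keys (role-playing references to the date dimension) from the rest
date_keys  = [c for c in business_keys if "Date" in c]         # e.g. OrderDateBusinessKey
other_keys = [c for c in business_keys if c not in date_keys]  # e.g. CustomerBusinessKey
```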
Notice that we have separated our date keys from the other business keys. We will return to those in a moment, but for now, let's focus on the non-date (other) keys in this table.
For each non-date business key, we can use our field and table naming conventions to identify the dimension table that should hold that key, and then perform a left anti-join (similar to a NOT IN() comparison but supporting multi-column matching if needed) to identify any values for that column in the staging table that are not in the dimension table. When we find an unmatched value, we simply insert it into the dimension table with the appropriate setting for the IsLateArriving field:
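A sketch of that loop follows, assuming the <Name>BusinessKey-to-Dim<Name> naming convention, a shared business key column name between staging and dimension, and the IsCurrent/IsLateArriving flags used with our Type-2 dimensions:

```python
# Sketch: for each non-date business key, find values in staging with no current
# dimension record and insert them as late-arriving members. Naming conventions,
# the IsCurrent flag and the identity-generated surrogate key are assumptions.
for key in other_keys:                                        # e.g. 'CustomerBusinessKey'
    dim_table = f"dwh.Dim{key.replace('BusinessKey', '')}"    # e.g. 'dwh.DimCustomer'

    # anti-join: business key values in staging but not among current (unexpired) records
    spark.sql(f"""
      SELECT DISTINCT s.{key}
      FROM stage.FactInternetSales_stage s
        LEFT ANTI JOIN {dim_table} d
          ON s.{key} = d.{key} AND d.IsCurrent = true
      WHERE s.{key} IS NOT NULL
    """).createOrReplaceTempView("late_arriving_members")

    # insert placeholders flagged as late arriving; the dimension's identity column
    # supplies the surrogate key value
    spark.sql(f"""
      INSERT INTO {dim_table} ({key}, IsLateArriving)
      SELECT {key}, true FROM late_arriving_members
    """)
```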
This logic would work fine for our date dimension references if all we wanted was to ensure our fact records link to valid entries. However, many downstream BI systems implement logic that requires the date dimension to hold a continuous, uninterrupted series of dates between the earliest and latest values recorded. Should we encounter a date before or after the range of values in the table, we need not just to insert the missing member but also to create the additional values required to preserve an unbroken range. For that reason, we need slightly different logic for any late-arriving dates:
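The following sketch shows one way this could look, assuming a yyyyMMdd smart key (DateKey), a FullDate column in DimDate, and date-typed business key values in staging; any remaining date attributes would be backfilled separately:

```python
# Sketch: extend the date dimension so it still spans an unbroken range of dates after
# accounting for any out-of-range dates in staging. Column names and the yyyyMMdd
# smart-key format are illustrative assumptions.
for key in date_keys:                                         # e.g. 'OrderDateBusinessKey'
    spark.sql(f"""
      INSERT INTO dwh.DimDate (DateKey, FullDate)
      SELECT
        CAST(date_format(d.FullDate, 'yyyyMMdd') AS INT) AS DateKey,  -- smart key
        d.FullDate
      FROM (
        -- build the full, unbroken range of dates the dimension must now cover
        SELECT explode(sequence(r.range_start, r.range_end, interval 1 day)) AS FullDate
        FROM (
          SELECT
            LEAST(MIN(s.{key}), (SELECT MIN(FullDate) FROM dwh.DimDate))    AS range_start,
            GREATEST(MAX(s.{key}), (SELECT MAX(FullDate) FROM dwh.DimDate)) AS range_end
          FROM stage.FactInternetSales_stage s
        ) r
      ) d
        LEFT ANTI JOIN dwh.DimDate dd
          ON d.FullDate = dd.FullDate
    """)
```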
If you have not worked much with Databricks or Spark SQL, the query at the heart of this last step is likely unfamiliar. The sequence() function builds a sequence of values based on a specified start and stop. The result is an array that we can then explode (using the explode() function) so that each element in the array forms a row in a result set. From there, we simply compare the required range to what is already in the dimension table to identify which elements need to be inserted. With that insertion, we ensure the dimension has a surrogate key value, implemented as a smart key, so that our fact records have something to reference.
Steps 5-6: Data publication phase
Now that we can be confident that all business keys in our staging table can be matched to records in their corresponding dimensions, we can proceed with publication to the fact table.
The first step in this process is to look up the foreign key values for these business keys. This could be done as part of a single publication step, but the large number of joins in the query often makes that approach challenging to maintain. For that reason, we might take the less efficient but easier-to-comprehend-and-modify approach of looking up foreign key values one business key at a time and appending those values to our staging table:
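As a sketch, again leaning on the naming conventions (a <Name>Key surrogate in each dimension and the yyyyMMdd smart key in the date dimension), the lookups might proceed as follows:

```python
# Sketch: append surrogate key values to the staged records, one business key at a time.
# The <Name>Key surrogate naming, IsCurrent flag and smart-key derivation are assumptions.
from pyspark.sql import functions as F

fact_df = spark.table("stage.FactInternetSales_stage")

# non-date keys: join to the current dimension record on the business key
for key in other_keys:                                        # e.g. 'CustomerBusinessKey'
    dim_name = key.replace("BusinessKey", "")                 # e.g. 'Customer'
    dim = (
        spark.table(f"dwh.Dim{dim_name}")
          .filter("IsCurrent = true")
          .select(key, f"{dim_name}Key")                      # business key + surrogate key
    )
    fact_df = fact_df.join(dim, on=key, how="left")

# date keys: the role-playing date dimension uses a yyyyMMdd smart key, so the surrogate
# can be derived directly from the business key value rather than joined
for key in date_keys:                                         # e.g. 'OrderDateBusinessKey'
    fact_df = fact_df.withColumn(
        key.replace("BusinessKey", "Key"),                    # e.g. 'OrderDateKey'
        F.date_format(F.col(key), "yyyyMMdd").cast("int")
    )
```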
Again, we are exploiting naming conventions to make this logic more straightforward to implement. Because our date dimension is a role-playing dimension and therefore follows a more variable naming convention, we implement slightly different logic for those business keys.
At this point, our staging table houses business keys and surrogate key values along with our measures, degenerate dimension fields, and the LastModifiedDateTime value extracted from our source system. To make publication more manageable, we should align the available fields with those supported by the fact table. To do that, we need to drop the business keys:
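A minimal sketch, continuing from the dataframe assembled above:

```python
# Sketch: align the staged columns with the fact table by dropping the business keys,
# leaving surrogate keys, degenerate dimensions, measures and LastModifiedDateTime.
# fact_df and business_keys are carried over from the previous sketches.
fact_df = fact_df.drop(*business_keys)

# expose the result to SQL for the publication (MERGE) step that follows
fact_df.createOrReplaceTempView("fact_inserts")
```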
NOTE: The source dataframe is defined in the previous code block.
With the fields aligned, the publication step is straightforward. We match our incoming records to those in the fact table based on the degenerate dimension fields, which serve as a unique identifier for our fact records, and then update or insert values as needed:
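A sketch of this upsert, using the fact_inserts view defined above and the assumed table names from the earlier sketches:

```python
# Sketch: upsert the staged records into the fact table, matching on the degenerate
# dimension fields that uniquely identify each fact record.
spark.sql("""
  MERGE INTO dwh.FactInternetSales AS tgt
  USING fact_inserts AS src
    ON  tgt.SalesOrderNumber = src.SalesOrderNumber
    AND tgt.SalesOrderLineNumber = src.SalesOrderLineNumber
  WHEN MATCHED THEN
    UPDATE SET *
  WHEN NOT MATCHED THEN
    INSERT *
""")
```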
Next steps
We hope this blog series has been informative for those seeking to build dimensional models on the Databricks Platform. We anticipate that many who are experienced with this data modeling approach and its associated ETL workflows will find Databricks familiar, accessible and capable of supporting long-established patterns with minimal changes compared to what might have been implemented on RDBMS platforms. Where changes do emerge, such as the ability to implement workflow logic using a combination of Python and SQL, we hope that data engineers will find this makes their work more straightforward to implement and support over time.
To learn more about Databricks SQL, visit our website or read the documentation. You can also check out the product tour for Databricks SQL. If you want to migrate your existing warehouse to a high-performance, serverless data warehouse with a great user experience and lower total cost, Databricks SQL is the solution. Try it for free.