Staging


Assumptions

  1. The dbtvault package assumes you've already loaded a table with raw data from a source system or feed; this is referred to as the 'raw staging layer' or 'landing zone'.

  2. All records in a single load must share the same load datetime. This restriction does not apply to Hubs and Links, and we will remove it for other structures in the future.

Getting started

The raw staging table needs to be prepared with additional columns so that we can load our raw vault. Specifically, we need to add hash keys, hashdiffs, and any implied fixed-value columns.

We also need to ensure column names align with target Hub or Link table column names.
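
For example, if the raw table exposes the business key as CUSTOMER_REF but the target Hub expects CUSTOMER_ID, a derived column (covered below) can provide the correctly named column. This is only a sketch; the column names here are hypothetical:

derived_columns:
  CUSTOMER_ID: "CUSTOMER_REF"    # alias the raw column to the name the target Hub expects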

Creating staging models

To create a stage model, we simply copy and paste the template below into a model named after the staging table/view we are creating. dbtvault will then generate the stage using the parameters provided in the next steps.

{{ dbtvault.stage(include_source_columns=true,
                  source_model=source_model,
                  derived_columns=derived_columns,
                  null_columns=null_columns,
                  hashed_columns=hashed_columns,
                  ranked_columns=ranked_columns) }}

Materialisation

The recommended materialisation for a stage is view: the stage layer applies only light transformations to the raw staging layer and needs to remain in sync with it. You may materialise some or all stages as tables if necessary, though for large volumes of data this can significantly increase costs.
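
If your stage models live in their own folder, you can also set this materialisation once for all of them in dbt_project.yml rather than per model. A minimal sketch, assuming a project named my_project with stage models under models/staging:

# dbt_project.yml (sketch) -- my_project and the staging folder are assumed names
models:
  my_project:
    staging:
      +materialized: view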

Adding the metadata

Let's look at the metadata we need to provide to the stage macro.

Source model

The "source model" for a stage does not have to be a dbt model. If you prefer not to manage the raw staging layer with dbt models, you can instead reference tables in your database via a dbt source.

The model in the 'Final model' section below shows the 'source style' source_model syntax.
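
Side by side, the two styles look like this (a sketch: raw_orders is a hypothetical dbt model name, while raw_staging/raw_customer is the dbt source and table used in the final model below):

# 'ref style': the raw stage is itself a dbt model
source_model: "raw_orders"

# 'source style': reference a table declared as a dbt source (source name: table name)
source_model:
  raw_staging: "raw_customer"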

Derived columns
Column Name       Value
SOURCE            !1
LOAD_DATETIME     CRM_DATA_INGESTION_TIME
EFFECTIVE_FROM    BOOKING_DATE
START_DATE        BOOKING_DATE
END_DATE          TO_DATE('9999-12-31')

What is the '!'?

This is syntactic sugar provided by dbtvault for defining constant values.
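
As a rough illustration (the exact SQL dbtvault generates depends on your adapter), a derived column value prefixed with ! is rendered as a quoted constant, while other values are treated as column references or expressions:

-- Conceptual sketch of the derived columns in the generated stage SQL
'1'                     AS SOURCE,           -- "!1" becomes the constant '1'
CRM_DATA_INGESTION_TIME AS LOAD_DATETIME,    -- a plain value is used as a column reference
TO_DATE('9999-12-31')   AS END_DATE          -- expressions are passed through as-is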

Hashed columns
Column Name           Value
CUSTOMER_HK           CUSTOMER_ID
NATION_HK             NATION_ID
CUSTOMER_NATION_HK    CUSTOMER_ID, NATION_ID
CUSTOMER_HASHDIFF     is_hashdiff: true
                      columns: CUSTOMER_NAME, CUSTOMER_ID, CUSTOMER_PHONE, CUSTOMER_DOB
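
dbtvault generates the hashing SQL for these columns itself. The exact output depends on your adapter and the configured hash algorithm (MD5 by default, with SHA-256 available), but conceptually single-column entries hash one column, while multi-column entries and hashdiffs hash a delimited concatenation of the listed columns, roughly like this:

-- Conceptual sketch only, not the exact SQL dbtvault generates
MD5(UPPER(TRIM(CAST(CUSTOMER_ID AS VARCHAR))))  AS CUSTOMER_HK,
MD5(UPPER(TRIM(CAST(CUSTOMER_ID AS VARCHAR))) || '||' ||
    UPPER(TRIM(CAST(NATION_ID AS VARCHAR))))    AS CUSTOMER_NATION_HK
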
Final model

When we provide the metadata above, our model should look like the following:

{{ config(materialized='view') }}

{%- set yaml_metadata -%}
source_model: 
  raw_staging: "raw_customer"
derived_columns:
  SOURCE: "!1"
  LOAD_DATETIME: "CRM_DATA_INGESTION_TIME"
  EFFECTIVE_FROM: "BOOKING_DATE"
  START_DATE: "BOOKING_DATE"
  END_DATE: "TO_DATE('9999-12-31')"
hashed_columns:
  CUSTOMER_HK: "CUSTOMER_ID"
  NATION_HK: "NATION_ID"
  CUSTOMER_NATION_HK:
    - "CUSTOMER_ID"
    - "NATION_ID"
  CUSTOMER_HASHDIFF:
    is_hashdiff: true
    columns:
      - "CUSTOMER_NAME"
      - "CUSTOMER_ID"
      - "CUSTOMER_PHONE"
      - "CUSTOMER_DOB"
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ dbtvault.stage(include_source_columns=true,
                  source_model=metadata_dict['source_model'],
                  derived_columns=metadata_dict['derived_columns'],
                  null_columns=none,
                  hashed_columns=metadata_dict['hashed_columns'],
                  ranked_columns=none) }}

Note

See our metadata reference for more detail on how to provide metadata to stages.

Summary

In summary, this model will:

  • Be materialized as a view
  • Select all columns from the raw_customer table, referenced via the raw_staging dbt source
  • Generate hashed columns to create hash keys and a hashdiff
  • Generate a SOURCE column with the constant value 1
  • Generate an EFFECTIVE_FROM column derived from the BOOKING_DATE column in the raw data
  • Generate START_DATE and END_DATE columns for use in effectivity satellites later on

Using the staging macro

See the stage section of the macro documentation for a more in-depth look at what you can do with the stage macro.

Running dbt

With our model complete and our YAML written, we can run dbt to build the stage. Assuming the model file is named v_stg_customer (after the customer data we are staging):

dbt run -s v_stg_customer

On older versions of dbt, use the -m flag instead:

dbt run -m v_stg_customer

The resulting stage view will look like this:

CUSTOMER_HK  NATION_HK  CUSTOMER_NATION_HK  CUSTOMER_HASHDIFF  (source table columns)  LOAD_DATETIME            SOURCE  EFFECTIVE_FROM  START_DATE  END_DATE
B8C37E...    D89F3A...  72A160...           ...                ...                     1993-01-01 00:00:00.000  1       1993-01-01      1993-01-01  9999-12-31
...          ...        ...                 ...                ...                     ...                      1       ...             ...         ...
...          ...        ...                 ...                ...                     ...                      1       ...             ...         ...
FED333...    D78382...  1CE6A9...           ...                ...                     1993-01-01 00:00:00.000  1       1993-01-01      1993-01-01  9999-12-31

Next steps

Now that we have implemented a new staging layer with all the required fields and hashes, we can start loading our vault.
