The dbtvault package assumes you've already loaded a Snowflake database staging table with raw data from a source system or feed (the 'raw staging layer').
All records in a single load must be for the same load datetime. This restriction does not apply to Hubs and Links, and we will soon be removing it for T-Links, Satellites and Effectivity Satellites.
The raw staging table needs to be prepared with additional columns so that we may load our raw vault. Specifically, we need to add primary key hashes, hashdiffs, and any implied fixed-value columns (see the diagram).
We also need to ensure column names align with target hub or link table column names.
Hashing of primary keys is optional in Snowflake, and natural keys alone can be used in place of hashing.
We've implemented hashing as the only option for now; a non-hashed version will be added in a future release. Check out our roadmap.
Creating the stage model
To prepare our raw staging layer, we create a dbt model and call the dbtvault stage macro with the required metadata.
Setting up staging models
First we create a new dbt model. Our example source table is called raw_orders, and in this scenario contains data about customers and orders. We should name our staging model sensibly, for example stg_orders_hashed.sql, although any consistent and sensible naming convention will work.
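A minimal sketch of this template is shown below. It assumes a dbtvault release in which the stage macro takes these four arguments (later releases add more, such as ranked_columns); the metadata variables it references are filled in over the following sections.

```jinja
-- stg_orders_hashed.sql
{{ dbtvault.stage(include_source_columns=true,
                  source_model=source_model,
                  derived_columns=derived_columns,
                  hashed_columns=hashed_columns) }}
```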
To create a staging model, we simply copy and paste the above template into a model named after the staging table/view we are creating. We provide the metadata to this template, which will use it to generate the staging layer.
Staging models should use the view materialization, though they can use the table materialization depending on your requirements. We recommend setting the view materialization on all of your staging models using the dbt_project.yml file.
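For example, an entry along the following lines sets every model in the staging folder to be materialized as a view; the project name and folder structure here are illustrative placeholders:

```yaml
# dbt_project.yml (extract) -- assumes staging models live in models/staging
models:
  my_dbtvault_project:
    staging:
      +materialized: view
```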
Adding the metadata
Let's look at the metadata we need to provide to the stage macro.
The first piece of metadata we need is the source model: the table we are staging. This can be referenced in dbt source style or ref style. Generally speaking, our source for staging will be an external raw source of data, so we should set up a dbt source and use the source style. We will assume you have opted to use the source style for the remainder of the staging tutorial.
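As a sketch, a staging model that provides only the source_model metadata might look like the following; raw_source is an assumed name for the dbt source, and fromyaml is dbt's built-in YAML parser:

```jinja
-- stg_orders_hashed.sql
{%- set yaml_metadata -%}
source_model:
  raw_source: "raw_orders"
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ dbtvault.stage(include_source_columns=true,
                  source_model=metadata_dict["source_model"],
                  derived_columns=none,
                  hashed_columns=none) }}
```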
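This assumes a dbt source along these lines has been declared in a schema.yml file (the source and table names are again illustrative):

```yaml
# models/staging/schema.yml
version: 2

sources:
  - name: raw_source
    tables:
      - name: raw_orders
```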
Adding hashed columns
We can now specify a mapping of columns to hash, which we will use in our raw vault layer.
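Extending the yaml_metadata from the previous step, the hashed_columns mapping might look like the sketch below; the exact list of payload columns for the hashdiff is an assumption and should reflect your own source data:

```jinja
{%- set yaml_metadata -%}
source_model:
  raw_source: "raw_orders"
hashed_columns:
  CUSTOMER_PK: "CUSTOMER_ID"
  NATION_PK: "NATION_ID"
  CUSTOMER_NATION_PK:
    - "CUSTOMER_ID"
    - "NATION_ID"
  CUSTOMER_HASHDIFF:
    is_hashdiff: true
    columns:
      - "CUSTOMER_DOB"
      - "CUSTOMER_ID"
      - "CUSTOMER_NAME"
{%- endset -%}
```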
With this metadata, the stage macro will:
- Hash the CUSTOMER_ID column, and create a new column called CUSTOMER_PK containing the hash value.
- Hash the NATION_ID column, and create a new column called NATION_PK containing the hash value.
- Concatenate the values in the CUSTOMER_ID and NATION_ID columns and hash them in the order supplied, creating a new column called CUSTOMER_NATION_PK containing the hash of the combination of the values.
- Concatenate the values in the columns listed under CUSTOMER_HASHDIFF (CUSTOMER_DOB among them) and hash them, creating a new column called CUSTOMER_HASHDIFF containing the hash of the combination of the values. The is_hashdiff: true flag should be provided so that dbtvault knows to treat this column as a hashdiff; treating this column as a hashdiff means dbtvault will automatically sort the columns prior to hashing.
See Why do we hash? for details on hashing best practices.
Adding calculated and derived columns
We can also provide a mapping of derived, calculated or constant columns which will be needed for the raw vault but which do not already exist in the raw data.
Some of these columns may be 'constants' implied by the context of the staging data.
For example, we could add a source table code value for audit purposes, or a load date which is the result of a function such as CURRENT_TIMESTAMP().
We provide a constant by prepending a ! to the front of the value in the key/value pair. Read more about constants in the stage macro documentation.
For full options, usage examples and syntax, please refer to the stage macro documentation.
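Putting it all together, a sketch of the complete staging model might look like this. The constant !RAW_ORDERS, the use of BOOKING_DATE for the date columns and the END_DATE expression are illustrative assumptions:

```jinja
-- stg_orders_hashed.sql
{%- set yaml_metadata -%}
source_model:
  raw_source: "raw_orders"
derived_columns:
  SOURCE: "!RAW_ORDERS"
  EFFECTIVE_FROM: "BOOKING_DATE"
  START_DATE: "BOOKING_DATE"
  END_DATE: "TO_DATE('9999-12-31')"
hashed_columns:
  CUSTOMER_PK: "CUSTOMER_ID"
  NATION_PK: "NATION_ID"
  CUSTOMER_NATION_PK:
    - "CUSTOMER_ID"
    - "NATION_ID"
  CUSTOMER_HASHDIFF:
    is_hashdiff: true
    columns:
      - "CUSTOMER_DOB"
      - "CUSTOMER_ID"
      - "CUSTOMER_NAME"
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ dbtvault.stage(include_source_columns=true,
                  source_model=metadata_dict["source_model"],
                  derived_columns=metadata_dict["derived_columns"],
                  hashed_columns=metadata_dict["hashed_columns"]) }}
```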
By default, the stage macro will automatically select all columns which exist in the source model, unless the include_source_columns parameter is set to false.
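If you only want the generated hashed and derived columns in your staging layer, you could set the parameter to false in the stage call instead, for example:

```jinja
{{ dbtvault.stage(include_source_columns=false,
                  source_model=metadata_dict["source_model"],
                  derived_columns=metadata_dict["derived_columns"],
                  hashed_columns=metadata_dict["hashed_columns"]) }}
```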
In summary this model will:
- Be materialized as a view
- Select all columns from the external data source
- Generate hashed columns to create primary keys and a hashdiff
- Generate a SOURCE column containing a constant value
- Generate an EFFECTIVE_FROM column derived from the BOOKING_DATE column present in the raw data
- Generate START_DATE and END_DATE columns for use in the effectivity satellites later on
With our model complete and our YAML written, we can run dbt:
dbt run -m stg_orders_hashed
And our table will look like this:
| CUSTOMER_PK | NATION_PK | CUSTOMER_NATION_PK | CUSTOMER_HASHDIFF | (source table columns) | SOURCE | EFFECTIVE_FROM | START_DATE | END_DATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
Now that we have implemented a new staging layer with all the required fields and hashes, we can start loading our vault.