Thursday, February 11, 2021

Data Transformation for Cloud - dbt

In this blog post we will focus on loading data from a source into a cloud data platform like Snowflake. There are different tools available for this; one that is gaining traction is dbt (https://docs.getdbt.com/). One of the main highlights of dbt is that it uses SQL for much of the data transformation/loading into a cloud data warehouse like Snowflake. dbt adds certain features on top of plain SQL that make it very flexible for ETL/ELT purposes. There are two ways to use dbt: the Command Line Interface (dbt CLI) and dbt Cloud. There are a lot of configurations available which can be set up to make the data transformation process efficient and effective. The core concept in dbt is the model. dbt uses models extensively to create tables/views on the cloud data warehouse, and the order in which those tables and views are created is taken care of by dbt through the dependencies between models. Models allow one to define the base objects and their relationships.
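To make this concrete, here is a minimal sketch of two dbt models. The file, table, and column names are made up for illustration. Each model is just a SELECT statement saved as a .sql file; the ref() function is how dbt learns which model depends on which, so it can build them in the right order.

    -- models/staging/stg_orders.sql
    -- dbt turns this SELECT into a view or table in the warehouse.
    select
        order_id,
        customer_id,
        order_date,
        amount
    from raw.orders

    -- models/marts/fct_daily_revenue.sql
    -- ref() declares a dependency on stg_orders, so dbt builds stg_orders first.
    select
        order_date,
        sum(amount) as daily_revenue
    from {{ ref('stg_orders') }}
    group by order_date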

In order to connect to the different data platforms, dbt provides adapters. These adapters allow dbt to connect to the data platform and load/transform data in the target cloud data warehouse. For a list of the available adapters, please check the following link: https://docs.getdbt.com/docs/available-adapters. The adapters are primarily for cloud data warehouses/data lakes like Snowflake, Redshift, and BigQuery. In order to start using dbt, one has to create a dbt project; to quote from dbt: "A dbt project is a directory of .sql and .yml files, which dbt uses to transform your data."
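For the CLI route, getting a project up and running takes just a handful of commands. A quick sketch (the project name here is hypothetical; the warehouse connection itself is configured separately in a profiles.yml file):

    # scaffold a new dbt project -- a directory of .sql and .yml files
    dbt init my_analytics_project

    # compile the models and run them against the target warehouse
    dbt run

    # run the tests declared in the project's .yml files
    dbt test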

Within a project, the raw tables that your loading tool lands in the warehouse are declared as sources; see https://docs.getdbt.com/docs/building-a-dbt-project/using-sources.
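To give a feel for how sources work, here is a rough sketch: sources are declared in a .yml file and then referenced from models with the source() function. The database, schema, and table names below are made up for illustration.

    # models/staging/src_sales.yml
    version: 2

    sources:
      - name: raw                # label used in source() calls
        database: raw_db         # hypothetical database the loader writes to
        schema: sales
        tables:
          - name: orders
          - name: customers

A model then selects from the source like this, which also lets dbt track lineage from the raw data to the final tables:

    -- models/staging/stg_orders.sql
    select * from {{ source('raw', 'orders') }}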

Typically in an ETL/ELT operation, there are some considerations that need to be taken into account when loading data:

1. Is the data load into the cloud data warehouse going to be a full refresh? If so, how many tables follow this loading type?
2. Is the data load into the cloud data warehouse going to be incremental? If so, how many tables follow this loading type? (See the incremental model sketch after this list.)
3. Are there going to be materialized views that need to be created?
4. Is the warehouse going to have slowly changing dimension tables? (dbt's snapshots feature is designed for exactly this.)
5. How are the relationships going to be defined?
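For point 2, dbt has an "incremental" materialization: the first run builds the table in full, and later runs only process new or changed rows. A minimal sketch, assuming a hypothetical stg_events model with an updated_at timestamp and a unique event_id:

    -- models/marts/fct_events.sql
    {{ config(materialized='incremental', unique_key='event_id') }}

    select
        event_id,
        event_type,
        updated_at
    from {{ ref('stg_events') }}

    {% if is_incremental() %}
      -- on incremental runs, only pull rows newer than what is already loaded
      where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}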

Based on the above factors and the needs of the business, all of the above choices can be implemented in dbt. When one has a lot of data being sourced that needs to be used for analytic purposes, it is not always possible to do full refreshes every day. One might have to look at loading the data incrementally to meet SLAs and improve performance. dbt also uses the concept of snapshotting source freshness; this tells the user whether the data at the source has been updated and can be pulled into the data store. Quoting from the dbt website: "This is useful for understanding if your data pipelines are in a healthy state, and is a critical component of defining SLAs for your warehouse."
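Freshness expectations are declared on sources in a .yml file and checked from the command line (dbt source snapshot-freshness at the time of writing; newer versions shorten this to dbt source freshness). A sketch, with hypothetical thresholds and a made-up _loaded_at column that records when each row was loaded:

    # models/staging/src_sales.yml
    version: 2

    sources:
      - name: raw
        schema: sales
        loaded_at_field: _loaded_at   # column recording load time (hypothetical)
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
        tables:
          - name: orders

dbt compares the maximum _loaded_at per table against these thresholds and reports a warning or error, which is a simple way to put numbers behind your SLAs.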
I hope you find the information here useful; designing proper data load and transformation strategies is key to having good data pipelines.


