Ram's Blog: February 2021

Friday, February 26, 2021

Data Management-Agile Methodology

Data today is spread everywhere in organization, the rate at which data is incoming into an organization is increasing at a very rapid rate, with various touch points to collect data. Technology is also evolving at a rapid pace, so it has/is becoming very important for organizations to streamline data ingestion and catalog them appropriately. With Data strategy being closely tied with Business strategy, it has become very critical to deliver Business value effectively and quickly. In the past years there was a traditional approach being followed for data management. In this approach there were long lead times and the end product delivered was either late or did not meet the requirements. Fast Forward we are now in the age of agile delivery, devops, Continuous deployment and Integration. In the agile world the focus is to deliver incremental business value incrementally. How do we tackle data projects in the agile world, it is not cut and dry in the data world, there are lot of dependencies both on source systems, also there are provisioning systems that need data within SLA's. In order to address some of these challenges, there some key points that can be incorporated plus make use of tools that incorporate AI techniques.

1. Leadership should embrace Agile top down for Data Projects and there should be bottom up feedback on how agile is working for these projects.
2. Leaders/Business partners should provide framework, remove roadblocks, runways that would help an organization adopt Agile. There should be a mindset to tear down methods that wouldn't work in a modern enterprise, both business/technology should come closer together to deliver solutions.

3. Collaboration should be nurtured, allow the business and technology conversations to happen. There will be role specific responsibilities but that should not provide a roadblock to agile adoption.

4. Budgeting of activities/work need to change to adopt techniques like Activity Based Costing so that features/epics/deliverables can be funded accordingly.

5. Architecture needs to speed up adoption of latest generation Data Management Tools like Atlan,Arena and DQLabs in order to facilitate more efficient data ingestion, data profiling/quality and build effective lineage.

6. Availability of quality test data or have a framework to generate test data efficiently, this is real key to move along the work in the agile pipeline and have work ready for deployment. A key part of this is to have the ability to obfuscate the data especially when working with sensitive information.

7. One of the key aspects of modern data platforms is to have the metadata/catalog evolve alongside the data pipelines that are being built. In such scenarios the right set of data management tools can reduce technical debt. This is a very crucial aspect, identifying and handling this can limit the amount of work to fix data gap issues.

8. All of the above points need to come together so that you can evolve the data platform and data management in a agile way and match to the speed of the business. Business, Technology, Architecture, Stakeholders all have a role/responsibility to make Agile Data Management happen.

Thursday, February 11, 2021

Data Transformation for Cloud - dbt

In this blog post we will focus on loading data from a valid source to a cloud data platform like snowflake. There are different tools available in order to do this, one of the tools that is gaining transaction is dbt,(https://docs.getdbt.com/). One of the main highlights of dbt is that it uses SQL for doing lot of the data transformations/loading into a cloud data warehouse like snowflake. There are certain additions that dbt has on top of using SQL that makes it very flexible for the ETL/ELT purposes. There are 2 ways to use dbt, one is to use the Command Line Interface, the other one is dbt cloud. There are lot of configurations available which can be set up to make the data transformation process efficient and effective. The core concept involved in dbt is called the models. dbt uses models extensively to create table/views on the cloud data warehouse. The order of creating the tables and views in the cloud data warehouse is taken care of dbt by using the concept of models. Models allows one to define the base objects and relationships.

In order to connect to the different data sources, there adapters available that dbt provides. These adapters allows dbt to connect to the datasource and load data into the target cloud data warehouse. For a list of the adapters available, please check the following link: https://docs.getdbt.com/docs/available-adapters. The adapters are primarily for cloud data warehouses/data lakes like Snowflake,Redshift,Bigquery. In order to start using dbt one has to create dbt project, to quote from dbt: A dbt project is a directory of .sql and .yml files, which dbt uses to transform your data.

https://docs.getdbt.com/docs/building-a-dbt-project/using-sources.

Typically in a ETL/ELT operation there are some considerations that need to be taken into account for loading data:

1. Is the data load into the cloud data warehouse going to be full refresh, if so how many tables follow this loading type.
2. Is the data load into the cloud data warehouse going to be incremental? If so how many tables follow this loading type.

3. Are there going to be materialized views that need to be created?

4. Is the warehouse going to have slowly changing dimension tables?

5. How are the relationships going to be defined.

Based on the on the above factors and the need of the business, all of the above choices can be implemented in dbt. When one has lot of data being sourced and needs to be used for analytic purposes, it is not possible to do full refreshes everyday. One might have to look at loading the data incrementally, to meet the SLAs and have improved performance. dbt uses the concept of snapshot in order to determine source freshness, this tells the use if the data at source has been updated and can be pulled into the data store. Quoting from dbt website: "This is useful for understanding if your data pipelines are in a healthy state, and is a critical component of defining SLAs for your warehouse."

I hope you find the information here useful, designing proper data load and transformation strategies is key to having good data pipelines.

Monday, February 8, 2021

Data Pipelines/Data Transformation

Data is available in abundance in organizations, more so in bigger companies. How do they make good use of the data to get valuable insights and add business value is the key strategic question in companies today. Early on there were lot of data warehouses, data marts built using complex ETL techniques, then there was this concept of ELT (Extraction, Loan and Transform) which was used to transport data. With the advent of Big data, AI/ML techniques there has been continuous need to improve the data integration in order to have successful data projects. One of the challenges companies have had is how to handle the flow of data in order to gain maximum value. There have been approaches such have one set of folks help with the sourcing of data, another set to transform and finally have the data science folks figure out the value. This approach has caused too many hand off points, and also caused lot of silos of expertise. In order to address this problem, a new concept is being used lot of organizations recently, it is called the Data Pipeline.

What is a Data Pipeline: It is the complete end to end process of getting the data from source and build the complete lifecycle (Source, transform, data quality checks, AI/ML model generation and Data visualization). Using this concept, resources are now being engaged to manage the complete pipeline, Data scientists/Analyst are being encouraged to manage and/own data pipelines. There are different tools available that help you manage the data pipeline. In some cases the data pipeline is also referred as workflows. The tools that are available today that help with these concepts are dbt:getdbt.com (Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.), please refer to https://docs.getdbt.com/docs/introduction.

Matillion,https://www.matillion.com/, Atlan (Data Management Tool), Arena from Zaloni and host of other tools like Collibra. The tools here help provide a complete end to end perspective of the data that is being used for different types of data projects. The tools mentioned above are cloud ready, with companies making the move to the cloud to store all of the data, the adoption of the above tools hopefully is much easier.

When a data scientist and/or analysts has the exposure to the tools mentioned above, they get a complete perspective of the data, understand the lineage which would help in building out better AI/ML Models that can be used by the companies. Lot of AI/Ml efforts fail because of bad data, unable to understand the lineage and dependencies. One of the key aspects to keep in mind while working on data projects is Data Drift, what does this mean? Data is not static, it keeps changing constantly, the structure of the data could change. There can be changes in schema, the granularity of data and the volume of data can keep fluctuating. When you have the tools that have been mentioned in the blog post earlier, they help understanding these changes to a great extent and help tuning the AI/ML Models. There is a lot more research made in the are of data drift and related tools. I will update on those in a different blog post.

Ram's Blog