Monday, February 8, 2021

Data Pipelines/Data Transformation

Data is available in abundance in organizations, more so in bigger companies. How to make good use of that data to get valuable insights and add business value is the key strategic question in companies today. Early on, a lot of data warehouses and data marts were built using complex ETL techniques; then came the concept of ELT (Extract, Load and Transform), which was used to transport data. With the advent of Big Data and AI/ML techniques, there has been a continuous need to improve data integration in order to have successful data projects. One of the challenges companies have had is how to handle the flow of data in order to gain maximum value. One approach has been to have one set of folks help with sourcing the data, another set transform it, and finally have the data science folks figure out the value. This approach has caused too many hand-off points and has also created a lot of silos of expertise. To address this problem, a new concept is being used by a lot of organizations recently: the Data Pipeline.

What is a Data Pipeline: It is the complete end-to-end process of getting the data from the source and building out the complete lifecycle (source, transform, data quality checks, AI/ML model generation and data visualization). Using this concept, resources are now being engaged to manage the complete pipeline, and data scientists/analysts are being encouraged to manage and/or own data pipelines. In some cases data pipelines are also referred to as workflows.

There are different tools available that help you manage the data pipeline. The tools available today that help with these concepts include dbt (analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it; dbt allows anyone comfortable with SQL to own that workflow), Matillion, Atlan (a data management tool), Arena from Zaloni and a host of other tools like Collibra. These tools help provide a complete end-to-end perspective of the data that is being used for different types of data projects. They are also cloud ready, so with companies making the move to the cloud to store all of their data, adoption of these tools should hopefully be much easier.
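To make the lifecycle stages a bit more concrete, here is a minimal sketch in plain Python of a pipeline that sources records, transforms them and runs a data quality check before handing the result on. It does not use any of the tools named above; the function names, the sample records and the quality rules are all assumptions made up for illustration.

```python
# A minimal, illustrative pipeline: source -> transform -> data quality check.
# All names and sample data here are hypothetical.

def source():
    # Stand-in for an extract from a database, file or API.
    return [
        {"customer_id": 1, "amount": "120.50"},
        {"customer_id": 2, "amount": "80.00"},
        {"customer_id": None, "amount": "15.25"},
    ]

def transform(records):
    # Convert the amount from text to a number, leaving other fields as-is.
    return [{**r, "amount": float(r["amount"])} for r in records]

def quality_check(records):
    # Assumed rules: customer_id must be present and amount must not be negative.
    passed, rejected = [], []
    for r in records:
        if r["customer_id"] is not None and r["amount"] >= 0:
            passed.append(r)
        else:
            rejected.append(r)
    return passed, rejected

def run_pipeline():
    raw = source()
    transformed = transform(raw)
    clean, rejected = quality_check(transformed)
    return {
        "clean": clean,
        "rejected": rejected,
        "total_amount": sum(r["amount"] for r in clean),
    }

result = run_pipeline()
print(len(result["clean"]), "clean records,", len(result["rejected"]), "rejected")
```

The point of keeping every stage in one place is that a single owner can see where a record was changed or rejected, instead of that knowledge being split across hand-offs.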

When data scientists and/or analysts have exposure to the tools mentioned above, they get a complete perspective of the data and understand its lineage, which helps in building better AI/ML models for the company. A lot of AI/ML efforts fail because of bad data and an inability to understand lineage and dependencies. One of the key aspects to keep in mind while working on data projects is Data Drift. What does this mean? Data is not static; it keeps changing constantly, and the structure of the data could change. There can be changes in schema and in the granularity of the data, and the volume of data can keep fluctuating. The tools mentioned earlier in this blog post help to a great extent in understanding these changes and in tuning the AI/ML models. There is a lot more research being done in the area of data drift and related tools. I will cover those in a different blog post.
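As a rough illustration of what checking for drift can look like, here is a small sketch in plain Python that compares a baseline batch of records with a newer batch and reports schema changes (added columns, removed columns, changed types) and a large swing in volume. This is not how any of the tools above work internally; the function names, the sample batches and the 25% volume tolerance are assumptions chosen for the example.

```python
# Illustrative drift check between two batches of records.
# Names, sample data and the tolerance value are hypothetical.

def infer_schema(records):
    # Map each column name to the set of Python type names seen for it.
    schema = {}
    for r in records:
        for key, value in r.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema

def detect_drift(baseline, current, volume_tolerance=0.25):
    base_schema = infer_schema(baseline)
    cur_schema = infer_schema(current)
    shared = set(base_schema) & set(cur_schema)
    # Relative change in row count; treated as 0 if there is no baseline.
    if baseline:
        volume_change = (len(current) - len(baseline)) / len(baseline)
    else:
        volume_change = 0.0
    return {
        "added_columns": sorted(set(cur_schema) - set(base_schema)),
        "removed_columns": sorted(set(base_schema) - set(cur_schema)),
        "type_changes": sorted(k for k in shared if base_schema[k] != cur_schema[k]),
        "volume_change": volume_change,
        "volume_drift": abs(volume_change) > volume_tolerance,
    }

baseline = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 12.5}]
current = [{"id": "3", "amount": 11.0, "region": "EU"}]

report = detect_drift(baseline, current)
print(report)
```

In this example the newer batch adds a region column, switches id from a number to text and has half the rows, so all three kinds of change show up in the report.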
