Monday, April 5, 2021

Data Processing Unified - Omniscope

There are a lot of tools coming to market for data management, analytics and visualization. It is becoming harder for data practitioners to choose between them, so it is natural to expect consolidation into tools that handle data sourcing, management, transformation and visualization in one place. One such tool that I got to see demonstrated is Omniscope. To see what the tool offers, please visit https://visokio.com/. The tool's goal is to unify the following components of a data journey:

  1. Data Sourcing
  2. Data Blending - This is the step where you combine data from different data sources. It is very handy when, say, you are trying to combine customer information that lives on one system with borrower/loan information that lives on another; that is just one example, and you will run into many scenarios that require this step (a conceptual join sketch appears at the end of this post).
  3. Data Transformation - This is the step where you take the source or blended data and add additional attributes/calculations based on your end business goal.
  4. Data Lineage - The functionality that traces the journey of a data element as it flows from the source to different target systems. This also helps in understanding how a data element is used in different calculations/reports.
  5. Data Visualization - The layer that displays the data in different forms (like graphs, charts, plots)

The tool helps data curators/scientists/analysts maintain the pipeline of data coming through and define how it is to be consumed by users. There are plenty of visualizations available in this tool and the color schemes used are very good. The tool also attempts to maintain lineage of the data used for each report, thereby providing an opportunity for end-to-end governance. In addition to all of this, there are features for scheduling and automation. Please refer to the link above for a more in-depth look at the tool's features.
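To make the blending step concrete: outside of any specific tool, a blend like the customer/loan example above boils down to a join across systems. Below is a minimal SQL sketch; the table and column names (crm.customers, lms.loans, customer_id) are hypothetical.

    -- Hypothetical blend: customer data from a CRM system joined with
    -- loan data from a separate loan management system on a shared key.
    select
        c.customer_id,
        c.customer_name,
        c.customer_segment,
        l.loan_id,
        l.loan_amount,
        l.loan_status
    from crm.customers c
    join lms.loans l
        on l.customer_id = c.customer_id;

A tool like Omniscope aims to let you configure this kind of blend inside the product rather than hand-writing SQL against each source.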


Wednesday, March 17, 2021

Data Catalog/Documentation

One of the key aspects of Data Governance is the ability to have very good data documentation. Here we are discussing the nature of the data being collected. It is very important to have this information documented, as it helps determine how the data can be used in various applications, including building Machine Learning models. I am including here a very good blog post written by Prukalpa Shankar (Co-Founder of Atlan). In that article the 5W1H framework is discussed extensively. What is 5W1H?

  1. WHAT
  2. WHO
  3. WHY
  4. WHEN
  5. WHERE
  6. HOW

https://towardsdatascience.com/data-documentation-woes-heres-a-framework-6aba8f20626c

Excellent insights are provided in the above post; please read it and Keep Learning.
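One lightweight way to start capturing the 5W1H answers is to push them into the warehouse itself as table comments, so the documentation travels with the data. Below is a minimal sketch, assuming a warehouse that supports COMMENT ON; the table name analytics.customer_loans and the answers themselves are hypothetical.

    -- Hypothetical example: recording 5W1H answers as a table comment so the
    -- documentation lives next to the data it describes.
    comment on table analytics.customer_loans is
    'WHAT:  one row per loan per customer, from the loan origination feed.
     WHO:   owned by the lending data team; consumed by risk and reporting.
     WHY:   feeds delinquency dashboards and ML risk models.
     WHEN:  refreshed daily after the overnight batch.
     WHERE: sourced from the loan origination system and the CRM.
     HOW:   built by a scheduled transformation job that joins and cleanses the two sources.';

A dedicated catalog tool can capture much richer metadata, but even comments like these answer the first round of 5W1H questions for anyone querying the table.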

Friday, February 26, 2021

Data Management-Agile Methodology

Data today is spread everywhere in an organization, it is arriving at an ever-increasing rate, and there are ever more touch points collecting it. Technology is also evolving rapidly, so it has become very important for organizations to streamline data ingestion and catalog data appropriately. With data strategy closely tied to business strategy, it has become critical to deliver business value effectively and quickly. In past years a traditional approach was followed for data management; it had long lead times, and the end product was either delivered late or did not meet the requirements.

Fast forward to today: we are in the age of agile delivery, DevOps, and continuous integration and deployment. In the agile world the focus is to deliver business value incrementally. How do we tackle data projects in the agile world? It is not cut and dried: there are a lot of dependencies on source systems, and there are provisioning systems that need data within SLAs. To address some of these challenges, there are some key points that can be incorporated, along with tools that use AI techniques.

1. Leadership should embrace Agile top down for Data Projects and there should be bottom up feedback on how agile is working for these projects.
2. Leaders/business partners should provide the framework, remove roadblocks, and create runways that help an organization adopt Agile. There should be a mindset to tear down methods that won't work in a modern enterprise, and business and technology should come closer together to deliver solutions.
3. Collaboration should be nurtured; allow the business and technology conversations to happen. There will be role-specific responsibilities, but that should not become a roadblock to agile adoption.
4. Budgeting of activities/work needs to change to adopt techniques like Activity-Based Costing so that features/epics/deliverables can be funded accordingly.
5. Architecture needs to speed up adoption of latest-generation data management tools like Atlan, Arena and DQLabs in order to facilitate more efficient data ingestion, data profiling/quality, and effective lineage.
6. Ensure the availability of quality test data, or have a framework to generate test data efficiently; this is the real key to moving work along the agile pipeline and having it ready for deployment. A key part of this is the ability to obfuscate data, especially when working with sensitive information (a small masking sketch follows this list).
7. One of the key aspects of modern data platforms is to have the metadata/catalog evolve alongside the data pipelines that are being built. In such scenarios the right set of data management tools can reduce technical debt. This is a very crucial aspect, identifying and handling this can limit the amount of work to fix data gap issues.
8. All of the above points need to come together so that you can evolve the data platform and data management in an agile way and match the speed of the business. Business, technology, architecture and stakeholders all have a role and responsibility in making Agile Data Management happen.
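For point 6 above, one simple pattern is to expose test data through a masked view so that sensitive fields never leave production in the clear. The sketch below is only illustrative; the table and column names (prod.customers, ssn, email, zip_code) are hypothetical, and real obfuscation requirements should come from your privacy/compliance team.

    -- Hypothetical masked view used as a source of safe test data.
    create view test_data.customers_masked as
    select
        customer_id,
        md5(cast(ssn as varchar))                                  as ssn_token,   -- irreversible token, not the real SSN
        'user_' || cast(customer_id as varchar) || '@example.com'  as email,       -- synthetic email
        left(zip_code, 3) || 'XX'                                  as zip_prefix,  -- coarsened geography
        customer_segment
    from prod.customers;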

Thursday, February 11, 2021

Data Transformation for Cloud - dbt

In this blog post we will focus on loading data from a source into a cloud data platform like Snowflake. There are different tools available to do this; one of the tools that is gaining traction is dbt (https://docs.getdbt.com/). One of the main highlights of dbt is that it uses SQL for most of the data transformation and loading into a cloud data warehouse like Snowflake, and it adds certain things on top of plain SQL that make it very flexible for ETL/ELT purposes. There are two ways to use dbt: the Command Line Interface and dbt Cloud. There are a lot of configurations available that can be set up to make the data transformation process efficient and effective. The core concept in dbt is the model. dbt uses models extensively to create tables/views in the cloud data warehouse, and the order in which those tables and views are created is taken care of by dbt based on the dependencies between models. Models allow one to define the base objects and their relationships.
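As an illustration of the model concept: a dbt model is just a SELECT statement in a .sql file, and dbt turns it into a table or view and infers the build order from the source() and ref() references. Below is a minimal sketch with hypothetical model and source names (stg_orders, orders_enriched, raw.orders).

    -- models/stg_orders.sql  (hypothetical staging model)
    -- dbt materializes this select as a view and resolves the source() reference
    -- to the raw table registered in the project's sources file.
    {{ config(materialized='view') }}

    select
        order_id,
        customer_id,
        order_date,
        amount
    from {{ source('raw', 'orders') }}

    -- models/orders_enriched.sql  (hypothetical downstream model)
    -- Because it uses ref(), dbt knows stg_orders must be built first.
    select
        o.order_id,
        o.customer_id,
        o.order_date,
        o.amount,
        date_trunc('month', o.order_date) as order_month
    from {{ ref('stg_orders') }} o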

In order to connect to the different data platforms, dbt provides adapters. These adapters allow dbt to connect to the target cloud data warehouse and load data into it. For a list of the available adapters, please check the following link: https://docs.getdbt.com/docs/available-adapters. The adapters are primarily for cloud data warehouses/data lakes like Snowflake, Redshift and BigQuery. To start using dbt one has to create a dbt project; to quote from dbt: "A dbt project is a directory of .sql and .yml files, which dbt uses to transform your data."

For more on defining and using sources in a dbt project, see https://docs.getdbt.com/docs/building-a-dbt-project/using-sources.

Typically, in an ETL/ELT operation, there are some considerations that need to be taken into account for loading data:

1. Is the data load into the cloud data warehouse going to be a full refresh? If so, how many tables follow this loading type?
2. Is the data load into the cloud data warehouse going to be incremental? If so, how many tables follow this loading type?
3. Are there going to be materialized views that need to be created?
4. Is the warehouse going to have slowly changing dimension tables?
5. How are the relationships going to be defined?

Based on the above factors and the needs of the business, all of the above choices can be implemented in dbt. When a lot of data is being sourced for analytic purposes, it is often not possible to do full refreshes every day; one might have to load the data incrementally to meet SLAs and improve performance. dbt also snapshots source freshness, which tells the user whether the data at the source has been updated and can be pulled into the data store. Quoting from the dbt website: "This is useful for understanding if your data pipelines are in a healthy state, and is a critical component of defining SLAs for your warehouse."
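To make the incremental option concrete, below is a minimal sketch of an incremental dbt model; the model, source and column names (fct_events, raw.events, event_id, event_timestamp) are hypothetical. On the first run dbt builds the full table; on subsequent runs the is_incremental() block limits the select to rows newer than what is already in the target, which is what keeps daily loads within SLA.

    -- models/fct_events.sql  (hypothetical incremental model)
    {{ config(materialized='incremental', unique_key='event_id') }}

    select
        event_id,
        customer_id,
        event_type,
        event_timestamp
    from {{ source('raw', 'events') }}

    {% if is_incremental() %}
      -- only pull rows newer than the latest row already loaded into this table
      where event_timestamp > (select max(event_timestamp) from {{ this }})
    {% endif %}

Slowly changing dimensions (point 4 in the list above) can be handled with dbt snapshots, which track row-level changes over time.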
I hope you find the information here useful; designing proper data load and transformation strategies is key to having good data pipelines.



Monday, February 8, 2021

Data Pipelines/Data Transformation

Data is available in abundance in organizations, more so in bigger companies. How to make good use of that data to get valuable insights and add business value is the key strategic question in companies today. Early on, a lot of data warehouses and data marts were built using complex ETL techniques; then came the concept of ELT (Extract, Load and Transform), which was used to move data. With the advent of big data and AI/ML techniques, there has been a continuous need to improve data integration in order to have successful data projects. One of the challenges companies have had is how to handle the flow of data in order to gain maximum value. There have been approaches where one set of folks helps with sourcing the data, another set transforms it, and finally the data science folks figure out the value. This approach has caused too many hand-off points and a lot of silos of expertise. To address this problem, a concept called the Data Pipeline is being adopted by a lot of organizations.

What is a Data Pipeline? It is the complete end-to-end process of getting data from the source and building out the full lifecycle (sourcing, transformation, data quality checks, AI/ML model generation and data visualization). Using this concept, people are engaged to manage the complete pipeline, and data scientists/analysts are encouraged to manage and/or own data pipelines. There are different tools available that help you manage the data pipeline; in some cases the data pipeline is also referred to as a workflow. One of the tools available today that helps with these concepts is dbt (getdbt.com): "Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow." Please refer to https://docs.getdbt.com/docs/introduction.

Others include Matillion (https://www.matillion.com/), Atlan (a data management tool), Arena from Zaloni, and a host of other tools like Collibra. These tools help provide a complete end-to-end perspective of the data being used for different types of data projects. They are also cloud ready; with companies moving their data to the cloud, adopting them should hopefully be much easier.

When data scientists and/or analysts have exposure to the tools mentioned above, they get a complete perspective of the data and understand its lineage, which helps in building better AI/ML models for the company. A lot of AI/ML efforts fail because of bad data and an inability to understand lineage and dependencies. One key aspect to keep in mind while working on data projects is data drift. What does this mean? Data is not static; it keeps changing constantly, and the structure of the data can change too. There can be schema changes, the granularity of the data can shift, and the volume of data can keep fluctuating. The tools mentioned earlier in this post help in understanding these changes to a great extent and in tuning the AI/ML models. There is a lot more research being done in the area of data drift and related tools; I will cover those in a different blog post.
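As a very rough illustration of catching one kind of drift, the sketch below compares today's row count for a source table against its trailing seven-day average and flags large swings. The table and column names (raw.orders, load_timestamp) and the 50% threshold are hypothetical; dedicated tools go far beyond checks like this.

    -- Hypothetical volume-drift check: flag today's load if the row count
    -- swings more than 50% away from the trailing 7-day average.
    with daily_counts as (
        select
            cast(load_timestamp as date) as load_date,
            count(*)                     as row_count
        from raw.orders
        group by cast(load_timestamp as date)
    ),

    baseline as (
        select avg(row_count) as avg_rows
        from daily_counts
        where load_date between current_date - 7 and current_date - 1
    )

    select
        d.load_date,
        d.row_count,
        b.avg_rows,
        case
            when d.row_count not between b.avg_rows * 0.5 and b.avg_rows * 1.5
                then 'possible volume drift'
            else 'ok'
        end as drift_flag
    from daily_counts d
    cross join baseline b
    where d.load_date = current_date;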


Monday, January 11, 2021

Data Changes/Data Quality-AI/ML

With AI/ML getting increased adoption in organizations, data is fast becoming the most sought-after asset. The flow of data is becoming more continuous, just like opening a tap and letting the water flow. One of the trends I have noticed in data projects is that once the desired goal is achieved, there is a tendency to think the work is done. This is where the problem starts: data is not static, it changes continually. When changes in data happen, they can affect the underlying structure or grain of the tables used in AI/ML projects. If these changes are not detected or tracked up front, they can cause the AI/ML models to generate inaccurate results. So models that were very accurate three months ago are suddenly not accurate anymore.

Tracking data changes, whether changes in the format coming from a vendor or schema changes in the source, is very time consuming and largely manual. There is a very good opportunity to identify this as a problem area and see whether the gap can be addressed with automated solutions. For infrastructure and related components we have lots of utilities/dashboards that indicate when certain parameters are reaching a limit and adjustments are needed. By the same analogy, we need tools that can tell data teams/business users about any changes happening in the data. This would let data teams spend more time on tuning/optimizing the data models instead of on manual, inefficient tasks. Handling data changes is also very critical for product management, given its increasing reliance on data. This is a slightly more abstract post, but these are high-level patterns that I have seen in data-related projects.
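One small, concrete step in this direction is to compare the warehouse's information_schema against a saved snapshot of it and surface the differences. Below is a minimal sketch; the snapshot table meta.schema_snapshot and the monitored schema name 'raw' are hypothetical, and it assumes the snapshot stores the same table/column/type fields as information_schema.columns.

    -- Hypothetical schema-change report: current information_schema vs. yesterday's snapshot.
    select
        coalesce(cur.table_name, snap.table_name)   as table_name,
        coalesce(cur.column_name, snap.column_name) as column_name,
        case
            when snap.column_name is null        then 'column added'
            when cur.column_name  is null        then 'column dropped'
            when cur.data_type <> snap.data_type then 'type changed'
        end as change_type
    from information_schema.columns cur
    full outer join meta.schema_snapshot snap
        on  cur.table_schema = snap.table_schema
        and cur.table_name   = snap.table_name
        and cur.column_name  = snap.column_name
    where (cur.table_schema = 'raw' or snap.table_schema = 'raw')   -- the monitored schema
      and (   snap.column_name is null
           or cur.column_name is null
           or cur.data_type <> snap.data_type);

Feeding a report like this into an alert is a small piece of automation, but it moves schema changes from manual discovery to something the data team is told about.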