Friday, February 26, 2021

Data Management-Agile Methodology

Data today is spread across every part of an organization, and the rate at which data arrives keeps increasing, with more and more touch points collecting it. Technology is also evolving rapidly, so it has become very important for organizations to streamline data ingestion and catalog data appropriately. With data strategy now closely tied to business strategy, delivering business value quickly and effectively is critical. In past years a traditional approach was followed for data management; it involved long lead times, and the end product was either delivered late or did not meet the requirements. Fast forward to today and we are in the age of agile delivery, DevOps, and continuous integration and deployment. In the agile world the focus is on delivering business value incrementally. Tackling data projects in an agile way is not cut and dried: there are many dependencies on source systems, and there are provisioning systems that need data within SLAs. To address some of these challenges, there are some key points that can be incorporated, along with tools that use AI techniques.

1. Leadership should embrace Agile top down for data projects, and there should be bottom-up feedback on how agile is working for these projects.
2. Leaders and business partners should provide the framework and the runways, and remove roadblocks, to help the organization adopt Agile. There should be a mindset to tear down methods that don't work in a modern enterprise, and business and technology should come closer together to deliver solutions.
3. Collaboration should be nurtured; allow the business and technology conversations to happen. There will be role-specific responsibilities, but those should not become a roadblock to agile adoption.
4. Budgeting of activities/work needs to change to adopt techniques like Activity Based Costing so that features/epics/deliverables can be funded accordingly.
5. Architecture needs to speed up adoption of latest-generation data management tools like Atlan, Arena and DQLabs in order to facilitate more efficient data ingestion, data profiling/quality and effective lineage.
6. Quality test data should be available, or there should be a framework to generate test data efficiently. This is key to moving work along the agile pipeline and having it ready for deployment. A key part of this is the ability to obfuscate data, especially when working with sensitive information (see the sketch after this list).
7. One of the key aspects of modern data platforms is that the metadata/catalog evolves alongside the data pipelines being built. In such scenarios the right set of data management tools can reduce technical debt. This is a crucial aspect: identifying and handling it early limits the amount of work needed later to fix data gaps.
8. All of the above points need to come together so that the data platform and data management can evolve in an agile way and match the speed of the business. Business, technology, architecture and stakeholders all have a role and responsibility in making Agile Data Management happen.
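As an illustration of the obfuscation point in item 6, below is a minimal SQL sketch, assuming a Snowflake-style warehouse and a hypothetical customers table with customer_id, email, date_of_birth and country columns, that builds a masked copy for use as test data. The table and column names are examples only.

```sql
-- Build a masked copy of a (hypothetical) customers table for test environments.
-- Hashing keeps values consistent across rows without exposing the real data.
CREATE TABLE test_data.customers_masked AS
SELECT
    customer_id,                                      -- surrogate key, safe to keep
    SHA2(email)                       AS email_hash,  -- irreversible hash of the email
    DATE_TRUNC('year', date_of_birth) AS birth_year,  -- coarsen DOB to reduce sensitivity
    country                                           -- non-sensitive attribute kept as-is
FROM prod_data.customers;
```

A masked dataset like this can be refreshed on a schedule so that agile teams always have realistic test data without touching production records.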

Thursday, February 11, 2021

Data Transformation for Cloud - dbt

In this blog post we will focus on loading data from a source into a cloud data platform like Snowflake. There are different tools available to do this; one tool that is gaining traction is dbt (https://docs.getdbt.com/). One of the main highlights of dbt is that it uses SQL for much of the data transformation and loading into a cloud data warehouse like Snowflake. dbt adds certain features on top of plain SQL that make it very flexible for ETL/ELT purposes. There are two ways to use dbt: the command line interface and dbt Cloud. There are a lot of configurations available that can be set up to make the data transformation process efficient and effective. The core concept in dbt is the model. dbt uses models extensively to create tables/views on the cloud data warehouse, and the order in which those tables and views are created is taken care of by dbt based on the references between models. Models allow one to define the base objects and their relationships.
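As a minimal sketch of what a model looks like, the file below (models/customer_orders.sql, a hypothetical name) is plain SQL plus the ref() function; dbt uses the ref() calls to work out the build order and create the resulting table or view. The upstream staging model names are assumptions for illustration.

```sql
-- models/customer_orders.sql (hypothetical model name)
-- Joins two upstream staging models; dbt infers that they must be built first.
select
    c.customer_id,
    c.customer_name,
    count(o.order_id) as order_count
from {{ ref('stg_customers') }} c
left join {{ ref('stg_orders') }} o
    on o.customer_id = c.customer_id
group by c.customer_id, c.customer_name
```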

In order to connect to different data platforms, dbt provides adapters. These adapters allow dbt to connect to the platform and build the data in the target cloud data warehouse. For a list of the available adapters, please check the following link: https://docs.getdbt.com/docs/available-adapters. The adapters are primarily for cloud data warehouses/data lakes like Snowflake, Redshift and BigQuery. To start using dbt one has to create a dbt project; to quote from dbt: "A dbt project is a directory of .sql and .yml files, which dbt uses to transform your data."

Sources, the raw tables already loaded into the warehouse, are declared in .yml files and then referenced in models; see https://docs.getdbt.com/docs/building-a-dbt-project/using-sources.
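As a small sketch, assuming a source named raw with a table called orders has been declared in a .yml file (both names are hypothetical), a model can reference it with the source() function, which also lets dbt trace lineage back to the raw data:

```sql
-- models/stg_orders.sql (hypothetical staging model)
-- Reads from a source declared in a .yml file rather than hard-coding the schema name.
select
    order_id,
    customer_id,
    order_date,
    amount
from {{ source('raw', 'orders') }}
where order_date is not null
```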

Typically in an ETL/ELT operation there are some considerations that need to be taken into account when loading data:

1. Is the data load into the cloud data warehouse going to be a full refresh? If so, how many tables follow this loading type?
2. Is the data load into the cloud data warehouse going to be incremental? If so, how many tables follow this loading type?
3. Are there going to be materialized views that need to be created?
4. Is the warehouse going to have slowly changing dimension tables? (See the snapshot sketch after this list.)
5. How are the relationships going to be defined?
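For the slowly changing dimension question, dbt offers snapshots. Below is a minimal sketch assuming a hypothetical customers source with a customer_id key and an updated_at timestamp; dbt records each change as a new row with validity dates (type 2 behaviour).

```sql
-- snapshots/customers_snapshot.sql (hypothetical snapshot)
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- Each run, rows whose updated_at has changed get a new version with validity dates.
select * from {{ source('raw', 'customers') }}

{% endsnapshot %}
```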

Based on the above factors and the needs of the business, all of these choices can be implemented in dbt. When a lot of data is being sourced for analytic purposes, it is not possible to do full refreshes every day; one might have to load the data incrementally to meet SLAs and improve performance. dbt can also snapshot source freshness, which tells the user whether the data at the source has been updated recently and is ready to be pulled into the warehouse. Quoting from the dbt website: "This is useful for understanding if your data pipelines are in a healthy state, and is a critical component of defining SLAs for your warehouse."
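As a sketch of the incremental pattern, the model below (hypothetical names again) is configured as incremental, so after the first full build dbt only processes rows that are newer than what is already in the target table:

```sql
-- models/fct_orders.sql (hypothetical incremental model)
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    order_date,
    amount
from {{ source('raw', 'orders') }}

{% if is_incremental() %}
-- On incremental runs, only pick up rows newer than the latest already loaded.
where order_date > (select max(order_date) from {{ this }})
{% endif %}
```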
I hope you find the information here useful; designing proper data load and transformation strategies is key to having good data pipelines.



Monday, February 8, 2021

Data Pipelines/Data Transformation

Data is available in abundance in organizations, more so in bigger companies. How to make good use of that data to get valuable insights and add business value is the key strategic question in companies today. Early on, a lot of data warehouses and data marts were built using complex ETL techniques; then came the concept of ELT (Extract, Load and Transform) for moving data. With the advent of big data and AI/ML techniques, there has been a continuous need to improve data integration in order to have successful data projects. One of the challenges companies have had is how to handle the flow of data to gain maximum value. There have been approaches where one set of folks helps with sourcing the data, another set transforms it, and finally the data science folks figure out the value. This approach causes too many hand-off points and creates silos of expertise. To address this problem, a concept that a lot of organizations have adopted recently is the Data Pipeline.

What is a data pipeline? It is the complete end-to-end process of getting data from the source and building out the full lifecycle: sourcing, transformation, data quality checks, AI/ML model generation and data visualization. Using this concept, resources are now engaged to manage the complete pipeline, and data scientists/analysts are encouraged to manage and/or own data pipelines. There are different tools available that help you manage the data pipeline; in some cases the data pipeline is also referred to as a workflow. One tool available today that helps with these concepts is dbt (getdbt.com): "Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow." Please refer to https://docs.getdbt.com/docs/introduction.

Other tools include Matillion (https://www.matillion.com/), Atlan (a data management tool), Arena from Zaloni and a host of others like Collibra. These tools help provide a complete end-to-end perspective of the data being used for different types of data projects. The tools mentioned above are cloud ready, and with companies moving to the cloud to store their data, adopting them should be that much easier.

When data scientists and/or analysts have exposure to the tools mentioned above, they get a complete perspective of the data and understand its lineage, which helps in building out better AI/ML models for the company. A lot of AI/ML efforts fail because of bad data and an inability to understand lineage and dependencies. One of the key aspects to keep in mind while working on data projects is data drift. What does this mean? Data is not static; it changes constantly, and the structure of the data can change too. There can be schema changes, and the granularity and volume of the data can keep fluctuating. The tools mentioned earlier in this post help in understanding these changes to a great extent and in tuning the AI/ML models. There is a lot more research happening in the area of data drift and related tools; I will cover those in a different blog post.
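As a rough illustration of one kind of drift (schema changes), the query below compares the current column list of a table against a previously saved baseline. The ORDERS table and the schema_baseline table are hypothetical; most cloud warehouses expose a similar INFORMATION_SCHEMA.COLUMNS view.

```sql
-- Compare today's columns for a (hypothetical) ORDERS table against a saved baseline.
-- Any row returned means a column was added, dropped or changed type since the baseline.
with current_cols as (
    select column_name, data_type
    from information_schema.columns
    where table_name = 'ORDERS'
),
baseline_cols as (
    select column_name, data_type
    from schema_baseline                -- baseline snapshot captured on an earlier run
    where table_name = 'ORDERS'
)
select
    coalesce(c.column_name, b.column_name) as column_name,
    b.data_type as baseline_type,
    c.data_type as current_type
from current_cols c
full outer join baseline_cols b on b.column_name = c.column_name
where c.column_name is null            -- column dropped
   or b.column_name is null            -- column added
   or c.data_type <> b.data_type;      -- type changed
```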


Monday, January 11, 2021

Data Changes/Data Quality-AI/ML

With AI/ML getting increased adoption in organizations, data is fast becoming the most sought-after asset. The flow of data is becoming more continuous, like opening a tap and letting the water run. One of the trends I have noticed in data projects is that once the desired goal is achieved, there is a tendency to think the work is done. This is where the problem starts: data is not static, it continually changes. When data changes, it can affect the underlying structure or grain of the tables used in AI/ML projects. If these changes are not detected or tracked up front, the AI/ML models can generate inaccurate results. So models that were very accurate three months ago are suddenly not accurate anymore.

Tracking data changes, whether format changes coming from a vendor or schema changes at the source, is very time consuming when done manually. There is a very good opportunity to identify this as a problem area and see whether the gap can be addressed with automated solutions. With infrastructure and related components we have plenty of utilities/dashboards that indicate when certain parameters are reaching a limit and adjustments are needed. By the same analogy, we need tools that can tell data teams and business users about any changes happening in the data. This would let data teams spend more time tuning and optimizing their data models instead of on manual, inefficient tasks. Handling data changes is also critical for product management, given its increasing reliance on data. This is a more abstract post, but these are high-level patterns I have seen in data-related projects; a simple example of the kind of automated check I have in mind follows below.
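As a small sketch of such a check, assuming a hypothetical load_audit table that records the row count of each daily load, the query below flags days where the volume moved more than 20% against a trailing average; the threshold and table name are illustrative only.

```sql
-- Flag daily loads whose row count deviates sharply from the recent trend.
-- load_audit (table_name, load_date, row_count) is a hypothetical audit table.
with recent as (
    select
        table_name,
        load_date,
        row_count,
        avg(row_count) over (
            partition by table_name
            order by load_date
            rows between 7 preceding and 1 preceding
        ) as trailing_avg
    from load_audit
)
select table_name, load_date, row_count, trailing_avg
from recent
where trailing_avg is not null
  and abs(row_count - trailing_avg) > 0.2 * trailing_avg   -- more than a 20% swing
order by load_date desc;
```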

Tuesday, December 15, 2020

AI/ML Framework

There is a lot of focus on artificial intelligence and machine learning in various organizations, and they want to get better insights and value out of such efforts. Significant investments of time and money are made to get AI projects started and delivered. One key aspect to keep in mind is that the success of such projects depends on sourcing the right data, having the right data governance and making sure the efforts align with business goals. Given such an environment, it is imperative to have a general framework for how AI projects are executed, so that truly value-added deliverables are achieved. In the data world we have the following broad roles trying to make an AI project a success.

Possible Personas in a data project.

1. Data Engineers
2. Data Analysts
3. Data Scientists
4. AI/ML Developers, Model Creators
5. End users/Reviewers of Model for Compliance and regulations

Each of the above personas is involved in different stages of a data project and uses different tools to achieve the end goal. When we talk about different stages, the following steps will be present in any data project:

1. Sourcing data - sample tools: IBM DataStage, Informatica; structured databases: Oracle, SQL Server, Teradata; unstructured data: Instabase
2. Organize data / metadata management / lineage analysis / cataloging - Trifacta/Alteryx for data wrangling/prep; Atlan/Collibra for data catalog/governance
3. Build models/experiments/analysis - SAS, Jupyter (Python), R, H2O
4. Quality checks and deployment of ML models - model frameworks such as H2O, Python, R
5. Consumption of data by end users - Tableau, Cognos, MicroStrategy

All of the above require a storage and consumption component; this could be Hadoop/Spark, AWS/Snowflake, Azure or Google Cloud.

When you organize the different tool sets and personas, it provides an overview of what is available and how AI projects could be structured to deliver effective value. For example, if a financial institution needs to consume unstructured data and derive information from it, one could look at leveraging the Instabase platform. Quoting from the site: "The biggest opportunity for efficiency within a business is stifled by document processes that only humans can do. With Instabase, you can build workflows to access information in complex documents and automate business processes." (https://about.instabase.com/)
Providing a platform where all of these tools can be used, so that the different personas have access and can move data from one point to another, would be of great help for any data project. It would allow consistency of operation, help in tracing data, provide better model validation and make sure audit/compliance requirements are met.
Successful data projects that include AI/ML have the above ingredients in the right mix, well tracked and cataloged, take care of changes in data over time, and align closely with business objectives.
Happy Holidays and  a Very Happy, pandemic free New Year 2021, Stay safe everyone.


Monday, December 7, 2020

Snowflake - UI Components

Snowflake is fast becoming a very important component of cloud migration strategies in the business/technology world today. In discussions with different leaders and at various conferences, I have seen a lot of interest in Snowflake. One good thing is that there is a trial period for Snowflake that one can sign up for on the Snowflake web site. The sign-up process is very straightforward, and once you have it set up you are given the option to go through introductory material, and there is some very good documentation on the different areas of Snowflake. Once you sign in, you see the main web interface. Different components are listed in the interface; a lot of it looks similar to the SQL Server Management Studio layout, especially the object explorer. The main components in the UI are:

1. Databases - lists the databases in the Snowflake instance.
2. Shares - related to data sharing within your organization.
3. Data Marketplace - allows one to look at the Snowflake Data Marketplace, which lists public data sources available in different categories like government, financial, health and sports.
4. Warehouses - lists the virtual warehouses available on the Snowflake instance.
5. Worksheets - where one can write SQL queries like the example below. There is a sample database called DEMO_DB with a list of tables that can be used for querying. These queries can also be used for building out the dashboards available alongside the Data Marketplace option.
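As an illustration of the kind of query one can run in a worksheet, here is a simple example. It assumes the SNOWFLAKE_SAMPLE_DATA database (the TPC-H sample schema) that trial accounts typically include, and a default warehouse name that may differ in your account; adjust the names or point the query at DEMO_DB tables instead.

```sql
-- Top 10 customers by total order value in the TPC-H sample data.
USE WAREHOUSE COMPUTE_WH;   -- typical trial warehouse name; adjust if yours differs
SELECT
    c.C_NAME            AS customer_name,
    SUM(o.O_TOTALPRICE) AS total_order_value
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER c
JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS o
    ON o.O_CUSTKEY = c.C_CUSTKEY
GROUP BY c.C_NAME
ORDER BY total_order_value DESC
LIMIT 10;
```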


One can also load data into Snowflake using the different data loading strategies available, which I have discussed in an earlier blog post.




Wednesday, November 18, 2020

Data Cloud

One of the more commonly used buzzwords these days is Data Cloud; it has been used as a marketing term mainly in the cloud domain across different businesses and organizations. There is a concept underneath the term: it is mainly about having data available in, or migrated to, public cloud offerings such as AWS, Azure and Google Cloud. One of the key projects undertaken by lots of organizations across different businesses is making data available in the cloud without compromising on security. I had the opportunity to attend the Data Cloud Summit 2020 organized by Snowflake. It was a virtual event where features of Snowflake were discussed in different sessions, and there were also use cases presented by customers and vendor partners on how they are utilizing Snowflake for their data projects and how much impact the product has had on their business. Below are some interesting points I picked up from the different sessions; they cover a variety of topics related to data.

1. Compute/Storage: Snowflake separates compute from storage. This is one of the main concepts in the product, and it was highlighted by a lot of customers as something that helps them in their daily data operations and business (see the sketch after this list).
2. Scalability: the ability to ingest multiple workloads; this is a common requirement across all customers.
3. Simplify: simplification of the data pipeline. How can one get raw data and turn it into actionable insights quickly (the lapse time)? One of the questions raised was whether all transformation of the data has to happen in the early hours of the morning, or whether it can be spread out or done in real time.
4. Data Silos: breaking down data silos is a significant effort being undertaken by different organizations. Data silos have a direct and indirect negative impact on cost and efficiency. One of the reasons for using a product like Snowflake is to break down the silos and have the data in one place, which allows better understandability and searchability of the data in an organization.
5. Proof of Value: data cloud products or cloud offerings need to provide proof of value. It has to be tangible for the business: how does the investment in cloud provide better results?
6. Orchestration: since the movement to cloud infrastructure is taking place at a different pace in each organization, there needs to be better orchestration across multiple cloud installations. This can lead to better abstraction, and it is a challenge a lot of companies face today.
7. Data is an Asset: data can be monetized by generating value for the business and by reducing costs.
8. Support: Snowflake provides good, cost-effective support tools. Some customers explained how the uptime of Snowflake has been very good in spite of the huge data loads coming into the system.
9. Data: what type of information needs to be sent out or provisioned. One of the guests mentioned two important aspects with respect to data: 1. the information a person needs to know, and 2. how that information will affect you.
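As a small sketch of the compute/storage separation mentioned in point 1, in Snowflake the data sits in storage while compute is provided by virtual warehouses that can be created, resized and suspended independently. The warehouse name and sizes below are just examples.

```sql
-- Compute is a virtual warehouse, created and scaled independently of the stored data.
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
    WITH WAREHOUSE_SIZE = 'XSMALL'   -- start small; resize without touching the data
         AUTO_SUSPEND   = 60         -- suspend after 60 seconds idle to save credits
         AUTO_RESUME    = TRUE;      -- wake up automatically when a query arrives

-- The same stored tables can be queried by this or any other warehouse.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'MEDIUM';  -- scale compute up for a heavier workload
```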

Overall, a lot of information for a single-day event; I am sure each of the aspects mentioned above can lead to deeper discussions and/or projects. The event provided an overall perspective of where things are headed in the data space and how companies are planning their work in the coming years.