Sunday, August 29, 2021

Exploratory Data Analysis - EDA

One of the key aspects of getting into AI/ML is to understand the data: what it represents and whether it is a good sample for the business problem you are trying to solve. Answering these questions requires a systematic way of examining the data, and this is where Exploratory Data Analysis (EDA) comes into the picture. You can do EDA with dedicated tools, but it is also possible to start spotting patterns with something as simple as Excel. You need not feel pressured by not knowing tools such as Python or R; you can gradually learn and pick up the techniques available in those tools.

The first step is to get the data from a trusted source; make sure it is a source that is relevant to the business problem you are trying to solve. The data can be in different forms such as text files, CSV, Excel spreadsheets, logs or relational databases. Once you identify the source, choose a platform you want to bring the data into. For example, let us assume a spreadsheet is how you want to analyse the data.

1. Get the data formatted into a spreadsheet; if you are using a tool like Python, read the data into a dataframe using pandas.
2. Understand the business problem you are trying to solve, and get to know the domain area of the business.
3. Try as much as possible not to be biased when you review the data points.
4. Discard data points that are redundant or would not offer any value. If you have doubts, ask the SME or the folks who know the business/data.
5. Identify how many missing observations there are. If a lot of data is missing, discuss with the data source and get valid data.
6. A certain margin of missing values is acceptable, some say 5%. Identify those data attributes.
7. Begin to classify the data; you can use pivot tables/charts in Excel, or pandas in Python, to start grouping/categorizing the data.
8. Document the trends; there are techniques like First Principles thinking (where one sets aside biases and analyzes the data from scratch).
9. Review whether the data is balanced. For example, if you are building a batch prediction model based on the success/failures of daily jobs and more than 80% of the records are successes (or 80% failures), then the data is skewed. Imbalance in data can cause models to generate bad predictions.
10. Add features to the data, meaning add additional data points to enhance it, such as categorical variables.
11. Once you have documented all your findings, you can proceed to the next step of determining the type of model needed.
Here is a link to First Principles thinking
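A quick way to cover steps 1, 5, 7 and 9 in Python is shown below. This is a minimal sketch of my own, assuming a hypothetical file named "jobs.csv" with a "status" column; adjust the file name and columns to your own data.

import pandas as pd

# Step 1: read the raw data into a dataframe
df = pd.read_csv("jobs.csv")

# Steps 5/6: percentage of missing observations per column
missing_pct = df.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))

# Step 7: start grouping/categorizing the data
print(df.groupby("status").size())

# Step 9: check whether the data is balanced (e.g. success vs. failure share)
print(df["status"].value_counts(normalize=True))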


Saturday, July 31, 2021

Kafka - Event Processing

The traditional data world, which started to take shape in the 1980s and became very popular during the 2000s, is built on the concept of storing data based on entities and structured schemas.
The field of Business Intelligence grew exponentially during this period and was the main area driving analytics. Reporting and dashboards grew along with the flourishing of BI. The infrastructure driving this was mainly servers, data centers and relational databases, and in order to scale operations there were concepts such as web farms and clustering. With this type of data structure, if we had to track changes in data or entities, we relied on triggers and stored procedures. This was fine initially for providing such information to business and stakeholders. As the amount of data grew and requirements became more complex and time sensitive, there was a need to move to a scalable architecture. With the advent of big data technologies, one technology has grown in popularity and usage: Kafka.
Here are some basic concepts in Kafka; there are lots of online tutorials with more in-depth explanations of Kafka.
Producer - An application that sends message records (data) to Kafka; each record is an array of bytes.
For example, all records in a table can be sent as messages - collect the result of a query and send each row as a message.
You need to create a Producer application.
Consumer - An application that receives the data. The Producer sends data to the Kafka server, and the application requesting data from the server is a Consumer.
Producer -> Kafka Server -> Consumer
Broker - The Kafka server itself; it acts as a broker between Producer and Consumer.
Cluster - Can contain multiple brokers.
Topic - A name given to a data set/stream. For example, a topic can be called Global Orders.
Partition - A single broker could have a challenge storing large amounts of data, so Kafka can break a topic into partitions. How many partitions are needed is a decision we make per topic. Every partition sits on a single machine.
Offset - The sequence number of a message in a partition. Offsets start from 0 and are local to the partition. To access a message you need the topic name, partition number and offset number.

Consumer Group - A group of consumers whose members share the work.
Example: a retail chain with billing counters. There is a producer for each billing location sending messages, and consumers receive these messages. You create a cluster, partition the topic, and the members of a consumer group then share the set of partitions among themselves.
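To make these concepts concrete, here is a minimal sketch of my own using the kafka-python library (not part of the original notes), assuming a broker running on localhost:9092 and a hypothetical topic named global-orders.

from kafka import KafkaProducer, KafkaConsumer

# Producer: send each order as a message (an array of bytes) to the topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("global-orders", b"order-1001,store-7,49.99")
producer.flush()

# Consumer: part of a consumer group; members of the group share the partitions
consumer = KafkaConsumer(
    "global-orders",
    bootstrap_servers="localhost:9092",
    group_id="billing-consumers",
    auto_offset_reset="earliest",
)
for message in consumer:
    # topic name, partition number and offset identify each message
    print(message.topic, message.partition, message.offset, message.value)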


These are some basic concepts in Kafka...


Saturday, July 3, 2021

Data Governance - Data Management

Recently I had the opportunity to present at Datavader, a data science learning portal run by Shruthi Pandey. The topic of the presentation was Data Governance; it was a fun and interactive session with good participation, given the importance of the topic. We had a Q&A session at the end of the presentation. I am attaching some of the content from my presentation deck; it is a simple compilation of articles and diagrams from different blogs and websites to convey the importance of Data Governance.

Datavader link : https://datavader.circle.so/home

Best Data Governance Tools

Here are a few Data Governance tools (just a sample):

All of the tools below share a common theme - providing a modern-day Data Governance and Data Catalog platform.

They can connect to platforms such as AWS, Azure, Google Cloud and Snowflake (a DataOps platform).

They help you manage data assets, provide Google-like search interfaces, and make it easy to manage data lineage.

Alation - https://www.alation.com/solutions/data-governance/ 

Atlan - https://atlan.com/
Zaloni (Arena) - https://www.zaloni.com/arena-overview/ 

Collibra - https://www.collibra.com/

Talend - https://www.talend.com/

Data Management Discipline with Python

The 4 C’s

Completeness — How much of the data that’s expected is there? Are the ‘important’ columns filled out? 

Consistency — Are data values the same across data sets?
Conformity — Does the data comply with specified formats?
Curiosity — Are stakeholders knowledgeable and/or engaged in the data management lifecycle?
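
As a quick illustration of the completeness and conformity checks above, here is a small pandas sketch of my own (the file name customers.csv and the signup_date column are hypothetical).

import pandas as pd

df = pd.read_csv("customers.csv")

# Completeness: percentage of missing values in each column
print(df.isna().mean() * 100)

# Conformity: how many populated values fail to parse as dates
parsed = pd.to_datetime(df["signup_date"], errors="coerce")
print((parsed.isna() & df["signup_date"].notna()).sum(), "values do not conform to a date format")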

The 4 V’s

Volume — How much information is present?
Velocity — What’s the frequency of incoming data? How often is the data processed and propagated? 

Variety — What types of data exist? Is the data structured, unstructured or semi-structured?
Veracity — Is the data trustworthy? What are the inherent discrepancies?

import pandas as pd
import pandas_profiling

df = pd.read_csv("raw_data.csv")
df.describe()

profile = pandas_profiling.ProfileReport(df)
profile.to_file("profile.html")

https://towardsdatascience.com/automate-your-data-management-discipline-with-python-d7f3e1d78a89


Thursday, May 27, 2021

Data Science Life Cycle

Today in many organizations, especially where data projects are being rolled out or data is being sold as a product, one of the important aspects to be considered is the data science life cycle. This becomes very important when one is trying to use insights to drive the business forward. This could be in areas such as the following (just a sample):

1. Revenue Generation
2. Customer Experience
3. Customer attrition
4. Increase Engagement

Here are some of the key Steps in a DS Life cycle:

1. Identification of Data Sources - data can be structured or unstructured.
2. Data Extraction
3. Data Storage (Persistence of Data) - what type of data storage: On-Prem, Public Cloud, Private Cloud or Hybrid Cloud.
4. Data Engineering - Data Profiling, Data Quality Checks, Cleansing of Data, Data Transformation, and handling of bad data from the source.
5. Data Science Layer - This includes the following:
    a. Data Transformation - Addition of Categorical Variables
    b. Feature Engineering - Enhance the data set to make it more meaningful - requires domain expertise.
    c. Model Training, Testing and Validation (see the sketch after this list)
    d. Deployment of Models to Production
    e. Documentation of Models
    f. All of the above steps could be iterative as new data keeps coming in.
6. Data Delivery - How data is going to be sent to end users, stakeholders
   a. Method of Data Delivery - Web Services, Visualizations, Reports
   b. Actionable insights - There needs to be action taken on the insights provided. Today in organizations there is an issue of these insights not even being looked at; how an organization goes about this is very important.
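
As an illustration of step 5c (training, testing and validating a model), here is a minimal scikit-learn sketch of my own, assuming a hypothetical prepared dataframe with numeric features and a binary target column named "churn".

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("customer_data.csv")   # hypothetical prepared data set
X = df.drop(columns=["churn"])          # features (after feature engineering)
y = df["churn"]                         # target, e.g. customer attrition

# Split into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and validate a simple baseline model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Validation accuracy:", accuracy_score(y_test, model.predict(X_test)))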

Data Lineage/Governance - This step covers all aspects of the steps mentioned here. It includes a data catalog, a metadata repository, data value chains, and capturing important data assets - how data is represented across the enterprise.




Thursday, April 22, 2021

Data Ingestion-Hive/Impala/HUE

With the emergence of big data and cloud technologies, there is a major push to move data from legacy systems to the cloud. The reasons for the migration are to provide better data access, allow for more advanced analytics, and deliver better insights and customer experience. This type of migration also provides more self-service capabilities. One of the initiatives I have been involved with is moving data from Oracle databases to Hadoop. There are a lot of pieces involved in moving data from an RDBMS to Hadoop. Here are some of the steps involved, along with the tools used.

1. Extraction of data from the RDBMS to HDFS - What types of tools are available? Sqoop works for pulling data from an RDBMS to HDFS; one thing to watch out for is performance and how big the tables are. The data is extracted in the form of text files, so it is good to be clear about what type of data needs to be on HDFS. There are other tools like Zaloni (Arena) and Atlan, and some of these work well with public cloud providers.

2. Design the structures where the files will be stored - The approach I am using is to utilize Hive, wherein we have different schemas depending on the classification of data. Create tables in these schemas and pull in the data using the text files created in Step 1. There are two main types of Hive tables - internal and external. Hive supports SQL; it is very similar to SQL in an RDBMS with a few variations. There is a UI available called HUE (Hadoop User Experience). It is web based, so with the right permissions one can connect to Hive. Once you log in, you can view the schemas/tables that have been created in Hive, write SQL-type queries and view the results. Hive is Apache based, whereas Impala is Cloudera based. The difference is that in Hive you have metadata and query capability, whereas in Impala you mainly work with queries.

3. Typically when you first load data into Hive, the files are in text format, and these text files are loaded into tables. When you are ready to make the data available for processing, I decided to create separate schemas where the data is still stored in tables but in Parquet format (see the sketch at the end of this post). This allows for compression and is a columnar format, which works well for analytics purposes; for our installations we use Parquet. There are other big data formats that can be used:

Avro, Parquet, ORC, see the link below for comparisons.

https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/

4. Once Steps 1 through 3 are completed, you are in a position to start provisioning the data; typically this data can feed into a data lake. The data lake can be on premise, or it could be in a private or public cloud based on the business need.
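
Here is a minimal sketch of the Step 2/3 pattern using the PyHive library (my own illustration, not part of the original migration; the host, schema and table names are hypothetical).

from pyhive import hive

conn = hive.Connection(host="hadoop-edge-node", port=10000, username="etl_user")
cursor = conn.cursor()

# Step 2: external table over the text files that Sqoop landed on HDFS
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.orders_text (
        order_id INT, customer_id INT, amount DOUBLE, order_date STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/staging/orders'
""")

# Step 3: copy into a Parquet table in the analytics schema (columnar, compressed)
cursor.execute("""
    CREATE TABLE IF NOT EXISTS analytics.orders_parquet
    STORED AS PARQUET
    AS SELECT * FROM staging.orders_text
""")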

Monday, April 5, 2021

Data Processing Unified - Omniscope

There are a lot of tools coming out in the market related to data management, analytics and visualization. It is becoming harder for data practitioners to choose tools, so it is natural that we also expect to see consolidation of tools that handle data sourcing, management, transformation and visualization. One such tool that I got to see demonstrated is Omniscope. To see what the tool offers, please visit https://visokio.com/. The tool's goal is to unify the following components of a data journey:

  1. Data Sourcing
  2. Data Blending - This is the step where you can combine data from different data sources. This is very handy when, say, you are trying to combine customer information that is on a system different from the borrower/loan information. This is one example; you will have scenarios where you have to perform this step.
  3. Data Transformation - This is the step where you take the source or the blended data and add additional attributes/calculations based on your end business goal.
  4. Data Lineage - The functionality that provides the journey of a data element as it flows from source to different target systems. This helps in also understanding how a data element is used in different calculations/reports.
  5. Data Visualization - The layer that displays the data in different forms (like graphs, charts, plots)

The tool helps data curators/scientists/analysts maintain the pipeline of data coming through and how it is to be consumed by users. There are plenty of visualizations available in this tool, and the color schemes used are very good. The tool also attempts to maintain lineage of the data used for a report, thereby providing an opportunity for end-to-end governance. In addition, there are features related to scheduling and automation. Please refer to the link above for a more in-depth look at the features of this tool.


Wednesday, March 17, 2021

Data Catalog/Documentation

One of the key aspects of Data Governance is the ability to have very good data documentation. Here we are discussing the nature of the data being collected. It is very important to have this information documented, as it can help in determining how the data can be used in various applications, including building machine learning models. I am including here a very good blog post written by Prukalpa Shankar (co-founder of Atlan). The article discusses the 5W1H framework extensively. What is the 5W1H?

  1. WHAT
  2. WHO
  3. WHY
  4. WHEN
  5. WHERE
  6. HOW
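
As a simple illustration of my own (not from the article), the 5W1H answers for a data set can be captured as a small metadata record; all values below are hypothetical.

# Hypothetical 5W1H documentation entry for a data set
dataset_doc = {
    "what":  "Daily customer orders extracted from the ordering system",
    "who":   "Owned by the Sales Data team; produced by the order service",
    "why":   "Used for revenue reporting and churn models",
    "when":  "Loaded every night at 2 AM UTC; history available since 2018",
    "where": "s3://company-data-lake/sales/orders/ (hypothetical location)",
    "how":   "Ingested via Sqoop, transformed with dbt, stored as Parquet",
}

for question, answer in dataset_doc.items():
    print(f"{question.upper()}: {answer}")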

https://towardsdatascience.com/data-documentation-woes-heres-a-framework-6aba8f20626c

Excellent insights are provided in the above post, please read and Keep Learning.

Friday, February 26, 2021

Data Management-Agile Methodology

Data today is spread everywhere in organizations; the rate at which data comes into an organization is increasing rapidly, with various touch points collecting it. Technology is also evolving at a rapid pace, so it has become very important for organizations to streamline data ingestion and catalog data appropriately. With data strategy closely tied to business strategy, it has become critical to deliver business value effectively and quickly. In past years a traditional approach was followed for data management; it involved long lead times, and the end product was either delivered late or did not meet the requirements. Fast forward, and we are now in the age of agile delivery, DevOps, and continuous integration and deployment. In the agile world the focus is to deliver business value incrementally. How do we tackle data projects in the agile world? It is not cut and dried: there are a lot of dependencies on source systems, and there are provisioning systems that need data within SLAs. In order to address some of these challenges, there are some key points that can be incorporated, along with making use of tools that incorporate AI techniques.

1. Leadership should embrace Agile top down for Data Projects and there should be bottom up feedback on how agile is working for these projects.
2. Leaders/business partners should provide the framework, remove roadblocks, and create runways that help an organization adopt Agile. There should be a mindset to tear down methods that wouldn't work in a modern enterprise; business and technology should come closer together to deliver solutions.
3. Collaboration should be nurtured; allow the business and technology conversations to happen. There will be role-specific responsibilities, but that should not become a roadblock to Agile adoption.
4. Budgeting of activities/work need to change to adopt techniques like Activity Based Costing so that features/epics/deliverables can be funded accordingly.
5. Architecture needs to speed up adoption of latest generation Data Management Tools like Atlan,Arena and DQLabs in order to facilitate more efficient data ingestion, data profiling/quality and build effective lineage.
6. Availability of quality test data, or a framework to generate test data efficiently, is really key to moving work along the agile pipeline and having work ready for deployment. A key part of this is the ability to obfuscate the data, especially when working with sensitive information.
7. One of the key aspects of modern data platforms is to have the metadata/catalog evolve alongside the data pipelines being built. In such scenarios the right set of data management tools can reduce technical debt. This is a crucial aspect; identifying and handling it can limit the amount of work needed to fix data gap issues.
8. All of the above points need to come together so that you can evolve the data platform and data management in an agile way and match the speed of the business. Business, technology, architecture and stakeholders all have a role and responsibility in making Agile Data Management happen.

Thursday, February 11, 2021

Data Transformation for Cloud - dbt

In this blog post we will focus on loading data from a valid source into a cloud data platform like Snowflake. There are different tools available to do this; one tool that is gaining traction is dbt (https://docs.getdbt.com/). One of the main highlights of dbt is that it uses SQL for a lot of the data transformation/loading into a cloud data warehouse like Snowflake, with certain additions on top of SQL that make it very flexible for ETL/ELT purposes. There are two ways to use dbt: the Command Line Interface and dbt Cloud. There are a lot of configurations available which can be set up to make the data transformation process efficient and effective. The core concept in dbt is the model. dbt uses models extensively to create tables/views in the cloud data warehouse, and the order in which those tables and views are created is taken care of by dbt through the models. Models allow one to define the base objects and relationships.

In order to connect to the different data platforms, dbt provides adapters. These adapters allow dbt to connect to the platform and load data into the target cloud data warehouse. For a list of the available adapters, please check the following link: https://docs.getdbt.com/docs/available-adapters. The adapters are primarily for cloud data warehouses/data lakes like Snowflake, Redshift and BigQuery. In order to start using dbt, one has to create a dbt project; to quote from dbt: "A dbt project is a directory of .sql and .yml files, which dbt uses to transform your data."

https://docs.getdbt.com/docs/building-a-dbt-project/using-sources

Typically in an ETL/ELT operation there are some considerations that need to be taken into account for loading data:

1. Is the data load into the cloud data warehouse going to be a full refresh? If so, how many tables follow this loading type?
2. Is the data load into the cloud data warehouse going to be incremental? If so, how many tables follow this loading type?
3. Are there going to be materialized views that need to be created?
4. Is the warehouse going to have slowly changing dimension tables?
5. How are the relationships going to be defined?

Based on the above factors and the needs of the business, all of the above choices can be implemented in dbt. When a lot of data is being sourced and needs to be used for analytic purposes, it is not possible to do full refreshes every day; one might have to load the data incrementally to meet SLAs and improve performance. dbt also provides source freshness snapshots, which tell the user whether the data at the source has been updated and can be pulled into the data store. Quoting from the dbt website: "This is useful for understanding if your data pipelines are in a healthy state, and is a critical component of defining SLAs for your warehouse."
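
As a rough sketch of my own (assuming the dbt CLI is installed and a project is already configured; command names are as of the dbt versions current at the time of writing), the full-refresh vs. incremental choices above can be orchestrated from Python by calling the CLI:

import subprocess

def run(cmd):
    # Run a dbt CLI command and stop if it fails
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Check whether source data has been updated recently (source freshness)
run(["dbt", "source", "snapshot-freshness"])

# Incremental models only process new/changed rows on a normal run
run(["dbt", "run"])

# Occasionally rebuild everything from scratch (full refresh)
# run(["dbt", "run", "--full-refresh"])

# Validate the loaded data
run(["dbt", "test"])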
I hope you find the information here useful; designing proper data load and transformation strategies is key to having good data pipelines.



Monday, February 8, 2021

Data Pipelines/Data Transformation

Data is available in abundance in organizations, even more so in bigger companies. How to make good use of the data to get valuable insights and add business value is the key strategic question in companies today. Early on, a lot of data warehouses and data marts were built using complex ETL techniques; then came the concept of ELT (Extract, Load and Transform) for transporting data. With the advent of big data and AI/ML techniques there has been a continuous need to improve data integration in order to have successful data projects. One of the challenges companies have had is how to handle the flow of data in order to gain maximum value. There have been approaches such as having one set of folks help with sourcing the data, another set transform it, and finally the data science folks figure out the value. This approach has caused too many handoff points and a lot of silos of expertise. In order to address this problem, a new concept has recently been adopted by a lot of organizations: the Data Pipeline.

What is a data pipeline? It is the complete end-to-end process of getting the data from the source and building out the complete lifecycle (sourcing, transformation, data quality checks, AI/ML model generation and data visualization). Using this concept, resources are now engaged to manage the complete pipeline, and data scientists/analysts are encouraged to manage and/or own data pipelines. There are different tools available that help you manage the data pipeline; in some cases data pipelines are also referred to as workflows. One of the tools available today is dbt (getdbt.com): "Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow." Please refer to https://docs.getdbt.com/docs/introduction.

Others include Matillion (https://www.matillion.com/), Atlan (a data management tool), Arena from Zaloni, and a host of other tools like Collibra. These tools help provide a complete end-to-end perspective of the data being used for different types of data projects. The tools mentioned above are cloud ready, and with companies moving to the cloud to store all of their data, adopting them should hopefully be easier.

When data scientists and/or analysts have exposure to the tools mentioned above, they get a complete perspective of the data and understand its lineage, which helps in building better AI/ML models that can be used by the company. A lot of AI/ML efforts fail because of bad data and an inability to understand lineage and dependencies. One of the key aspects to keep in mind while working on data projects is data drift. What does this mean? Data is not static; it keeps changing constantly, and its structure can change too. There can be changes in schema, and the granularity and volume of the data can keep fluctuating. The tools mentioned earlier in this post help in understanding these changes to a great extent and help in tuning the AI/ML models. There is a lot more research in the area of data drift and related tools; I will cover those in a different blog post.
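
As a simple illustration of what a basic data drift check can look like (my own sketch, assuming two hypothetical CSV batches with the same expected layout):

import pandas as pd

reference = pd.read_csv("orders_last_month.csv")   # hypothetical baseline batch
current = pd.read_csv("orders_this_month.csv")     # hypothetical new batch

# Compare basic statistics of the numeric columns between the two batches
numeric_cols = reference.select_dtypes("number").columns
for col in numeric_cols:
    ref_mean, cur_mean = reference[col].mean(), current[col].mean()
    # Flag a column if its mean has shifted by more than 20% (arbitrary threshold)
    if ref_mean != 0 and abs(cur_mean - ref_mean) / abs(ref_mean) > 0.2:
        print(f"Possible drift in '{col}': mean {ref_mean:.2f} -> {cur_mean:.2f}")

# Volume check: has the number of rows fluctuated significantly?
print("Row counts:", len(reference), "->", len(current))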


Monday, January 11, 2021

Data Changes/Data Quality-AI/ML

With AI/ML getting increased adoption in organizations, data is fast becoming the most sought-after asset. The flow of data is becoming more continuous, just like opening a tap and letting the water flow. One of the trends I have noticed in data projects is that once the desired goal is achieved, there is a tendency to think the work is done. This is where the problem starts: data is not static, it continually changes. When changes in data happen, they can affect the underlying structure or grain of the tables used in AI/ML projects. If these changes are not detected or tracked up front, they can cause the AI/ML models to generate inaccurate results. So the AI/ML models that were very accurate three months ago are not accurate anymore.

Tracking data changes, whether in the format coming from a vendor or in the schema of the source, is very time consuming and largely manual and inefficient. There is a very good opportunity to identify this as a problem area and look at whether the gap can be addressed with automated solutions. For infrastructure and related components we have lots of utilities and dashboards that indicate when certain parameters are reaching a limit and adjustments are needed. Using the same analogy, we need tools that can tell the data teams/business users about any changes happening in the data. This would help the data teams spend more time on tuning/optimizing the data models rather than on manual, inefficient tasks. Handling data changes is also very critical for product management, given its increasing reliance on data. This is a somewhat abstract post, but these are high-level patterns I have seen in data-related projects.
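
As a small sketch of what such an automated check could look like (my own illustration, assuming a hypothetical incoming vendor file and an expected schema that you maintain yourself):

import pandas as pd

# Expected schema for the incoming vendor file (hypothetical)
expected_schema = {"order_id": "int64", "customer_id": "int64",
                   "amount": "float64", "order_date": "object"}

df = pd.read_csv("vendor_feed.csv")
actual_schema = {col: str(dtype) for col, dtype in df.dtypes.items()}

# Columns that disappeared or were added since the schema was documented
missing = set(expected_schema) - set(actual_schema)
added = set(actual_schema) - set(expected_schema)

# Columns whose data type changed
changed = {c for c in expected_schema.keys() & actual_schema.keys()
           if expected_schema[c] != actual_schema[c]}

if missing or added or changed:
    print("Schema change detected:", missing, added, changed)
else:
    print("Incoming data matches the expected schema.")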