Sunday, August 29, 2021

Exploratory Data Analysis - EDA

One of the key aspects of getting into AI/ML is to understand the data: what the data is trying to represent, and whether the data is a good sample for the business problem one is trying to solve. In order to answer these questions, there needs to be a scientific way to understand the data, and this is where Exploratory Data Analysis comes into the picture. To do EDA one can use specialized tools, but it is also possible to start getting to know the patterns with something as simple as Excel. One need not feel the pressure of not knowing tools such as Python or R; you can gradually learn and grow into the techniques available in those tools.

The first step is to get the data from a trusted source, and to make sure it is a source that is relevant to the business problem we are trying to solve. The data can be in different forms such as text files, CSV, Excel spreadsheets, logs or relational databases. Once you identify the source, choose a platform you want to bring the data into. For example, let us assume a spreadsheet is how you want to analyse the data.

1. Get the data formatted into a spreadsheet; if you are using a tool like Python, read the data into a dataframe using pandas.
2. Understand the business problem you are trying to solve, and get to know the domain area of the business.
3. Review the data points, trying as much as possible not to be biased.
4. Discard data points that are redundant or would not offer any value. In case you have doubts, ask the SME or the folks who know the business/data.
5. Identify how many missing observations there are. If a lot of data is missing, discuss with the data source and get valid data.
6. A certain margin of missing values is acceptable, some say 5%. Identify those data attributes.
7. Begin to classify the data; one can use pivot tables/charts in Excel, or pandas in Python, to start grouping/categorizing the data (see the sketch after this list).
8. Document the trends; there are techniques like First Principles thinking (where one sets aside biases and starts to analyze the data from scratch).
9. Review whether the data is balanced. For example, if you are building a batch prediction model based on the success/failures of daily jobs, and there are more than 80% successes or 80% failures, then the data is skewed. Imbalance in data can cause models to generate bad predictions.
10. Add features to the data, meaning add additional data points to enhance the data, like adding categorical variables.
11. Once you have documented all your findings, you can proceed to the next step of determining the type of model needed.
Here is a link to First Principles thinking
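As a minimal sketch of steps 1, 5, 7 and 9 above in Python with pandas (the file name, column names and the 5% threshold are just illustrative assumptions):

import pandas as pd

# Step 1: read the raw data into a dataframe (the file name is a placeholder).
df = pd.read_csv("daily_jobs.csv")

# Steps 5 and 6: percentage of missing observations per attribute.
missing_pct = df.isna().mean() * 100
print(missing_pct[missing_pct > 5])  # attributes above the ~5% margin

# Step 7: start grouping/categorizing, similar to a pivot table in Excel.
print(df.groupby("job_name")["status"].value_counts())

# Step 9: check whether the data is balanced (e.g. success vs. failure).
print(df["status"].value_counts(normalize=True))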


Saturday, July 31, 2021

Kafka - Event Processing

The traditional data world, which started to come into the picture in the 1980s and became very popular during the 2000s, is built on the concept of storing data based on entities and structured schemas.
The field of Business Intelligence rose exponentially during this period and was the main area driving analytics. Reporting and dashboards grew with the flourishing of BI. The infrastructure driving this was mainly servers, data centers and relational databases, and to scale operations there were concepts such as web farms and clustering. With this type of data structure, if we had to track changes in data or entities, we relied on triggers and stored procedures. This was fine initially for providing such information to the business and stakeholders. As the amount of data grew and requirements became more complex and time sensitive, there was a need to move to a scalable architecture. With the advent of big data technologies, one technology has grown in popularity and usage: Kafka.
Here are some basic concepts in Kafka; there are plenty of online tutorials with more in-depth explanations.
Producer - An application that sends message records (data) to Kafka; each message is an array of bytes. For example, all the records in a table can be sent as messages: collect the result of a query and send each row as a message. You need to create a producer application to do this.
Consumer - An application that receives the data. The producer sends data to the Kafka server, and the application that requests data from it is a consumer (Producer -> Kafka Server -> Consumer).
Broker - Another name for the Kafka server; it acts as the broker between producer and consumer.
Cluster - A group of brokers; a cluster can contain multiple brokers.
Topic - A name given to a data set/stream. For example, a topic could be called Global Orders.
Partition - A single broker could have a challenge storing large amounts of data, so Kafka can break a topic into partitions. How many partitions are needed is a decision we make per topic. Every partition sits on a single machine.
Offset - A sequence number of a message within a partition. Offsets start from 0 and are local to the partition. To access a message you need the topic name, the partition number and the offset number.
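To make these terms concrete, here is a minimal, hedged sketch using the kafka-python package; the broker address and the topic name are assumptions for illustration:

from kafka import KafkaProducer

# Connect to a broker (the address is a placeholder for your Kafka server).
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Send each row of a query result as a message (an array of bytes).
for row in [b"order-1", b"order-2", b"order-3"]:
    metadata = producer.send("global-orders", value=row).get(timeout=10)
    # The broker returns the partition and offset that identify this message.
    print(metadata.partition, metadata.offset)

producer.flush()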

Consumer Group - A group of consumers; the members of the group share the work. As an example, think of a retail chain with billing counters at many locations. A producer at each billing location sends messages, and consumers receive those messages. We create a cluster and break the topic into partitions, and the consumers in a consumer group then split the partitions among themselves.
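A minimal sketch of a consumer joining a group, again with kafka-python (the topic, group and broker names are made up for the example); running several copies of this script shares the partitions among them:

from kafka import KafkaConsumer

# Consumers with the same group_id split the topic's partitions between them.
consumer = KafkaConsumer(
    "billing-events",
    bootstrap_servers="localhost:9092",
    group_id="billing-consumers",
    auto_offset_reset="earliest",
)

for message in consumer:
    # Each message is identified by its topic, partition and offset.
    print(message.partition, message.offset, message.value)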


These are some basic concepts in Kafka...


Saturday, July 3, 2021

Data Governance - Data Management

Recently I had the opportunity to present at Datavader, a data science learning portal run by Shruthi Pandey. The topic of the presentation was Data Governance; it was a fun and interactive session with good participation, given the importance of the topic. We had a Q&A session at the end of the presentation. I am attaching some of the content from my presentation deck; it is a simple compilation of articles and diagrams from different blogs and websites to convey the importance of Data Governance.

Datavader link : https://datavader.circle.so/home

Best Data Governance Tools

Here are a few Data Governance tools; these are just a sample.

All of the tools below share a common theme: they aim to provide a modern-day Data Governance and Data Catalog platform. They can connect to platforms such as AWS, Azure, Google Cloud and Snowflake (a DataOps platform), and they help you manage data assets, offer Google-like search interfaces, and make it easier to manage data lineage.

Alation - https://www.alation.com/solutions/data-governance/ 

Atlan - https://atlan.com/
Zaloni (Arena) - https://www.zaloni.com/arena-overview/ 

Collibra - https://www.collibra.com/

Talend - https://www.talend.com/

Data Management Discipline with Python

The 4 C’s

Completeness — How much of the data that’s expected is there? Are the ‘important’ columns filled out? 

Consistency — Are data values the same across data sets?
Conformity — Does the data comply with specified formats?
Curiosity — Are stakeholders knowledgeable and/or engaged in the data management lifecycle?
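As a hedged sketch, the first three C's could be checked with pandas along these lines (the file names, column names and the date format are assumptions for illustration):

import pandas as pd

df = pd.read_csv("raw_data.csv")

# Completeness: how filled out are the 'important' columns?
print(df[["customer_id", "order_date"]].notna().mean())

# Consistency: do the values match a reference data set?
reference = pd.read_csv("customer_master.csv")
print(df["customer_id"].isin(reference["customer_id"]).mean())

# Conformity: does the data comply with a specified format (here YYYY-MM-DD)?
print(df["order_date"].str.match(r"^\d{4}-\d{2}-\d{2}$").mean())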

The 4 V’s

Volume — How much information is present?
Velocity — What’s the frequency of incoming data? How often is the data processed and propagated? 

Variety — What types of data exist? Is the data structured, unstructured or semi-structured?
Veracity — Is the data trustworthy? What are the inherent discrepancies?

import pandas as pd
import pandas_profiling

df = pd.read_csv("raw_data.csv")
df.describe()

profile = pandas_profiling.ProfileReport(df)
profile.to_file("profile.html")

https://towardsdatascience.com/automate-your-data-management-discipline-with-python-d7f3e1d78a89


Thursday, May 27, 2021

Data Science Life Cycle

Today in many organizations, especially where data projects are being rolled out or data is being sold as a product, one of the important aspects to be considered is the data science life cycle. This becomes very important when one is trying to use insights to drive the business forward. This could be in the following areas; I just picked a sample.

1. Revenue Generation
2. Customer Experience
3. Customer attrition
4. Increase Engagement

Here are some of the key Steps in a DS Life cycle:

1. Identification of Data Sources. Data (Structured, Unstructured)
2. Data Extraction
3. Data Storage (Persistence of Data), what type of Data storage - On Prem, Public Cloud, Private Cloud or Hybrid Cloud.
4. Data Engineering - Data profiling, data quality checks, cleansing of data, data transformation, and handling of bad data from the source.
5. Data Science Layer - This includes the following:
    a. Data Transformation - Addition of categorical variables (see the sketch after this list)
    b. Feature Engineering - Enhance the data set to make it more meaningful - requires domain expertise.
    c. Model Training, Testing and Validation
    d. Deployment of Models to Production
    e. Documentation of Models
    f. All of the above steps could be iterative as new data keeps coming in
6. Data Delivery - How data is going to be sent to end users, stakeholders
   a. Method of Data Delivery - Web Services, Visualizations, Reports
   b. Actionable insights - There needs to be action taken on the insights provided. Today in organizations there is an issue of these insights not even being looked at; how an organization goes about this is very important.
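As a hedged illustration of steps 5a and 5b, here is a minimal pandas sketch of adding categorical variables and a simple engineered feature (the column names and bucket ranges are made up for the example):

import pandas as pd

df = pd.DataFrame({
    "loan_amount": [12000, 30000, 8000],
    "region": ["east", "west", "east"],
})

# Step 5a: turn the 'region' column into one-hot encoded categorical variables.
df = pd.get_dummies(df, columns=["region"], prefix="region")

# Step 5b: a simple engineered feature - bucket the loan amount into ranges.
df["loan_bucket"] = pd.cut(df["loan_amount"],
                           bins=[0, 10000, 25000, 50000],
                           labels=["small", "medium", "large"])
print(df)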

Data Lineage/Governance - This step covers all aspects of the steps mentioned here. It includes the data catalog, metadata repository, data value chains, and capturing the important data assets: how data is represented across the enterprise.




Thursday, April 22, 2021

Data Ingestion-Hive/Impala/HUE

With the emergence of big data and cloud technologies, there is a major push to move data from legacy systems to the cloud. The reasons for the migration are to provide better data access, allow for more advanced analytics, and deliver better insights and a better customer experience. This type of migration also provides more self-service capabilities. One of the initiatives I have been involved with is moving data from Oracle databases to Hadoop. There are a lot of pieces involved in moving data from an RDBMS to Hadoop. Here are some of the steps involved, along with the tools used, that I would like to share.

1. Extraction of data from the RDBMS to HDFS - What tools are available? Sqoop works for pulling data from an RDBMS to HDFS; one thing to watch out for is performance, which depends on how big the tables are. The data is extracted in the form of text files, so it is good to be clear about what type of data needs to be on HDFS. There are other tools like Zaloni (Arena) and Atlan, some of which work well with public cloud providers.

2. Design the structures where the files will be stored - The approach I am using is to utilize Hive, where we have different schemas depending on the classification of the data. Create tables in these schemas and pull in the data using the text files created in Step 1. There are two main types of Hive tables: internal and external. Hive supports SQL; it is very similar to SQL in an RDBMS with a few variations. There is a UI available called HUE (Hadoop User Experience). It is web based, so with the right permissions one can connect to Hive, log in, view the schemas/tables that have been created, write SQL-type queries and view the results. Hive is an Apache project, whereas Impala comes from Cloudera. The difference is that Hive gives you both metadata and query capability, whereas with Impala you mainly work with queries.

3. Typically when you first load data into Hive, the files are in plain text format, and these text files are loaded into tables. When ready to make the data available for processing, I decided to create separate schemas where the data is still stored in tables but in Parquet format. Parquet is a columnar format and allows for compression; it works well for analytics, and it is what we use in our installations (see the sketch after the link below). There are other big data formats that can be used:

Avro, Parquet, ORC, see the link below for comparisons.

https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/
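As a hedged sketch of Steps 2 and 3, here is how the text-to-Parquet conversion might look when driven from Python with the PyHive package; the schema, table and HDFS path names are assumptions for illustration, and the same statements could be run directly in the HUE editor:

from pyhive import hive

# Connect to HiveServer2 (host and port are placeholders for your cluster).
conn = hive.Connection(host="hive-server", port=10000)
cursor = conn.cursor()

# Step 2: an external table over the text files landed on HDFS in Step 1.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.orders_txt (
        order_id INT, customer_id INT, amount DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/staging/orders'
""")

# Step 3: a table in a separate schema, stored in Parquet format.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS curated.orders
    STORED AS PARQUET
    AS SELECT * FROM staging.orders_txt
""")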

4. Once Steps 1 through 3 are completed, you are in a position to start provisioning the data; typically this data can feed into a data lake. The data lake can be on premises, or it can be in a private or public cloud, based on the business need.

Monday, April 5, 2021

Data Processing Unified - Omniscope

There are a lot of tools coming out in the market related to data management, analytics and visualization. It is becoming harder for data practitioners to choose tools, so it is natural to expect consolidation into tools that handle data sourcing, management, transformation and visualization. One such tool that I got to see demonstrated is Omniscope. To see what the tool offers, please see here: https://visokio.com/. The tool's goal is to unify the following components of a data journey:

  1. Data Sourcing
  2. Data Blending - This is the step where you combine data from different data sources. It is very handy when, say, you are trying to combine customer information that lives on a different system from borrower/loan information. This is just one example; you will have many scenarios where you have to perform this step.
  3. Data Transformation - This is the step where you take the source or the blended data and add additional attributes/calculations based on your end business goal.
  4. Data Lineage - The functionality that provides the journey of a data element as it flows from source to different target systems. This helps in also understanding how a data element is used in different calculations/reports.
  5. Data Visualization - The layer that displays the data in different forms (like graphs, charts, plots)

The tool helps data curators, scientists and analysts maintain the pipeline of data coming through and control how it is consumed by users. There are plenty of visualizations available in this tool and the color schemes used are very good. The tool also attempts to maintain the lineage of the data used for a report, thereby providing an opportunity for end-to-end governance. In addition to all of this, there are features related to scheduling and automation. Please refer to the link above for a more in-depth look at the features of this tool.


Wednesday, March 17, 2021

Data Catalog/Documentation

One of the key aspects of Data Governance is the ability to have very good data documentation. Here we are discussing the nature of the data being collected. It is very important to have this information documented, as it can help determine how the data can be used in various applications, including building machine learning models. I am including here a very good blog post written by Prukalpa Shankar (co-founder of Atlan). In this article the 5W1H framework is discussed extensively. What is the 5W1H?

  1. WHAT
  2. WHO
  3. WHY
  4. WHEN
  5. WHERE
  6. HOW

https://towardsdatascience.com/data-documentation-woes-heres-a-framework-6aba8f20626c

Excellent insights are provided in the above post, please read and Keep Learning.