Sunday, August 29, 2021

Exploratory Data Analysis - EDA

One of the key aspects of getting into AI/ML is to understand the data: what the data is trying to represent, and whether it is a good sample for the business problem one is trying to solve. To answer these questions, there needs to be a scientific way to understand the data, and this is where Exploratory Data Analysis (EDA) comes into the picture. EDA can be done with specialized tools, but it is also possible to start getting to know the patterns with something as familiar as Excel. One need not feel the pressure of not knowing tools such as Python or R. You can learn them gradually and grow into the techniques they offer.

The first step is to get the data from a trusted source and make sure that source is relevant to the business problem we are trying to solve. The data can come in different forms such as text files, CSV files, Excel spreadsheets, logs or relational databases. Once you identify the source, choose a platform you want to bring the data into. For example, let us assume a spreadsheet is how you want to analyse the data.

1. Get the data formatted into a spreadsheet. If you are using a tool like Python, read the data into a DataFrame using pandas.
2. Understand the business problem you are trying to solve, and get to know the domain area of the business.
3. Review the data points, trying as much as possible not to be biased.
4. Discard data points that are redundant or would not offer any value. In case you have doubts, ask the SME or folks who know the business/data.
5. Identify how many observations are missing. If a lot of data is missing, discuss it with the owner of the data source and get valid data.
6. A certain rate of missing values is acceptable; some say 5%. Identify the data attributes that have missing values and check them against that margin.
7. Begin to classify the data. One can use pivot tables/charts in Excel, or pandas in Python, to start grouping/categorizing the data.
8. Document the trends. There are techniques like First Principles thinking (where one makes no prior assumptions and starts to analyze the data from scratch).
9. Review whether the data is balanced. For example, suppose you are building a batch prediction model based on the successes/failures of daily jobs. If more than 80% of the observations are successes, or more than 80% are failures, the data is skewed. Imbalance in data can cause models to generate bad predictions.
10. Add features to the data, meaning additional data points that enhance it, such as categorical variables.
11. Once you have documented all your findings, you can proceed to the next step of determining the type of model needed.
Here is a link to First Principles thinking
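
For readers who want to try the steps above in Python, here is a minimal sketch using pandas. It is only an illustration under stated assumptions: the small inline CSV, the column names (job_id, run_date, duration_min, status) and the duration bands are all made up for this example, and the 5% and 80% thresholds are simply the rules of thumb mentioned in steps 6 and 9. Your own data and cut-offs will differ.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a real export of daily batch-job results.
CSV_TEXT = """job_id,run_date,duration_min,status
J1,2021-08-01,12.5,success
J2,2021-08-01,,success
J3,2021-08-02,40.0,failure
J4,2021-08-02,11.0,success
J5,2021-08-03,9.5,success
J6,2021-08-03,13.0,success
J7,2021-08-04,,success
J8,2021-08-04,10.0,success
J9,2021-08-05,55.0,success
J10,2021-08-05,12.0,success
"""

# Step 1: read the data into a DataFrame (a file path works the same way).
df = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=["run_date"])

# Steps 5-6: share of missing values per column, checked against the 5% rule of thumb.
MISSING_THRESHOLD = 0.05  # assumption taken from step 6
missing_share = df.isna().mean()
columns_over_threshold = missing_share[missing_share > MISSING_THRESHOLD].index.tolist()

# Step 7: group the data, much like a pivot table in Excel.
runs_per_day = df.groupby(["run_date", "status"]).size().unstack(fill_value=0)

# Step 9: check class balance; more than 80% in one class is treated as skewed.
BALANCE_THRESHOLD = 0.80  # assumption taken from step 9
class_share = df["status"].value_counts(normalize=True)
is_skewed = bool(class_share.max() > BALANCE_THRESHOLD)

# Step 10: add a categorical feature derived from a numeric column.
# The band edges (10 and 30 minutes) are arbitrary and purely illustrative.
df["duration_band"] = pd.cut(
    df["duration_min"],
    bins=[0, 10, 30, float("inf")],
    labels=["short", "medium", "long"],
)

print("Share of missing values per column:")
print(missing_share)
print("Columns above the missing-value threshold:", columns_over_threshold)
print("Runs per day by status:")
print(runs_per_day)
print("Share of each status:")
print(class_share)
print("Data looks skewed:", is_skewed)
print(df[["job_id", "duration_min", "duration_band"]])
```

The same checks can be done in a spreadsheet with filters and pivot tables; the code just makes them repeatable. Rows with a missing duration are left without a band on purpose, so the gap stays visible instead of being silently filled in.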