Saturday, July 31, 2021

Kafka - Event Processing

The traditional data world, which started taking shape in the 1980s and became very popular through the 2000s, was built on the concept of storing data as entities with structured schemas.
The field of Business Intelligence rose exponentially during this period and was the main driver of analytics; reporting and dashboards grew along with it. The infrastructure behind all of this was mainly servers, data centers, and relational databases, with web farms and clustering used to scale operations. With this type of data structure, tracking changes to data or entities meant relying on triggers and stored procedures. This was fine initially for providing information to business stakeholders, but as data volumes grew and requirements became more complex and time sensitive, there was a need to move to a more scalable architecture. With the advent of big data technologies, one technology has grown steadily in popularity and usage: Kafka.
Here are some basic concepts in Kafka; there are plenty of online tutorials with more in-depth explanations.
Producer - An application that sends message records (data) to Kafka as an array of bytes.
For example, to send all records in a table as messages, you would collect the result of a query and send each row as a message.
To do this, you create a Producer application.
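As a minimal sketch (not a production setup), this is what such a Producer application could look like in Python with the kafka-python library; the broker address localhost:9092 and the topic name global-orders are illustrative assumptions:

from kafka import KafkaProducer

# Connect to the Kafka broker (address is an assumption for this sketch)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Each message is an array of bytes, e.g. one row from a query result
for row in [b"order-1,100.00", b"order-2,250.50"]:
    producer.send("global-orders", value=row)

producer.flush()  # block until all buffered messages are delivered
producer.close()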
Consumer - An application that receives the data. The Producer sends data to the Kafka server, and the application requesting data from it is a Consumer.
Producer -> Kafka Server -> Consumer
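A matching Consumer sketch, under the same illustrative assumptions as the Producer above:

from kafka import KafkaConsumer

# Subscribe to the topic and start from the earliest available message
consumer = KafkaConsumer(
    "global-orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)

for message in consumer:
    # message.value is the raw byte payload sent by the Producer
    print(message.topic, message.partition, message.offset, message.value)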
Broker - The Kafka server itself; it acts as a broker between Producer and Consumer.
Cluster - Can contain multiple brokers
Topic - A name given to a data set/stream.
For example, a topic could be called Global Orders.
Partition - A single broker can have challenges storing large amounts of data, so Kafka can break a topic into partitions. How many partitions a topic needs is a decision we have to make for each topic. Every partition sits on a single machine.
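The partition count is set when the topic is created. Here is a sketch using kafka-python's admin client; the count of 3 is an arbitrary illustrative choice:

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Break the topic into 3 partitions (replication factor 1 suits a single-broker setup)
admin.create_topics([
    NewTopic(name="global-orders", num_partitions=3, replication_factor=1)
])
admin.close()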
Offset - The sequence number of a message within a partition. Offsets start from 0 and are local to the partition, so locating a message requires the topic name, partition number, and offset number.
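For example, kafka-python can read a message at an exact (topic, partition, offset) address; partition 0 and offset 42 below are placeholder values:

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
tp = TopicPartition("global-orders", 0)  # topic name + partition number
consumer.assign([tp])                    # attach to that specific partition
consumer.seek(tp, 42)                    # jump to offset 42 within the partition

record = next(iter(consumer))            # the next message read is at offset 42
print(record.offset, record.value)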

Consumer Group - A group of consumers in which the members share the work. Consider a retail chain with billing counters: each billing location runs a Producer that sends messages, and consumers receive those messages. We create a cluster and partition the topic; each consumer in the group is then assigned a subset of the partitions to read from, as sketched below.
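A sketch of this sharing behavior with kafka-python: every copy of this script started with the same (hypothetical) group id billing-consumers becomes one more member of the group, and Kafka splits the topic's partitions across the members.

from kafka import KafkaConsumer

# Consumers started with the same group_id share the topic's partitions;
# think of each running copy as one billing counter's worker.
consumer = KafkaConsumer(
    "global-orders",
    bootstrap_servers="localhost:9092",
    group_id="billing-consumers",  # same group id => work is shared
)

for message in consumer:
    print(f"partition {message.partition}: {message.value}")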


These are some basic concepts in Kafka...


Saturday, July 3, 2021

Data Governance - Data Management

Recently I had the opportunity to present at Datavader, a data science learning portal run by Shruthi Pandey. The topic of the presentation was Data Governance; it was a fun and interactive session with good participation, given the importance of the topic. We had a Q&A session at the end of the presentation. I am attaching some of the content from my presentation deck; it is a simple compilation of articles and diagrams from different blogs and websites to convey the importance of Data Governance.

Datavader link : https://datavader.circle.so/home

Best Data Governance Tools

Here are a few Data Governance tools; this is just a sample:

The tools below have a common theme: they all aim to provide a modern-day Data Governance and Data Catalog platform. They can connect to platforms such as AWS, Azure, Google Cloud, and Snowflake (a DataOps platform), and they help you manage data assets, offer Google-like user interfaces, and make it easy to manage data lineage.

Alation - https://www.alation.com/solutions/data-governance/ 

Atlan - https://atlan.com/
Zaloni (Arena) - https://www.zaloni.com/arena-overview/ 

Collibra - https://www.collibra.com/

Talend - https://www.talend.com/

Data Management Discipline with Python

The 4 C’s

Completeness — How much of the data that’s expected is there? Are the ‘important’ columns filled out? 

Consistency — Are data values the same across data sets?
Conformity — Does the data comply with specified formats?
Curiosity — Are stakeholders knowledgeable and/or engaged in the data management lifecycle?
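As a rough illustration, checks like completeness and conformity can be scripted in a few lines of pandas; the file name and column names below are hypothetical:

import pandas as pd

df = pd.read_csv("raw_data.csv")  # hypothetical input file

# Completeness: share of missing values in the 'important' columns
print(df[["customer_id", "order_date"]].isnull().mean())

# Conformity: do values comply with a specified format (YYYY-MM-DD dates)?
conforms = df["order_date"].astype(str).str.match(r"^\d{4}-\d{2}-\d{2}$")
print(f"{conforms.mean():.1%} of order_date values conform")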

The 4 V’s

Volume — How much information is present?
Velocity — What’s the frequency of incoming data? How often is the data processed and propagated? 

Variety — What types of data exist? Is the data structured, unstructured or semi-structured?
Veracity — Is the data trustworthy? What are the inherent discrepancies?

import pandas as pd
import pandas_profiling

# Load the raw data and get basic descriptive statistics
df = pd.read_csv("raw_data.csv")
df.describe()

# Generate a full profiling report and write it out as HTML
profile = pandas_profiling.ProfileReport(df)
profile.to_file("profile.html")

https://towardsdatascience.com/automate-your-data-management-discipline-with-python-d7f3e1d78a89