Saturday, July 31, 2021
Kafka - Event Processing
Saturday, July 3, 2021
Data Governance - Data Management
Recently I had the opportunity to present at Datavader, a Data Science Learning Portal run by Shruthi Pandey. The presentation was on Data Governance. It was a fun and interactive session with good participation, given the importance of the topic, and we had a Q&A session at the end. I am attaching some of the content from my presentation deck. It is a simple compilation of articles and diagrams from different blogs and websites, meant to convey the importance of Data Governance.
Datavader link: https://datavader.circle.so/home
Best Data Governance Tools
Here are a few Data Governance tools. These are just a sample.
All of the tools below have the following in common:
They share a common theme: providing a modern Data Governance platform and Data Catalog platform.
They can connect to the following platforms: AWS, Azure, Google Cloud, and Snowflake (a cloud data platform).
They help you manage data assets, offer Google-like user interfaces, and make data lineage easier to manage.
Alation - https://www.alation.com/solutions/data-governance/
Atlan - https://atlan.com/
Zaloni (Arena) - https://www.zaloni.com/arena-overview/
Collibra - https://www.collibra.com/
Talend - https://www.talend.com/
Data Management Discipline with Python
The 4 C’s
Completeness — How much of the data that’s expected is there? Are the ‘important’ columns filled out?
Consistency — Are data values the same across data sets?
Conformity — Does the data comply with specified formats?
Curiosity — Are stakeholders knowledgeable and/or engaged in the data management lifecycle?
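The first three C's can be measured directly on a data set. Below is a minimal sketch using pandas. The two data frames, their column names, and the simple email pattern are all hypothetical and exist only for illustration; real checks would depend on your own schema and format rules. Curiosity concerns people rather than data, so it is not covered here.

```python
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@example.com", None, "c@example.com", "not-an-email"],
    "country": ["US", "IN", "US", "DE"],
})
orders = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "country": ["US", "IN", "CA", "DE"],
})

# Completeness: fraction of non-null values in each column.
completeness = customers.notna().mean()

# Consistency: do both data sets agree on the country of each customer?
merged = customers.merge(orders, on="customer_id", suffixes=("_cust", "_ord"))
inconsistent = merged[merged["country_cust"] != merged["country_ord"]]

# Conformity: share of non-null emails matching a (deliberately simple) pattern.
emails = customers["email"].dropna()
conformity = emails.str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$").mean()

print(completeness)
print(inconsistent[["customer_id", "country_cust", "country_ord"]])
print(f"email conformity: {conformity:.2f}")
```

Each check produces a number or a small table that can be tracked over time, which is usually more useful than a one-off pass or fail.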
The 4 V’s
Volume — How much information is present?
Velocity — What’s the frequency of incoming data? How often is the data processed and propagated?
Variety — What types of data exist? Is the data structured, unstructured or semi-structured?
Veracity — Is the data trustworthy? What are the inherent discrepancies?
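As a rough illustration, each of the V's can be approximated with a one-line pandas check. This is only a sketch: the event log below, its column names, and the choice of duplicate rows as a veracity signal are assumptions made for the example, not a standard definition.

```python
import pandas as pd

# Hypothetical event log; names and values are illustrative only.
events = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2021-07-01 10:00", "2021-07-01 10:05",
        "2021-07-01 10:10", "2021-07-01 10:10",
    ]),
    "sensor": ["a", "b", "a", "a"],
    "reading": [1.5, 2.0, 3.1, 3.1],
})

# Volume: number of rows and columns.
n_rows, n_cols = events.shape

# Velocity: median gap between consecutive events.
median_gap = events["event_time"].sort_values().diff().median()

# Variety: how many columns there are of each data type.
dtype_counts = events.dtypes.astype(str).value_counts()

# Veracity: exact duplicate rows as one rough signal of discrepancies.
n_duplicates = int(events.duplicated().sum())

print(n_rows, n_cols, median_gap, n_duplicates)
print(dtype_counts)
```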
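import pandas as pd
import pandas_profiling

# Load the raw data and print summary statistics
df = pd.read_csv("raw_data.csv")
print(df.describe())

# Build a profiling report and save it as HTML
profile = pandas_profiling.ProfileReport(df)
profile.to_file("profile.html")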
https://towardsdatascience.com/automate-your-data-management-discipline-with-python-d7f3e1d78a89