Saturday, July 3, 2021

Data Governance - Data Management

Recently I had the opportunity to present at Datavader, a Data Science Learning Portal run by Shruthi Pandey. The presentation was on Data Governance. It was a fun and interactive session with good participation, given the importance of the topic, and we had a Q&A session at the end. I am attaching some of the content from my presentation deck. It is a simple compilation of articles and diagrams from different blogs and websites, meant to convey the importance of Data Governance.

Datavader link : https://datavader.circle.so/home

Best Data Governance Tools

Here are a few Data Governance tools. These are just a sample.

All of the tools below share the following:

They have a common theme: to provide a modern Data Governance platform and Data Catalog platform.

They can connect to the following platforms: AWS, Azure, Google Cloud, and Snowflake (a cloud data platform).

They help you manage data assets, offer Google-like user interfaces, and make data lineage easier to manage.

Alation - https://www.alation.com/solutions/data-governance/ 

Atlan - https://atlan.com/
Zaloni (Arena) - https://www.zaloni.com/arena-overview/ 

Collibra - https://www.collibra.com/

Talend - https://www.talend.com/

Data Management Discipline with Python

The 4 C’s

Completeness — How much of the data that’s expected is there? Are the ‘important’ columns filled out? 

Consistency — Are data values the same across data sets?
Conformity — Does the data comply with specified formats?
Curiosity — Are stakeholders knowledgeable and/or engaged in the data management lifecycle?
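
The first three C's can be turned into simple numeric checks. Below is a minimal sketch with pandas, not a definitive implementation. The helper names (completeness, conformity, consistency), the sample tables and the email pattern are all made up for illustration. Curiosity is about people rather than data, so it is left out.

```python
import pandas as pd


def completeness(df: pd.DataFrame) -> pd.Series:
    """Fraction of non-null values per column (1.0 means fully populated)."""
    return df.notna().mean()


def conformity(series: pd.Series, pattern: str) -> float:
    """Fraction of non-null values that fully match a regular expression."""
    values = series.dropna().astype(str)
    if values.empty:
        return 1.0
    return float(values.str.fullmatch(pattern).mean())


def consistency(left: pd.DataFrame, right: pd.DataFrame, key: str, column: str) -> float:
    """Fraction of shared keys whose `column` value agrees in both data sets.

    Assumes `column` exists in both frames, so the merge adds the suffixes.
    """
    merged = left.merge(right, on=key, suffixes=("_left", "_right"))
    if merged.empty:
        return 1.0
    return float((merged[f"{column}_left"] == merged[f"{column}_right"]).mean())


# Hypothetical sample data, only for illustration
customers = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "email": ["a@x.com", None, "bad-email", "d@x.com"],
    "country": ["US", "IN", "UK", "US"],
})
billing = pd.DataFrame({"id": [1, 2, 3], "country": ["US", "IN", "GB"]})

# A deliberately loose email pattern (an assumption, not a standard)
EMAIL_PATTERN = r"[^@\s]+@[^@\s]+\.[^@\s]+"

print(completeness(customers))
print(conformity(customers["email"], EMAIL_PATTERN))
print(consistency(customers, billing, "id", "country"))
```

Each helper returns a ratio between 0 and 1, so the results can be tracked over time or compared against a threshold agreed with the data owners.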

The 4 V’s

Volume — How much information is present?
Velocity — What’s the frequency of incoming data? How often is the data processed and propagated? 

Variety — What types of data exist? Is the data structured, unstructured or semi-structured?
Veracity — Is the data trustworthy? What are the inherent discrepancies?
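
Some of the V's can be roughly measured on a single table. The sketch below uses plain pandas and makes a few assumptions: the helper names and the sample data are hypothetical, and counting duplicate rows is only one small signal for Veracity. Velocity is not covered, since it depends on how often data arrives and cannot be read from one snapshot.

```python
import pandas as pd


def volume(df: pd.DataFrame) -> dict:
    """Row count, column count and approximate in-memory size."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
    }


def variety(df: pd.DataFrame) -> dict:
    """Number of columns per data type."""
    return df.dtypes.astype(str).value_counts().to_dict()


def duplicate_rows(df: pd.DataFrame) -> int:
    """Count of rows that exactly repeat an earlier row."""
    return int(df.duplicated().sum())


# Hypothetical sample data, only for illustration
sample = pd.DataFrame({
    "id": [1, 2, 2],
    "name": ["a", "b", "b"],
    "score": [0.5, 1.5, 1.5],
})

print(volume(sample))
print(variety(sample))
print(duplicate_rows(sample))
```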

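import pandas as pd
import pandas_profiling  # newer releases of this package are published as ydata_profiling

# Load the raw data and print basic summary statistics
df = pd.read_csv("raw_data.csv")
print(df.describe())

# Build a profiling report and save it as an HTML file
profile = pandas_profiling.ProfileReport(df)
profile.to_file("profile.html")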

https://towardsdatascience.com/automate-your-data-management-discipline-with-python-d7f3e1d78a89

