tag:blogger.com,1999:blog-24376517273706258182024-03-13T23:03:52.396-07:00Ram's BlogBlog about Business Intelligence,Data Science and Big Data Technologies.BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.comBlogger334125tag:blogger.com,1999:blog-2437651727370625818.post-68529065765644407972021-08-29T13:39:00.000-07:002021-08-29T13:39:29.418-07:00Exploratory Data Analysis - EDA<p> One of the key aspects of getting into AI/ML is to understand the data: what the data is trying to represent, and whether the data is a good sample for the business problem one is trying to solve. In order to answer these questions, there needs to be a scientific way to understand the data, and this is where Exploratory Data Analysis (EDA) comes into the picture. EDA can be done with specialized tools, but it is also possible to start getting to know the patterns with a tool such as Excel. One need not feel the pressure of not knowing tools such as Python or R; you can gradually learn and grow into the techniques available in those tools.</p><p>The first step is to get the data from a trusted source; make sure it is a source that is relevant to the business problem we are trying to solve. The data can come in different forms such as text files, CSV, Excel spreadsheets, logs or relational databases. Once you identify the source, choose a platform you want to bring the data into. For example, let us assume a spreadsheet is how you want to analyse the data.</p><div style="text-align: left;">1. Get the data formatted into a spreadsheet; if you are using a tool like Python, read the data into a DataFrame using pandas.<br />2. Understand the business problem you are trying to solve, and also get to know the domain area of the business.<br />3. Try as much as possible not to be biased when you review the data points.<br />4. Discard data points that are redundant and would not offer any value. In case you have doubts, ask the SME or folks who know the business/data.<br />5. 
Identify how many missing observations there are. If a lot of data is missing, discuss with the data source and get valid data.<br />6. A certain margin of missing values is acceptable; some say 5%. Identify those data attributes.</div><div style="text-align: left;">7. Begin to classify the data: one can use pivot tables/charts in Excel, or use pandas in Python to start grouping/categorizing data.</div><div style="text-align: left;">8. Document the trends; there are techniques like First Principles thinking (where one sets aside biases and starts to analyze the data from scratch).</div><div style="text-align: left;">9. Review whether the data is balanced. For example, say you are building a batch prediction model based on successes/failures of daily jobs; if there are more than 80% successes or 80% failures then the data is skewed. Imbalance in data can cause models to generate bad predictions.</div><div style="text-align: left;">10. Add features to the data, meaning add additional data points to enhance the data, like adding categorical variables.</div><div style="text-align: left;">11. 
Once you have documented all your findings, you can proceed to the next step of determining the type of model needed.</div><div style="text-align: left;">Here is a link to First Principles thinking:</div><div style="text-align: left;"><a href="https://jamesclear.com/first-principles">https://jamesclear.com/first-principles</a></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-67071926222367004832021-07-31T13:16:00.002-07:002021-07-31T13:16:27.928-07:00Kafka - Event Processing<div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;">The traditional data world, which started to come into the picture in the 1980s and became very popular during the 2000s, is built on the concept of storing data based on entities and structured schemas.</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;">The field of Business Intelligence rose exponentially during this period and was the main area driving analytics. Reporting and dashboards grew as BI flourished. The infrastructure driving this was mainly servers, data centers and relational databases, and in order to scale the operations there were concepts such as web farms and clustering. With this type of data structure, in case we had to track changes in data or entities, we relied on triggers and stored procedures. This was fine initially to provide such information to business and stakeholders. But as the amount of data grew and requirements became more complex and time sensitive, there was a need to move to a scalable architecture. 
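The missing-value, grouping and balance checks from the EDA steps above can be sketched with pandas. This is a minimal sketch: the job data, column names, and the commented file name are hypothetical stand-ins for illustration only.

```python
import pandas as pd

# Hypothetical daily-job data, standing in for a file read such as:
#   df = pd.read_csv("daily_jobs.csv")
df = pd.DataFrame({
    "job_type": ["etl", "etl", "report", "report", "etl"],
    "status":   ["success", "success", "success", "failure", "success"],
    "runtime":  [12.0, None, 8.5, 9.1, 11.2],
})

# Steps 5/6: percentage of missing observations per column (flag anything over ~5%)
missing_pct = df.isna().mean() * 100
print(missing_pct[missing_pct > 5])

# Step 7: classify/group the data, much like a pivot table in Excel
print(df.groupby("job_type")["status"].value_counts())

# Step 9: check for imbalance - 80% or more success (or failure) means the data is skewed
balance = df["status"].value_counts(normalize=True)
print(balance[balance >= 0.80])
```

The same checks scale unchanged from a five-row toy frame to a real extract; only the `DataFrame` construction changes.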
With the advent of big data technologies, one technology has grown in popularity and usage: Kafka.</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;">Here are some basic concepts in Kafka; there are lots of online tutorials with more in-depth explanations of Kafka.</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;"><b>Producer</b> - An application that sends a message record (data), an array of bytes.</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;">For example, all records in a table can be sent as messages - collect the result from a query and send each row as a message.</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;">You need to create a producer application.</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;"><b>Consumer</b> - An application that receives the data.</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;">The producer sends data to the Kafka server; a consumer is the application requesting data from it.</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;">Producer -> Kafka Server -> Sends data to consumer</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;"><b>Broker</b> - The Kafka server itself; it is a broker between producer and consumer.</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;"><b>Cluster</b> - Can contain multiple brokers.</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: 
Calibri;"><b>Topic</b> - A name given to a data set/stream.</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;">For example, a topic can be called Global Orders.</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;"><b>Partition</b> - A broker could have a challenge storing large amounts of data, so Kafka can break a topic into partitions. We need to decide how many partitions are needed for a topic. Every partition sits on a single machine.</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;"><b>Offset</b> - A sequence number of a message in a partition. Offsets start from 0 and are local to the partition. To access a message you need the topic name, the partition number and the offset number.</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><br /></div><div style="background-color: white; color: #222222; font-family: Calibri;"><div><span style="font-family: Calibri;"><b>Consumer Group</b> - A group of consumers; members of the group share the work.</span></div><div><span style="font-family: Calibri;">Retail chain</span></div><div style="padding-left: 27pt;"><span style="font-family: Calibri;">Billing counters</span></div><div style="padding-left: 54pt;"><span style="font-family: Calibri;">A producer for each billing location</span></div><div style="padding-left: 54pt;"><span style="font-family: Calibri;">Sends messages</span></div><div style="padding-left: 54pt;"><span style="font-family: Calibri;">Consumers will get the above messages</span></div><div style="padding-left: 54pt;"><span style="font-family: Calibri;">Create clusters and also create partitions.</span></div><div style="padding-left: 54pt;"><span style="font-family: Calibri;">Consumer groups can then access a set of partitions</span></div><div style="padding-left: 
54pt;"><br /></div></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;"><br /></span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;">These are some basic concepts in Kafka...</span></div><div style="background-color: white; color: #222222; font-family: Calibri;"><span style="font-family: Calibri;"><br /></span></div><div style="background-color: white; color: #222222; font-family: Calibri; padding-left: 54pt; text-align: left;"><br /></div>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-77707800685406152302021-07-03T07:06:00.006-07:002021-07-03T07:06:58.654-07:00Data Governance - Data Management<p>Recently I had the opportunity to present at Datavader, a Data Science Learning Portal run by Shruthi Pandey. The topic of the presentation was Data Governance; it was a fun and interactive session with good participation, given the importance of the topic. We had a Q&A session at the end of the presentation. I am attaching some of the content from my presentation deck; it is a simple compilation of articles and diagrams from different blogs and websites to convey the importance of Data Governance.</p><p>Datavader link: <a href="https://datavader.circle.so/home">https://datavader.circle.so/home</a></p><p>
</p><div class="page" title="Page 3">
<div class="section" style="background-color: rgb(94.510000%, 94.120000%, 91.760000%);">
<div class="layoutArea">
<div class="column">
<p><span style="color: #5271ff; font-family: Antonio; font-weight: 700;">Best Data Governance Tools
</span></p>
<p><span style="font-family: OpenSans; font-weight: 300;">Here are a few Data Governance tools; this is just a sample:
</span></p>
<p><span style="font-family: OpenSans; font-style: italic; font-weight: 300;">All of the tools below provide the following:<br />
The tools have a common theme - to provide a modern-day Data Governance platform and
</span></p>
<p><span style="font-family: OpenSans; font-style: italic; font-weight: 300;">Data Catalog platform.<br />
The tools have the ability to connect to the following platforms - AWS, Azure, Google Cloud,
Snowflake (a DataOps platform)
</span></p>
<p><span style="font-family: OpenSans; font-style: italic; font-weight: 300;">The tools below help you manage data assets, with Google-like user interfaces and easy management of data
lineage
</span></p>
<p><span style="font-family: CAGenerated;">Alation - https://www.alation.com/solutions/data-governance/ </span></p><p><span style="font-family: CAGenerated;">Atlan - https://atlan.com/<br />
Zaloni (Arena) - https://www.zaloni.com/arena-overview/ </span></p><p><span style="font-family: CAGenerated;">Collibra - https://www.collibra.com/
</span></p>
<p><span style="font-family: CAGenerated;">Talend - https://www.talend.com/
</span></p><p>
</p><div class="page" title="Page 4">
<div class="section" style="background-color: rgb(100.000000%, 100.000000%, 100.000000%);">
<div class="layoutArea">
<div class="column">
<p><span style="color: #8c52ff; font-family: Antonio; font-weight: 700;">Data Management </span><span style="color: #8c52ff; font-family: Antonio; font-weight: 700;">Discipline with Python
</span></p>
<p><span style="font-family: CAGenerated;">The 4 C’s
</span></p>
<p><span style="color: #5271ff; font-family: Antonio; font-weight: 700;">Completeness </span><span style="font-family: Antonio; font-weight: 700;">— How much of the data that’s expected is there? Are the ‘important’ columns filled out? </span></p><p><span style="color: #5271ff; font-family: Antonio; font-weight: 700;">Consistency </span><span style="font-family: Antonio; font-weight: 700;">— Are data values the same across data sets?<br />
</span><span style="color: #5271ff; font-family: Antonio; font-weight: 700;">Conformity </span><span style="font-family: Antonio; font-weight: 700;">— Does the data comply with specified formats?<br />
</span><span style="color: #5271ff; font-family: Antonio; font-weight: 700;">Curiosity </span><span style="font-family: Antonio; font-weight: 700;">— Are stakeholders knowledgeable and/or engaged in the data management lifecycle?
</span></p>
<p><span style="font-family: CAGenerated;">The 4 V’s
</span></p>
<p><span style="color: #ff914d; font-family: Antonio; font-weight: 700;">Volume </span><span style="font-family: Antonio; font-weight: 700;">— How much information is present?<br />
</span><span style="color: #ff914d; font-family: Antonio; font-weight: 700;">Velocity </span><span style="font-family: Antonio; font-weight: 700;">— What’s the frequency of incoming data? How often is the data processed and propagated? </span></p><p><span style="color: #ff914d; font-family: Antonio; font-weight: 700;">Variety </span><span style="font-family: Antonio; font-weight: 700;">— What types of data exist? Is the data structured, unstructured or semi-structured?
</span><span style="color: #ff914d; font-family: Antonio; font-weight: 700;">Veracity </span><span style="font-family: Antonio; font-weight: 700;">— Is the data trustworthy? What are the inherent discrepancies?
</span></p>
</div>
</div>
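Three of the four C's lend themselves to automated checks. Below is a small pandas sketch; the column names, sample values, and the expected `YYYY-MM-DD` date format are illustrative assumptions, not taken from the original deck.

```python
import pandas as pd

# Hypothetical orders data, for illustration only
df = pd.DataFrame({
    "order_id":   [1, 2, 3, 4],
    "country":    ["US", "us", "DE", None],
    "order_date": ["2021-01-05", "2021-01-06", "06/01/2021", "2021-01-07"],
})

# Completeness - how much of the expected data is there in the 'important' columns?
completeness = df[["order_id", "country"]].notna().mean()

# Consistency - is the same value coded the same way across rows?
codes = df["country"].dropna()
consistent = codes.nunique() == codes.str.upper().nunique()

# Conformity - does the data comply with the specified YYYY-MM-DD format?
conformity = pd.to_datetime(df["order_date"], format="%Y-%m-%d",
                            errors="coerce").notna().mean()

print(completeness["country"], consistent, conformity)
```

Curiosity, the fourth C, is about people and process, so it is the one that cannot be scripted.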
<div class="layoutArea">
<div class="column">
<p><span style="color: #ff5757; font-family: OpenSans; font-style: italic; font-weight: 700;">import pandas as pd<br />
import pandas_profiling<br />
df = pd.read_csv("raw_data.csv")<br />
df.describe()<br />
profile = pandas_profiling.ProfileReport(df)<br />
profile.to_file("profile.html")
</span></p>
</div>
<div class="column">
<p><span style="color: #53ac4d; font-family: OpenSans; font-style: italic; font-weight: 700;"><a href="https://towardsdatascience.com/automate-your-data-management-discipline-with-python-d7f3e1d78a89">https://towardsdatascience.com/automate-your-data-management-discipline-with-python-d7f3e1d78a89</a>
</span></p>
</div>
</div>
</div>
</div><br /><p></p></div></div></div></div><p></p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-7982181807700525102021-05-27T13:16:00.004-07:002021-05-27T13:16:37.144-07:00Data Science Life Cycle<p>Today in many organizations, especially where data projects are being rolled out or data is being sold as a product, one of the important aspects to be considered is the data science life cycle. This becomes very important when one is trying to use insights to drive the business forward. This could be in the following areas; I just picked a sample.</p><p style="text-align: left;">1. Revenue Generation<br />2. Customer Experience<br />3. Customer Attrition<br />4. Increased Engagement</p><p style="text-align: left;"><b><u>Here are some of the key steps in a DS life cycle:</u></b></p><div style="text-align: left;">1. Identification of data sources. Data (structured, unstructured)<br />2. Data Extraction</div><div style="text-align: left;">3. Data Storage (persistence of data) - what type of data storage: on prem, public cloud, private cloud or hybrid cloud.</div><div style="text-align: left;">4. Data Engineering - Data profiling, data quality checks, cleansing of data, data transformation.</div><div style="text-align: left;">Handling of bad data from the source.</div><div style="text-align: left;">5. Data Science Layer - This includes the following:</div><div style="text-align: left;"><span>    a. Data Transformation - Addition of categorical variables</span><br /></div><div style="text-align: left;"><span>    b. Feature Engineering - Enhance the data set to make it more meaningful; requires domain expertise.</span></div><div style="text-align: left;"><span>    c. Model Training, Testing and Validation</span></div><div style="text-align: left;"><span>    d. Deployment of Models to Production</span></div><div style="text-align: left;"><span>    e. 
Documentation of Models</span></div><div style="text-align: left;">    f. All of the above steps could be iterative as new data keeps coming in</div><div style="text-align: left;">6. Data Delivery - How data is going to be sent to end users and stakeholders</div><div style="text-align: left;">    a. Method of Data Delivery - Web services, visualizations, reports</div><div style="text-align: left;">    b. Actionable insights - There needs to be action taken on the insights provided. Today in organizations there is an issue of these insights not even being looked at. How an organization goes about this is very important.</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>Data Lineage/Governance</b> - This step would cover all aspects of the steps mentioned here. This includes a data catalog, metadata repository, data value chains, and capturing important data assets - how data is represented across the enterprise.</div><div style="text-align: left;"><br /></div><p style="text-align: left;"><br /></p><p style="text-align: left;"><br /></p><p style="text-align: left;"><br /></p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-461227272415195612021-04-22T13:37:00.005-07:002021-04-22T13:37:34.662-07:00Data Ingestion-Hive/Impala/HUE<p> With the emergence of big data and cloud technologies, there is a major push to get data moved from legacy systems to the cloud. The reason for the migration is to provide better data access, allow for more advanced analytics, and provide better insights and a better customer experience. This type of migration also provides more self-service capabilities. One of the initiatives that I have been involved with is moving data from Oracle databases to Hadoop. There are a lot of pieces involved in moving data from an RDBMS to Hadoop. Here are some of the steps involved that I would like to share, along with the tools involved.</p><p>1. 
Extraction of data from the RDBMS to HDFS - What tools are available? Sqoop works for pulling data from an RDBMS to HDFS; one thing to watch out for is performance - how big are the tables? The data is extracted in the form of text files, so it is good to be clear about what type of data needs to be on HDFS. There are other tools like Zaloni (Arena) and Atlan; some of these work well with public cloud providers.</p><p>2. Design the structures where the files will be stored - The approach I am using is to utilize Hive, wherein we have different schemas depending on the classification of data. Create tables in these schemas and pull in the data using the text files created in Step 1. There are 2 main types of Hive tables - internal and external. Hive supports SQL; it is very similar to SQL in an RDBMS with a few variations. There is a UI available called HUE (Hadoop User Experience). This is web based, so with the right permissions one can connect to Hive. Once you log in, you can view the schemas/tables that have been created in Hive, write SQL-type queries and view results. Hive is Apache based, whereas Impala is Cloudera based. The difference is that in Hive you have metadata and query capability, whereas in Impala you mainly work with queries.</p><p>3. Typically when you first load data into Hive, the files are text files, basically in text format. These text files are loaded into tables. When you are ready to make the data available for processing - I decided to create separate schemas where the data is still stored in tables, but in Parquet format. This allows for compression, and it is a columnar format. For analytics purposes this works well, and for our installations we use Parquet. 
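The text-to-Parquet conversion described in step 3 can be scripted. Here is a minimal sketch: the schema and table names (`raw.orders`, `curated.orders`) are hypothetical, and the commented connection snippet assumes the third-party PyHive package is available with a reachable HiveServer2 endpoint.

```python
def parquet_copy_sql(source_table: str, target_table: str) -> str:
    """HiveQL that copies a text-format table into a Parquet-format table."""
    return (
        f"CREATE TABLE IF NOT EXISTS {target_table} "
        "STORED AS PARQUET "
        f"AS SELECT * FROM {source_table}"
    )

# Against a live HiveServer2, this could be executed with, e.g., the PyHive package:
#   from pyhive import hive
#   cursor = hive.Connection(host="hive-host", port=10000).cursor()
#   cursor.execute(parquet_copy_sql("raw.orders", "curated.orders"))
print(parquet_copy_sql("raw.orders", "curated.orders"))
```

Keeping the raw (text) and curated (Parquet) schemas separate, as described above, means the same statement can be re-run per table as new extracts land.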
There are other big data formats that can be used:</p><p><b>Avro, Parquet and ORC - see the link below for comparisons.</b></p><p><a href="https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/" target="_blank">https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/</a><br /></p><p>4. Once Steps 1 through 3 are completed, you are in a position to start provisioning the data; typically this data can feed into a data lake. The data lake can be on premise, or it can be in a private or public cloud, based on the business need.</p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-32958527623252851562021-04-05T13:52:00.006-07:002021-04-06T10:12:59.359-07:00Data Processing Unified - Omniscope<p> There are a lot of tools coming out in the market related to data management, analytics and visualization. It is also becoming harder for data practitioners to choose tools, so it is natural that we expect to see consolidation of tools that handle data sourcing, management, transformation and visualization. One such tool that I got to see demonstrated is <b><u>Omniscope</u></b>. To see what the tool offers, please see here: <a href="https://visokio.com/" target="_blank">https://visokio.com/</a>. The tool's goal is to unify the following components of a data journey:</p><div style="text-align: left;"><ol style="text-align: left;"><li>Data Sourcing</li><li>Data Blending - <i><b>This is the step where you can combine data from different data sources. This will be very handy when, say, you are trying to combine customer information that is on a system different from borrower/loan information</b></i>. 
<b><i>This is one example; you will have other scenarios where you have to perform this step.</i></b></li><li>Data Transformation - <b><i>This is the step where you take the source or blended data and add additional attributes/calculations based on your end business goal.</i></b></li><li>Data Lineage - <b><i>The functionality that traces the journey of a data element as it flows from the source to different target systems. This also helps in understanding how a data element is used in different calculations/reports.</i></b></li><li>Data Visualization - <b><i>The layer that displays the data in different forms (like graphs, charts and plots).</i></b></li></ol></div><p>The tool helps data curators/scientists/analysts maintain the pipeline of data coming through and how it is to be consumed by users. There are plenty of visualizations available in this tool, and the color schemes used are very good. The tool also attempts to maintain lineage of the data used in a report, thereby providing an opportunity for end-to-end governance. In addition to all of this, there are features related to scheduling and automation. Please refer to the link above for a more in-depth look at the features of this tool.</p><p><br /></p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-77589581670643595982021-03-17T10:25:00.005-07:002021-03-17T10:25:23.718-07:00Data Catalog/Documentation<p>One of the key aspects of Data Governance is the ability to have very good data documentation. Here we are discussing the nature of the data being collected. It is very important to have this information documented, as it can help determine how the data can be used in various applications, and also in building Machine Learning models. I am including here a very good blog post written by <b>Prukalpa Shankar (Co-Founder of Atlan)</b>. 
In this article the 5W1H framework is discussed extensively. The 5W1H is:</p><div style="text-align: left;"><ol style="text-align: left;"><li>WHAT</li><li>WHO</li><li>WHY</li><li>WHEN</li><li>WHERE</li><li>HOW</li></ol></div><p><a href="https://towardsdatascience.com/data-documentation-woes-heres-a-framework-6aba8f20626c">https://towardsdatascience.com/data-documentation-woes-heres-a-framework-6aba8f20626c</a><br /></p><p>Excellent insights are provided in the above post; please read and keep learning.</p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-89852676094486437192021-02-26T10:25:00.000-08:002021-02-26T10:25:14.488-08:00Data Management-Agile Methodology<p> Data today is spread everywhere in organizations, and the rate at which data comes into an organization is increasing very rapidly, with various touch points collecting data. Technology is also evolving at a rapid pace, so it is becoming very important for organizations to streamline data ingestion and catalog the data appropriately. With data strategy being closely tied to business strategy, it has become very critical to deliver business value effectively and quickly. In past years a traditional approach was followed for data management; it had long lead times, and the end product was either delivered late or did not meet the requirements. Fast forward: we are now in the age of agile delivery, DevOps, and continuous integration and deployment, where the focus is to deliver business value incrementally. How do we tackle data projects in the agile world? It is not cut and dried in the data world: there are a lot of dependencies on source systems, and there are provisioning systems that need data within SLAs. 
In order to address some of these challenges, there are some key points that can be incorporated, plus tools that incorporate AI techniques can be used.</p><div style="text-align: left;">1. Leadership should embrace Agile top down for data projects, and there should be bottom-up feedback on how Agile is working for these projects.<br />2. Leaders/business partners should provide frameworks, remove roadblocks and build runways that help an organization adopt Agile. There should be a mindset to tear down methods that would not work in a modern enterprise; business and technology should come closer together to deliver solutions.</div><div style="text-align: left;">3. Collaboration should be nurtured; allow the business and technology conversations to happen. There will be role-specific responsibilities, but that should not be a roadblock to Agile adoption.</div><div style="text-align: left;">4. Budgeting of activities/work needs to change to adopt techniques like Activity Based Costing so that features/epics/deliverables can be funded accordingly.</div><div style="text-align: left;">5. Architecture needs to speed up adoption of latest-generation data management tools like Atlan, Arena and DQLabs in order to facilitate more efficient data ingestion and data profiling/quality, and to build effective lineage.</div><div style="text-align: left;">6. Availability of quality test data, or a framework to generate test data efficiently, is the real key to moving work along the agile pipeline and having work ready for deployment. A key part of this is the ability to obfuscate the data, especially when working with sensitive information.</div><div style="text-align: left;">7. One of the key aspects of modern data platforms is to have the metadata/catalog evolve alongside the data pipelines that are being built. In such scenarios the right set of data management tools can reduce technical debt. 
This is a very crucial aspect; identifying and handling it can limit the amount of work needed to fix data gap issues.</div><div style="text-align: left;">8. All of the above points need to come together so that you can evolve the data platform and data management in an agile way and match the speed of the business. Business, technology, architecture and stakeholders all have a role/responsibility in making Agile data management happen.</div>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-32483758884410427242021-02-11T05:59:00.003-08:002021-02-11T05:59:22.189-08:00Data Transformation for Cloud - dbt<p><span style="font-family: arial;"> In this blog post we will focus on loading data from a valid source into a cloud data platform like Snowflake. There are different tools available to do this; one of the tools that is gaining traction is dbt (<a href="https://docs.getdbt.com/">https://docs.getdbt.com/</a>). One of the main highlights of dbt is that it uses SQL for a lot of the data transformation/loading into a cloud data warehouse like Snowflake. There are certain additions that dbt layers on top of SQL that make it very flexible for ETL/ELT purposes. There are 2 ways to use dbt: one is the Command Line Interface, the other is dbt Cloud. There are a lot of configurations available which can be set up to make the data transformation process efficient and effective. The core concept in dbt is the model. dbt uses models extensively to create tables/views on the cloud data warehouse. The order in which the tables and views are created in the cloud data warehouse is taken care of by dbt using the concept of models. Models allow one to define the base objects and relationships.</span></p><p><span style="font-family: arial;">In order to connect to the different data sources, dbt provides adapters. 
These adapters allow dbt to connect to the data source and load data into the target cloud data warehouse. For a list of the adapters available, please check the following link: <a href="https://docs.getdbt.com/docs/available-adapters" target="_blank">https://docs.getdbt.com/docs/available-adapters</a>. The adapters are primarily for cloud data warehouses/data lakes like Snowflake, Redshift and BigQuery. In order to start using dbt one has to create a dbt project; to quote from dbt: <span style="color: #1c1e21;">A dbt project is a directory of </span><code style="border-radius: var(--ifm-code-border-radius); box-sizing: border-box; color: #1c1e21; margin: 0px; padding: var(--ifm-code-padding-vertical) var(--ifm-code-padding-horizontal);">.sql</code><span style="color: #1c1e21;"> and </span><code style="border-radius: var(--ifm-code-border-radius); box-sizing: border-box; color: #1c1e21; margin: 0px; padding: var(--ifm-code-padding-vertical) var(--ifm-code-padding-horizontal);">.yml</code><span style="color: #1c1e21;"> files, which dbt uses to transform your data. </span></span></p><p><span style="color: #1c1e21; font-family: arial;"><a href="https://docs.getdbt.com/docs/building-a-dbt-project/using-sources">https://docs.getdbt.com/docs/building-a-dbt-project/using-sources</a></span></p><p><span style="color: #1c1e21; font-family: arial;">Typically in an ETL/ELT operation there are some considerations that need to be taken into account for loading data:</span></p><div style="text-align: left;"><span style="font-family: arial;"><span style="color: #1c1e21;">1. Is the data load into the cloud data warehouse going to be a full refresh? If so, how many tables follow this loading type.<br /></span></span><span style="font-family: arial;"><span style="color: #1c1e21;">2. Is the data load into the cloud data warehouse going to be incremental? 
If so, how many tables follow this loading type.</span></span></div><div style="text-align: left;"><span style="font-family: arial;"><span style="color: #1c1e21;">3. Are there going to be materialized views that need to be created?</span></span></div><div style="text-align: left;"><span style="font-family: arial;"><span style="color: #1c1e21;">4. Is the warehouse going to have slowly changing dimension tables?</span></span></div><div style="text-align: left;"><span style="font-family: arial;"><span style="color: #1c1e21;">5. How are the relationships going to be defined?</span></span></div><div style="text-align: left;"><span style="font-family: arial;"><span style="color: #1c1e21;"><br /></span></span></div><div style="text-align: left;"><span style="font-family: arial;"><span style="color: #1c1e21;">Based on the above factors and the needs of the business, all of the above choices can be implemented in dbt. When one has a lot of data being sourced that needs to be used for analytic purposes, it is not possible to do full refreshes every day. One might have to look at loading the data incrementally to meet the SLAs and have improved performance. dbt provides source freshness checks to determine whether the data at the source has been updated and can be pulled into the data store. 
Quoting from the dbt website: "</span></span><span style="color: #1c1e21; font-family: system-ui, -apple-system, "Segoe UI", Roboto, Ubuntu, Cantarell, "Noto Sans", sans-serif, system-ui, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px;">This is useful for understanding if your data pipelines are in a healthy state, and is a critical component of defining SLAs for your warehouse."</span></div><div style="text-align: left;"><span style="color: #1c1e21; font-family: system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans, sans-serif, system-ui, Segoe UI, Helvetica, Arial, sans-serif, Apple Color Emoji, Segoe UI Emoji, Segoe UI Symbol;">I hope you find the information here useful; designing proper data load and transformation strategies is key to having good data pipelines.</span></div><p><span style="font-family: arial;"><span style="color: #1c1e21;"><br /></span></span></p><p><br /></p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-38786543979311883832021-02-08T06:55:00.004-08:002021-02-09T12:51:26.213-08:00Data Pipelines/Data Transformation<p><span style="color: #2b00fe;"> <span style="font-family: arial;">Data is available in abundance in organizations, more so in bigger companies. How to make good use of the data to get valuable insights and add business value is the key strategic question in companies today. Early on, lots of data warehouses and data marts were built using complex ETL techniques; then came the concept of ELT (Extract, Load and Transform), which was used to transport data. With the advent of big data and AI/ML techniques, there has been a continuous need to improve data integration in order to have successful data projects. One of the challenges companies have had is how to handle the flow of data in order to gain maximum value.
There have been approaches such as having one set of folks help with sourcing the data, another set transform it, and finally the data science folks figure out the value. This approach has created too many hand-off points and a lot of silos of expertise. To address this problem, a new concept has been adopted by a lot of organizations recently: the Data Pipeline.</span></span></p><p><span style="font-family: arial;"><span style="color: #2b00fe;"><b><u>What is a Data Pipeline</u></b>: It is the complete end-to-end process of getting the data from the source and building out the complete lifecycle (sourcing, transformation, data quality checks, AI/ML model generation and data visualization). Using this concept, resources are now engaged to manage the complete pipeline, and data scientists/analysts are encouraged to manage and/or own data pipelines. There are different tools available that help you manage the data pipeline; in some cases data pipelines are also referred to as workflows. The tools available today that help with these concepts are <b>dbt:</b><a href="http://getdbt.com" style="font-weight: bold;">getdbt.com</a><span><b> </b>(</span></span><span style="background-color: white; color: #2b00fe;">Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.)<b>, <i>please refer to </i></b></span></span><span style="color: #2b00fe; font-family: arial;"><b><i><a href="https://docs.getdbt.com/docs/introduction">https://docs.getdbt.com/docs/introduction</a>.</i></b></span></p><p><b style="color: #2b00fe; font-family: arial;"><u>Matillion</u>, <a href="https://www.matillion.com/">https://www.matillion.com/</a>, <u>Atlan </u>(Data Management Tool), <u>Arena</u> from Zaloni and a host of other tools like Collibra</b><span style="background-color: white; color: #2b00fe; font-family: arial;">.
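The lifecycle described above can be sketched as a chain of stages, each handing its output to the next. This is a toy illustration only, not any vendor's API; the sample data and the simple quality rule are invented for the example:

```python
def extract():
    # Hypothetical source; a real pipeline would read files or a database.
    return [{"customer": "a", "amount": "10"}, {"customer": "b", "amount": "-3"}]

def transform(rows):
    # Cast string amounts to integers, as a warehouse load step might.
    return [{**r, "amount": int(r["amount"])} for r in rows]

def quality_check(rows):
    # Drop rows failing a simple rule; real tools would quarantine them.
    return [r for r in rows if r["amount"] >= 0]

def run_pipeline():
    # Source -> transform -> data quality: the end-to-end flow in miniature.
    return quality_check(transform(extract()))

clean = run_pipeline()
```

The point of pipeline tooling is that these stages are declared, tracked and owned end to end rather than handed off between silos.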
The tools here help provide a complete end-to-end perspective of the data being used for different types of data projects. The tools mentioned above are cloud ready; with companies moving to the cloud to store all of their data, adoption of these tools should be much easier.</span></p><p><span style="color: #2b00fe;"><span style="background-color: white; font-family: arial;">When data scientists and/or analysts have exposure to the tools mentioned above, they get a complete perspective of the data and understand its lineage, which helps in building out better AI/ML models that can be used by their companies. A lot of AI/ML efforts fail because of bad data and an inability to understand lineage and dependencies. One of the key aspects to keep in mind while working on data projects is <b><u>Data Drift,</u> what does this mean?</b> Data is not static; it changes constantly, and even its structure can change. There can be changes in schema, and the granularity and volume of data can keep fluctuating. The tools mentioned earlier in this blog post help in understanding these changes to a great extent and in tuning the AI/ML models. There is a lot more research being done in the area of data drift and related tools; I will cover those in a different blog post.</span></span></p><p><span style="background-color: white; color: #2b00fe; font-family: times;"><br /></span></p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-18526077825983249642021-01-11T13:01:00.004-08:002021-01-11T13:01:34.200-08:00Data Changes/Data Quality-AI/ML<p> <span style="background-color: white;"><span style="color: #2b00fe;">With AI/ML getting increased adoption in organizations, data is fast becoming the most sought-after asset.
The flow of data is becoming more continuous, just like opening a tap and letting the water flow. One of the trends I have noticed in data projects is that once the desired goal is achieved, there is a tendency to think the work is done. This is where the problem starts: data is not static; it changes continually. When changes in data happen, they can affect the underlying structure or grain of the tables used in AI/ML projects. If these changes are not determined or tracked upfront, they can cause the AI/ML models to generate inaccurate results. So AI/ML models that were very accurate three months ago are not accurate anymore.</span></span></p><p><span style="background-color: white;"><span style="color: #2b00fe;">Tracking data changes, whether format changes coming from a vendor or schema changes at the source, is very time consuming, manual and inefficient. There is a very good opportunity to identify this as a problem area and see whether the gap can be addressed with automated solutions. With infrastructure and related components we have lots of utilities/dashboards that indicate when certain parameters are reaching a limit and adjustments are needed. Using the same analogy, we need tools that can tell data teams/business users about any changes happening in the data. This would help data teams spend more time on tuning/optimizing the data models and less time on manual/inefficient tasks. Handling data changes is also very critical for product management, given its increasing reliance on data.
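The simplest automated check of this kind compares two snapshots of a table's schema and reports what moved. A minimal sketch, assuming schemas are captured as column-to-type mappings (real tools would pull these from the catalog and add volume/granularity checks):

```python
def schema_changes(old_schema, new_schema):
    # Compare two column->type snapshots and report what changed.
    added = sorted(set(new_schema) - set(old_schema))
    removed = sorted(set(old_schema) - set(new_schema))
    retyped = sorted(c for c in set(old_schema) & set(new_schema)
                     if old_schema[c] != new_schema[c])
    return {"added": added, "removed": removed, "retyped": retyped}

# Hypothetical snapshots taken on consecutive days.
yesterday = {"id": "int", "amount": "float", "region": "varchar"}
today = {"id": "int", "amount": "varchar", "channel": "varchar"}
report = schema_changes(yesterday, today)
```

Running such a diff on a schedule and alerting on any non-empty result is the kind of automation the paragraph above argues for.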
This is a somewhat abstract post, but these are high-level patterns that I have seen in data-related projects.</span></span></p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-65166292800464489852020-12-15T11:58:00.006-08:002020-12-15T11:58:50.994-08:00AI/ML Framework<p>There is a lot of focus on Artificial Intelligence and Machine Learning in various organizations, which want to get better insights and value out of such efforts. Significant investments of time and money are made to get AI projects started and delivered. One key aspect to keep in mind: the success of such projects depends on sourcing the right data, having the right data governance and making sure such efforts align with the business goals. Given such an environment, it is imperative that a general framework is in place for how AI projects are executed, so that truly value-added deliverables are achieved. In the data world we have the following broad roles trying to make an AI project a success.</p><p><u><b>Possible Personas in a data project.</b></u></p><div style="text-align: left;">1. Data Engineers</div><div style="text-align: left;">2. Data Analysts</div><div style="text-align: left;">3. Data Scientists</div><div style="text-align: left;">4. AI/ML Developers, Model Creators</div><div style="text-align: left;">5. End users/Reviewers of Model for Compliance and regulations</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Each of the above personas is going to be involved in different stages of a data project and will utilize different tools to achieve the end goal.
When we talk about different stages, we can identify the following steps in any data project:</div><div style="text-align: left;"><br /></div><div style="text-align: left;">1.<b> Sourcing data </b>- <b><i><span style="color: #2b00fe;">Sample tools: IBM DataStage, Informatica; structured databases: Oracle, SQL Server, Teradata; unstructured data: Instabase</span></i></b></div><div style="text-align: left;">2. <b>Organize data/Metadata Management/Lineage Analysis/Cataloging </b>- <b><i>T<span style="color: #2b00fe;">rifacta/Alteryx for data wrangling/prep; data catalog/governance: Atlan/Collibra</span></i></b></div><div style="text-align: left;">3. <b>Build Model/Experiments/Analysis</b> - <b><i><span style="color: #2b00fe;">SAS, Jupyter (Python), R, H2O</span></i></b></div><div style="text-align: left;">4. <b>Quality Check, Deployments of ML Models </b>- <b><i><span style="color: #2b00fe;">Model frameworks such as H2O, Python, R</span></i></b></div><div style="text-align: left;">5. <b>Consumption of Data By end Users</b> - <b><i><span style="color: #2b00fe;">Tableau, Cognos, MicroStrategy</span></i></b></div><div style="text-align: left;"><br /></div><div style="text-align: left;">All of the above require a storage and consumption component; this could be <b><i>Hadoop/Spark, AWS/Snowflake, Azure or Google Cloud. </i></b></div><div style="text-align: left;"><br /></div><div style="text-align: left;">Organizing the different tool sets and personas provides an overview of what is available and how AI projects can be structured to deliver effective value.
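One way to make that organization concrete is to record each stage with its owning persona and candidate tools, and flag gaps before the project starts. The mapping below is illustrative only; the assignments are assumptions drawn from the lists above, not a prescribed standard:

```python
# Hypothetical stage -> (persona, sample tools) mapping for a data project.
STAGES = [
    ("source", "Data Engineer", ["DataStage", "Informatica"]),
    ("organize", "Data Analyst", ["Alteryx", "Collibra"]),
    ("model", "Data Scientist", ["Jupyter", "R", "H2O"]),
    ("deploy", "ML Developer", ["H2O", "Python"]),
    ("consume", "End User", ["Tableau", "Cognos"]),
]

def coverage_gaps(stages):
    # Flag any stage with no owning persona or no tooling assigned.
    return [name for name, persona, tools in stages if not persona or not tools]

gaps = coverage_gaps(STAGES)
```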
For example, if a financial institution has a strong need to consume unstructured data and derive data from it, one could look at leveraging the Instabase platform. Quoting from the site: <span style="background-color: #f4f4f4; color: #666666; font-size: 16px;"><span style="font-family: times;">The biggest opportunity for efficiency within a business is stifled by document processes that only humans can do. With Instabase, you can build workflows to access information in complex documents and automate business processes.</span></span> <a href="https://about.instabase.com/">https://about.instabase.com/</a>. </div><div style="text-align: left;">Providing a platform where all of these tools can be used, so that different personas have access and can move data from one point to another, would be of great help for any data project. This would allow consistency of operation, help in tracing data, provide better model validation, and ensure any audit/compliance requirements are met.</div><div style="text-align: left;">Successful data projects that include AI/ML have the above ingredients in the right mix, well tracked/cataloged; they also account for changes in data over time and closely align with the business objectives.</div><div style="text-align: left;"><b>Happy Holidays and a very happy, pandemic-free New Year 2021. Stay safe, everyone.</b></div><div style="text-align: left;"><br /></div><p><br /></p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-70708055787824439932020-12-07T12:28:00.002-08:002020-12-07T12:29:03.860-08:00Snowflake - UI Components<p>Snowflake is fast becoming a very important component of cloud migration strategies in the business/technology world today. In my discussions with different leaders and my participation in different conferences, I have seen a lot of interest in Snowflake.
One good thing is that there is a trial period for Snowflake, which one can sign up for on the Snowflake website. The sign-up process for the trial is very straightforward; once you have it set up, you are given the option to go through introductory material related to Snowflake, and there is some very good documentation covering its different areas. Once you sign into Snowflake you should see the following interface. Different components are listed in the interface; a lot of it looks very similar to the SQL Server Management Studio layout, especially the Object Explorer. The main components in the UI are:</p><div style="text-align: left;">1. Databases - Lists the databases in the Snowflake instance.<br />2. Shares - This is related to data sharing within your organization.<br />3. Data Marketplace - This option allows one to look at the Snowflake Data Marketplace, which lists the public data sources available in different categories like Government, Financial, Health and Sports.<br />4. Warehouses - Lists the warehouses available on the Snowflake instance.<br />5. Worksheets - The option where one can write SQL queries like the ones below. There is a sample database called DEMO DB which has a list of tables that can be used for querying.
These queries can also be used for building out dashboards, an option available in the Data Marketplace.</div><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-uz-5PFtRrDU/X86Go9DfkoI/AAAAAAAAInE/50yTisAlA0EL6LGArmP43eO_p0yA_xpdACLcBGAsYHQ/s2048/Screen%2BShot%2B2020-12-07%2Bat%2B2.45.45%2BPM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1076" data-original-width="2048" height="295" src="https://1.bp.blogspot.com/-uz-5PFtRrDU/X86Go9DfkoI/AAAAAAAAInE/50yTisAlA0EL6LGArmP43eO_p0yA_xpdACLcBGAsYHQ/w559-h295/Screen%2BShot%2B2020-12-07%2Bat%2B2.45.45%2BPM.png" width="559" /></a></div><div class="separator" style="clear: both; text-align: left;">One can also load data into Snowflake using the different data loading strategies available, which I discussed in an earlier blog post: <a href="https://www.blogger.com/blog/post/edit/2437651727370625818/3834832172078982453">https://www.blogger.com/blog/post/edit/2437651727370625818/3834832172078982453</a></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;"><br /></div><br /><p><br /></p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-13005193324821101882020-11-18T11:55:00.004-08:002020-11-18T11:55:52.800-08:00Data Cloud<p> One of the more commonly used buzzwords these days is Data Cloud; it has been used as a marketing term mainly in the cloud domain across different businesses/organizations. There is a concept underneath the term: it is mainly aimed at having data available in, or migrated to, public cloud offerings such as Amazon, Azure and Google Cloud.
One of the key projects undertaken by lots of organizations across different businesses is making data available in the cloud without compromising on data security. I had the opportunity to attend the Data Cloud Summit 2020 organized by Snowflake, a virtual event where the features of Snowflake were discussed in different sessions. There were also use cases presented by different customers and vendor partners on how they are utilizing Snowflake for their data projects and how much impact the product has had on their business. I picked up some interesting points from the different sessions and am listing them below; they cover a variety of topics related to data.</p><div style="text-align: left;">1. <b><i>Compute/Storage</i></b>: Snowflake separates compute from storage; this is one of the main concepts in the product, and many customers highlighted how it helps them in their daily data operations and business.<br />2. <b><i>Scalability</i></b> - The ability to ingest multiple workloads; this is a common requirement across all customers.</div><div style="text-align: left;">3. <b><i>Simplify</i></b>: Simplification of the data pipeline. How can one take raw data and turn it into actionable insights quickly? This is called the lapse time. One of the questions raised: does all the data transformation have to happen in the early hours of the morning, or can it be spread out or done in real time?</div><div style="text-align: left;">4. <b><i>Data Silos</i></b>: Breaking down data silos is a significant effort being undertaken by different organizations. Data silos have a direct/indirect negative impact on cost and efficiency. One of the reasons for using a product like Snowflake is to break down the data silos by having data in one place.
This allows better understandability and searchability of the data in an organization.</div><div style="text-align: left;">5. <b><i>Proof of Value</i></b>: Data cloud products or cloud offerings need to provide proof of value. It has to be tangible for the business: how does the investment in the cloud provide better results for the business?</div><div style="text-align: left;">6. <b><i>Orchestration</i></b>: Since the movement to cloud infrastructure is taking place at different paces, there needs to be better orchestration across multiple cloud installations. This can lead to better abstraction, and it is a challenge a lot of companies are facing today.</div><div style="text-align: left;">7. <b><i>Data is an Asset</i></b>: Data can be monetized by generating value for the business and reducing costs.</div><div style="text-align: left;">8. <b><i>Support:</i></b> Snowflake provides good, cost-effective support tools. Some customers explained how Snowflake's uptime has been very good in spite of the huge data loads coming into the system.</div><div style="text-align: left;">9. <b><i>Data</i></b>: What type of information needs to be sent out/provisioned? One of the guests mentioned two important aspects with respect to data: 1. the information a person needs to know, and 2. how the information will affect you.</div><p>Overall, a lot of information for a single-day event; I am sure each of the aspects mentioned above can lead to deeper discussions and/or projects.
The event provided an overall perspective of where things are headed in the data space and how companies are planning their work in the coming years.</p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-56135633460924185982020-11-09T10:01:00.005-08:002020-11-09T10:01:39.156-08:00Product Manager - Thoughts/Observations<p> One of the roles most talked about, discussed and in demand, especially in the technology world, is that of a product manager. There are lots of enquiries about and need for product managers; at the same time, the recent COVID crisis has challenged businesses, and hence a lot of product managers have also lost jobs. An interesting trend I have noticed is that product managers range from folks with a couple of years of experience to those with 10 years or more. It is a very broad spectrum, and hence a lot of questions are raised around who can be a good product manager. I have also noticed that some folks who want to become a product manager are the ones who do not want to code in certain areas of the business. Let me try to take a deeper look and pen down my observations. In some cases product manager roles have become glamorous, in the sense that it feels nice to say one is a PM.</p><div style="text-align: left;">Product manager, in my discussions with colleagues/professionals, is a very important and crucial role in an organization. The role is at the intersection of the following:</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>1. Business</b></div><div style="text-align: left;"><b>2. Technology</b></div><div style="text-align: left;"><b>3. Customers</b></div><div style="text-align: left;"><br /></div><div style="text-align: left;">So a person who is approaching a PM role needs to understand the dynamics of the above 3 components and how they work together.
What is the primary business of the organization, and what products does it have for customers? Secondly, what type of technology is used to build the products? Thirdly, who are your customers? In summary, one needs to understand the high-level picture and also the details behind what is being delivered.</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><u>Let us delve a little deeper into each of the components:</u></div><div style="text-align: left;"><ul style="text-align: left;"><li><b>Business</b> - Understand the business strategy for the company and the line of business. Get a gauge on the stakeholders; understand the budget/resources that could be available for your product in terms of development/research/maintenance; know the interfacing units and dependencies; and understand how the company is performing financially and what its target markets are.</li><li><b>Technology</b>: Understand the tools being used to develop the product: what type of vendor lock-in is there, or is it based on an open source architecture with less vendor lock-in? In terms of data, what are the data sources, and are they very disparate or well integrated? Are there opportunities to streamline the data? <b><i>One important aspect being experimented with today is whether product management can be totally data driven, with decisions justified by data. This is going to be even more important in a data/information-filled world.</i></b></li><li><b>Customers: </b>Get continuous feedback from customers by conducting surveys and talking to them about product usage and issues faced. Conduct usability studies and feed the findings back into the product backlog. Adopt an agile approach to building the product, collaborating with customers to get proper engagement.</li></ul><div>In summary, product manager is an exciting but challenging role, and it is imperative that one has the proper grooming/mentoring to get to a PM role.
There is a lot of temptation to cut corners (like "I won't do certain things...") to achieve it, but the consequences can be devastating and could erode self-confidence. It is best to have a plan of action and a set of goals, and to work with a mentor to achieve the results.</div></div>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-49401962933618329962020-10-25T13:44:00.000-07:002020-10-25T13:44:00.747-07:00Unlocking Insights-Data Quality<p> One of the main buzzwords we constantly hear is insights, or unlocking insights from data. This has been one of the main selling points when it comes to selling technology to the business. The sophistication of tools for unearthing insights is welcome; at the same time, what are the critical components for getting meaningful insights? One fundamental requirement for meaningful insights is a solid end-to-end data pipeline. For this, the following need to be in place:</p><div style="text-align: left;">1. <b>Data Governance/Lineage</b>.<br />2. <b>Metadata Quality/Entities</b>.<br />3. <b>Valid Domain Value Chain</b>.</div><div style="text-align: left;">4. Customer Data (Profile/Accounts/Interactions via different Channels).<br />5. <b>Data Quality</b> including Workflow/Validation/Data Test Beds/Deployment.<br />6. <b>Track Data Assets Related to a domain</b>.</div><div style="text-align: left;">7. <b>Business Data Owner</b> - A person or group of people who can help identify the business purpose/meaning of all the data points in the domain.</div><div style="text-align: left;">8. <b>Ability to Handle Technical Debt </b>- How to systematically handle technical debt, a very common scenario in organizations grown by mergers and acquisitions.</div><div style="text-align: left;">9. 
<b>Scale, Share and Speed</b> - Can the available architecture and infrastructure handle the frequency/speed of data requests from the business?</div><div style="text-align: left;"><br /></div><div style="text-align: left;">The elements mentioned above are very important; a good interplay among them is needed in order to generate valid insights. For insights there are 2 main components:</div><div style="text-align: left;">1. <b>Insight Rules</b> - Rules which are executed when certain events happen and certain business conditions are met.</div><div style="text-align: left;">2. <b>Insight Triggers</b> - Capture data points when certain events happen; for example, a credit card transaction was made at Lowes or Home Depot, someone paid an SAT entrance exam fee, or a mobile deposit was made. As part of this process there are also selection criteria around how the transactions are picked, including whether the insights are going to be triggered on a daily, weekly or monthly basis.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">The combination of the above 2 components can help generate insights, assuming the elements mentioned earlier are in place. It would also be advisable to categorize the insights by domain so that they are easier to track and maintain. Constant mining of data is done in order to generate accurate insights.</div><div style="text-align: left;">AI and ML are used very heavily when generating insights.
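The interplay of triggers and rules can be sketched in a few lines. This is a toy model, not a production insight engine; the merchant names echo the example above, and the amount threshold is an invented business condition:

```python
def trigger(txn):
    # Hypothetical insight trigger: fires on home-improvement purchases.
    return txn["merchant"] in {"Lowes", "Home Depot"}

def rule(txn):
    # Hypothetical business condition: only surface larger transactions.
    return txn["amount"] >= 100

def generate_insights(transactions):
    # An insight is produced when the trigger captures the event
    # AND the rule's business condition is met.
    return [t for t in transactions if trigger(t) and rule(t)]

txns = [
    {"merchant": "Lowes", "amount": 250},
    {"merchant": "Grocer", "amount": 40},
    {"merchant": "Home Depot", "amount": 60},
]
insights = generate_insights(txns)
```

In practice the selection criteria and the daily/weekly/monthly cadence would be configuration around this same trigger-plus-rule core.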
The effectiveness of AI and ML becomes more apparent if the underlying data infrastructure is really solid.</div><div style="text-align: left;">The purpose of this blog post is to highlight the importance of the solid data foundations needed to generate valuable insights for business and customers.</div><p><br /></p><p><br /></p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-56310619278289916682020-10-21T09:28:00.007-07:002020-10-21T09:28:50.341-07:00AI in Mortgage<p> AI has been permeating different aspects of life, business and technology, and more sophisticated implementations of AI are seeing the light of day. There have been gains made with AI in terms of its value-added proposition across different types of business. One area where there has been a lot of discussion and debate about the use of AI is the field of mortgage. There have been lots of automated tools and chatbots, such as Quicken's Rocket Mortgage, and companies have been trying to implement their own versions of a digital experience in the mortgage space. One of the challenges in mortgage is that the processes are still complex; traditional methods are still in use, and there are a lot of dependencies given the wide range of information needed for a mortgage. There are 3 components that need to come together in order to implement AI in mortgage: <b>People, Process and Technology. </b>In mortgage processes, when you apply for or refinance a loan, a lot of documents are usually needed. The processes for handling these range from sluggish to pretty decent, and they do take quite a bit of time. Apps like Rocket Mortgage and other bank offerings do seem to alleviate some of the pain points in this process.
The other approach being utilized to improve process efficiency is moving to cloud platforms, hopefully streamlining the data available from different data sources.</p><p>There are a couple of ways to bring AI into the mortgage space: one is to develop in-house methods using AI and ML techniques to automate the mortgage process. The other option is to use an API available in an API marketplace to enhance the process. Given the recent developments in AI, Google has come up with an API called Lending DocAI, which <span style="color: #333333; letter-spacing: -0.1px;"><span style="font-family: times;">is meant to help mortgage companies speed up the process of evaluating a borrower’s income and asset documents, using specialized machine learning models to automate routine document reviews; it is mentioned here: </span></span><span style="color: #333333; font-family: times;"><span style="letter-spacing: -0.1px;"><a href="https://techcrunch.com/2020/10/19/google-cloud-launches-lending-docai-its-first-dedicated-mortgage-industry-tool/">https://techcrunch.com/2020/10/19/google-cloud-launches-lending-docai-its-first-dedicated-mortgage-industry-tool/</a>. More details on the API are available here: </span></span><a href="https://cloud.google.com/solutions/lending-doc-ai" style="color: #333333; font-family: times; font-weight: bold; letter-spacing: -0.1px;">https://cloud.google.com/solutions/lending-doc-ai</a><b style="color: #333333; font-family: times; letter-spacing: -0.1px;">. </b><span style="color: #333333; font-family: times; letter-spacing: -0.1px;">It is good to see companies like Google coming up with industry-specific API offerings that can help improve </span><span style="color: #333333; font-family: times;"><span style="letter-spacing: -0.1px;">efficiencies</span></span><span style="color: #333333; font-family: times; letter-spacing: -0.1px;">.
I expect to see more along the same lines from other tech companies to solve business problems.</span></p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-25136083962432855612020-10-16T12:42:00.001-07:002020-10-16T12:42:09.331-07:00Workflow, Data Masking - Data Ops<p> DataOps is becoming more prevalent in today's data-driven projects, due to the speed at which these projects need to be executed while remaining meaningful. Tools in the DataOps space provide lots of different features; companies like Atlan and Zaloni are very popular in this space, and in fact Atlan was named in the Gartner 2020 DataOps vendors list. Coming to the different features needed in these tools, two concepts are becoming very important: Data Masking and Workflows. It is very well known that in data-driven projects, testing with valid subsets of data is very important. One of the biggest challenges faced today in data projects is the availability of test data at the right time to test functionality; it usually takes a lengthy process to get test beds ready.</p><p>With DataOps tools, one of the promised features is data masking/obfuscation, which means production data can be obfuscated and made available quickly for testing. The data masking process involves identifying data elements that are categorized as NPI or confidential and obfuscating those elements. DataOps tools provide mechanisms where masking can be done very quickly, which really helps the process of testing in test environments. The impact becomes more visible when one is working on major projects where testing has to be done through multiple cycles, and also when one is in an agile environment.
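A minimal sketch of the masking idea, assuming the set of NPI fields is known from classification (which fields count as NPI is a policy decision, and real DataOps tools offer richer, format-preserving masking than the truncated hash used here):

```python
import hashlib

NPI_FIELDS = {"ssn", "email"}  # assumption: driven by the data classification

def mask_record(record, npi_fields=NPI_FIELDS):
    # Replace each NPI value with a truncated deterministic hash, so the
    # same input always masks to the same token and joins on masked
    # columns still line up across test tables.
    masked = dict(record)
    for field in npi_fields & record.keys():
        digest = hashlib.sha256(str(record[field]).encode()).hexdigest()
        masked[field] = digest[:12]
    return masked

row = {"name": "Jane", "ssn": "123-45-6789", "balance": 1000}
masked = mask_record(row)
```

Non-NPI fields pass through untouched, which is what keeps the masked copy useful as a realistic test bed.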
One of the leading data analytics experts, <b>Sol Rashidi</b>, talks about the <b>3 S's - Speed, Scale and Shareability</b>; these are what is expected from data projects apart from providing business value. In order to satisfy those requirements, data masking being made available in DataOps tools is very welcome indeed.</p><p>The other concept I wanted to discuss here is Workflows in DataOps. When we look at the data flow in general, there are source systems, data is collected into a hub/data warehouse, and then data is provisioned out to different applications/consumers. To achieve this, typically a lot of time is spent developing ETL flows, moving data into different databases and curating the data to be provisioned. This involves a lot of time, cost and infrastructure. To alleviate these challenges, DataOps tools today introduce a concept called Workflows. The main idea is to automate the flow of data from source to target and, in addition, to execute data quality rules, profile the data and prepare the data for consumption by various systems. Workflows emphasize the importance of data quality checks, which are much more than data validations; these can be customized to verify the type of data that needs to be present in each data attribute. When performing data quality checks in the workflow, the tools also provide the ability to set up custom DQ rules and provide alerts which can be sent to the teams who supply the data. There are a couple of vendors who offer the workflow functionality: Zaloni has it in their Arena product, and Atlan has it in their trial offering, hopefully in production soon. Data quality is fundamental for any data project; building a good framework with DataOps tools provides the necessary governance and guardrails. 
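A custom DQ rule of the kind described above often boils down to a query that isolates the records violating the rule, so they can be routed to an alert or a remediation queue. The table, columns and rule below are hypothetical examples for illustration, not any specific tool's syntax:

```sql
-- Custom DQ rule: every staged loan record must have an id and a recognized status.
-- Rows returned here fail the rule and would trigger an alert to the data provider.
SELECT loan_id,
       status,
       'missing id or invalid status' AS dq_failure_reason
FROM   staging_loans
WHERE  loan_id IS NULL
   OR  status NOT IN ('ACTIVE', 'CLOSED', 'PENDING');
```

In a workflow tool, a rule like this would typically run as a step right after ingestion, with the failing rows persisted so the remediation can be tracked.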
Such concepts will go a long way in setting up quality data platforms, which are very essential for AI and Machine Learning initiatives.</p><p><b><u>Vendor Links:</u></b></p><p><a href="http://www.atlan.com">www.atlan.com</a></p><p><a href="http://www.zaloni.com">www.zaloni.com</a></p><p><br /></p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-50964061859532155862020-10-13T17:53:00.007-07:002020-10-13T17:53:47.843-07:00Data Driven Culture/Product Management<p> There are two topics I see discussed heavily today in my connections/network and at summits/round tables. The first is implementing a data-driven culture: how to generate valuable insights using the data, applying AI and Machine Learning. The other is Product Management; there are a lot of sessions/talks about this topic, and a lot of people wanting to become Product Managers. In a sense it seems like Data Analyst, Data Scientist and Product Manager are very glamorous titles to have. They are very responsible positions, and care needs to be taken to make sure that one develops the needed skills for these jobs. I would like to delve a little further into these positions.</p><p>A data-driven culture is more easily said than done; it requires a combination of top-down and bottom-up approaches. There has to be a complete embrace of the ideology by leadership, business and technology. Everyone needs to understand what needs to be done with the data, the end state of data projects and, most importantly, the willingness to collaborate. Such a culture would enable better architecting of the infrastructure, good data governance/management, and the ability to choose the right infrastructure and platform. The focus needs to be on the value add rather than just simple cost cutting; there are going to be times where certain transitions could cost money but with an eventual payoff later. This also brings up the point of being able to use AI in a responsible manner.</p><p>Since there is a lot of emphasis on data, it also feeds into the aspect of Product Management. Data can be used very effectively to build products and get feedback on products. Data can be a strong asset to improve customer experience and also provide value add behind the features. The type of data being represented in a product, or being used to build products, indicates the importance of data. Data can help with quantifiable measures, which can help in gauging how well the product is doing. There are different ways of getting feedback, like user surveys and hackathons combined with interviews, which can be very useful for Product Management. Being aware of such techniques helps in grooming oneself for product management. It is a very important role which sits at the intersection of business, customers and stakeholders.</p><p>Product Management and DataOps/data-driven culture will increasingly co-exist in the future, so the focus should be on deriving valuable insights from data, with the data culture built to facilitate such initiatives.</p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-36428862161105478842020-10-05T09:45:00.000-07:002020-10-05T09:45:47.368-07:00Dataops - What is Data Ops...<p style="text-align: left;"> We live in a world of metaphors; there are new terms and metaphors heard every day, and with them come a lot of confusion, pressure and some amount of chaos. It is important to filter out the noise and focus on the needs of the business and of customers/stakeholders. There are continuous attempts to streamline data projects, the reason being there are a lot of unwanted costs, project delays and failed implementations. The whole purpose of data projects should be focused on value add for the business, improving customer experiences and better integration of systems. 
In the Agile world, we have heard of DevOps as a way to provide continuous integration and continuous deployment; similarly, there emerged DataOps. What is DataOps? </p><div style="text-align: left;">As defined by the DataOps manifesto: <a href="https://www.dataopsmanifesto.org/">https://www.dataopsmanifesto.org/</a>:<br /><span style="background-color: white; font-family: times; text-align: center;"><i><b>Through firsthand experience working with data across organizations, tools, and industries we have uncovered a better way to develop and deliver analytics that we call DataOps. </b></i></span></div><div style="text-align: left;"><span style="background-color: white; font-family: times; text-align: center;">Very similar to the agile manifesto, there are principles involved around DataOps. In order to facilitate DataOps there are tools available</span><span style="background-color: white; font-family: times; text-align: center;"> in the market today that try to tackle different aspects of DataOps. Some of the major areas in DataOps include:</span></div><p><span class="lang-subheader" style="background-color: white; box-sizing: border-box; text-align: center;"><span style="font-family: times;"><b>Data Quality </b>- Very important: the ability to perform simple to complex data quality checks at the time of ingestion. Data quality needs to be implemented as part of workflows, wherein the data engineer can track the records that were imported successfully and remediate records that failed.</span></span></p><p><span class="lang-subheader" style="background-color: white; box-sizing: border-box; text-align: center;"><span style="font-family: times;"><b>Workflows</b> - Ability to track data from sourcing to provisioning, including the ability to profile the data and apply DQ checks. 
Workflows need to be persisted.</span></span></p><p><span class="lang-subheader" style="background-color: white; box-sizing: border-box; text-align: center;"><span style="font-family: times;"><b>Data Lineage </b>- Ability to track how data points are connected from the source systems all the way to the provisioning systems.</span></span></p><p><span class="lang-subheader" style="background-color: white; box-sizing: border-box; text-align: center;"><span style="font-family: times;"><b>Metadata Management</b> - Categorizing all the different business and logical entities within a value chain, and also having a horizontal view across the enterprise.</span></span></p><p><span class="lang-subheader" style="background-color: white; box-sizing: border-box; text-align: center;"><span style="font-family: times;"><b>Data Insights </b>- Based on the aspects mentioned above, the ability to generate valuable insights and provide business value for customers/stakeholders.</span></span></p><p><span class="lang-subheader" style="background-color: white; box-sizing: border-box; text-align: center;"><span style="font-family: times;"><b>Self Service</b> - DataOps also relies on building platforms wherein different types of personas/users are able to handle their requests in an efficient manner.</span></span></p><p><span style="font-family: times;"><b>Handle</b></span><span style="font-family: times; text-align: center;"><b> the 3 D's</b>: They are Technical Debt, Data Debt and Brain Debt. I would like to thank Data Engineer/Cloud Consultant Bobby Allen for sharing this concept with me. 
It is extremely important to handle these while taking up data projects.</span></p><p><span style="font-family: times; text-align: center;"><b>Ability to build and dispose of Environments</b> - Data projects rely heavily on data; the ability to build environments for data projects and quickly dismantle them for newer projects is key.</span></p><p><span style="font-family: times; text-align: center;">It is very important to implement DataOps in terms of the value add for the business and how data will improve the customer experience.</span></p><p><span style="font-family: times; text-align: center;">There are tools that implement DataOps; some of the tools already in the market are: <b>Atlan, Amazon Athena</b>.</span></p><p><span style="font-family: times; text-align: center;"><br /></span></p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-1542995633496719632020-09-27T07:51:00.004-07:002020-09-27T07:51:53.818-07:00Data Discovery Tools<p>In today's world, data is the new asset or, as some say, the new oil. Whether it is an asset or the new oil depends on how much valid information/insight is derived from the data assets. In order to do a viable data project, or for the data to be useful to the business, it is extremely important to understand the data. This is where data discovery comes in; in the past few years there have been significant developments in this domain. Earlier, doing data discovery was a lot of grunt work with very manual processes, and updating metadata information was very time consuming. One of the data discovery products that I have been looking at and closely following is Atlan, which I briefly mentioned in an earlier blog post; the link is <a href="http://www.atlan.com">http://www.atlan.com</a>. I signed up for an onboarding trial with Atlan, and the whole onboarding process was very smooth; folks from Atlan guided me through it. 
I was very excited to see what the product has to offer, given the pain points we have in our current process.</p><p>Once I logged in, I was presented with a Google-like search interface, and there are options for <b>Discover, Glossary, Classification, Access </b>on the left side of the home page. In the <b><u>search bar</u></b>, you type in the data asset that you want to search for; one critical prerequisite here is that you have connected Atlan to a public cloud provider like Amazon or Azure. In my case it was connected to a Snowflake DB/Warehouse. When you click the search button, all the data assets related to the search term are pulled up. The first thing I noticed is that it provides a snapshot of the row count and number of columns. </p><p>When you click on a table, you are presented with a preview window with data and column information on the right; below that you have classification, with owner and SME information. Seeing all of this information in one window provides a lot of efficiency and helps one start getting some context around the data. In the column list, there is also a description for each column which can be edited and updated. As an analyst/business user, this feature is extremely useful. Above the data preview window, you are provided with <b><u>Query/Lineage/Profile/Settings</u></b> options. Each of these has deeper functionality when you click on it. The interface flows very logically and is set up in such a way that all operations related to data discovery and analysis can be done in this tool. I will write a follow-up blog post as I explore the lineage aspect of the tool much more.</p><p>One of the key aspects of a data project, to ensure a solid foundation, is to have a very good <b>Metadata/Glossary </b>of the data points. This would contain business entities/logical entities and relationships, along with lineage. In Atlan, this is accomplished by using the Glossary option that is available on the left pane of the dashboard. 
As part of the Glossary, one can add Categories and Terms. The categories can be used for setting up Business Value Chains, Business/Logical Entities, Sourcing, API, and Provisioning, which in turn will provide context around the data. The terms will be useful for identifying individual data elements and can also be linked back to the actual tables/columns. The link feature is also available for Categories. Atlan also provides a method to bulk load Glossary items based on a template that can be downloaded for Categories and Terms.</p><p>More coming as I dig deeper into some of the use cases... Keep learning, keep growing.</p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-38348321720789824532020-09-21T11:35:00.006-07:002020-09-21T11:35:48.933-07:00Snowflake - Data Loading Strategies<p>Snowflake is a key player in the cloud database offering space, along with Redshift, which is an Amazon offering. Interestingly, Snowflake uses Amazon S3 for storage as part of the Amazon cloud offering, while Amazon continues to promote Redshift. It is going to be interesting to see how the relationship between Amazon and Snowflake pans out. There is another competitor in the mix, which is the vendor Cloudera. More on these dynamics later; now let us move forward with data loading strategies in Snowflake.</p><div style="text-align: left;">At a very high level, Snowflake supports the following in terms of the location of the files:</div><div style="text-align: left;">1. Local Environment (files in a local folder) - In such instances the files are first moved to a Snowflake stage area and then loaded into a table in the Snowflake DB.</div><div style="text-align: left;">2. Amazon S3 - Files are loaded from a user-supplied S3 bucket.</div><div style="text-align: left;">3. Microsoft Azure - Files are loaded from a user-defined Azure container.</div><div style="text-align: left;">4. 
Google Cloud Storage - Files are loaded from a user-supplied cloud storage container.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">In addition to the above, the file formats that are supported are: CSV, JSON, AVRO, ORC and Parquet; XML is a preview feature at this point. There are different ways of loading data into Snowflake; the method I would like to highlight in this blog post is bulk loading using the COPY method.</div><div style="text-align: left;">The bulk load using COPY steps are a little different for each of the file locations mentioned above.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">In the situation where data has to be copied from a local file system, the data is first copied to a Snowflake stage using the PUT command and then moved to a Snowflake table. There are different types of stage available in Snowflake: 1. User Stages, 2. Table Stages, 3. Internal Named Stages. A user stage is useful when the files are copied to multiple tables but accessed by a single user. A table stage is used when all the files are copied to a single table but used by multiple users. An internal named stage provides the maximum flexibility in terms of data loading: based on privileges the data can be loaded into any table, so this is recommended when doing regular data loads that involve multiple users and tables.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Once you have decided on the type of stage that is needed, you create the stage, copy the files using the PUT command, and then use the COPY command to move the data into the Snowflake table. The steps mentioned could vary slightly based on the location of the files. For Amazon S3 storage you would use AWS tools to move the files to the stage area and then COPY into the Snowflake DB. For Google and Microsoft Azure, use the similar tools available in each cloud platform to move the files into the stage area in Snowflake. 
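The local-file flow described above, create a stage, PUT the files into it, then COPY them into a table, looks roughly like this. The stage, table and file names are made up for illustration; refer to the Snowflake documentation for the authoritative syntax and options:

```sql
-- 1. Create an internal named stage (the most flexible of the three stage types)
CREATE STAGE txn_stage;

-- 2. From the client (e.g. SnowSQL), upload the local file into the stage
PUT file:///tmp/transactions.csv @txn_stage;

-- 3. Load the staged file into the target table
COPY INTO transactions
  FROM @txn_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```

For S3, Azure or Google Cloud Storage, step 2 is replaced by the respective cloud platform's upload tooling, and the stage points at the external location instead.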
For all the detailed information and support, please refer to the link below.</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><a href="https://docs.snowflake.com/en/user-guide-data-load.html">https://docs.snowflake.com/en/user-guide-data-load.html</a></div><div style="text-align: left;"><br /></div><div style="text-align: left;">Loading data into the Snowflake DB is the first step in exploring the features and the power of the cloud database offering, where one can test out the columnar database features.</div><div style="text-align: left;"><br /></div>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-82011271174387919932020-09-19T14:05:00.007-07:002020-09-19T14:05:51.923-07:00Online Transaction History - Database Design Strategies<p>In today's world of technology, one of the common occurrences in financial services is the concept of omni-channel. The basic premise is that customers can access information related to their accounts (checking/savings/credit/debit/mortgage) through various channels such as:</p><div style="text-align: left;"><b>1. Financial Centers<br />2. Online Banking<br />3. Mobile/Phone Applications<br />4. Statements related to accounts (Mailed)</b></div><div style="text-align: left;"><b>5. SMS/Email (where applicable)</b></div><div style="text-align: left;"><br /></div><div style="text-align: left;">When information related to accounts is presented via different channels like the above, it is critical that the customer experience be consistent. Now, looking at the technologies that are utilized to solve this problem and create such experiences, APIs have made a tremendous amount of penetration. The API layer has succeeded in making customer requests from client applications/phone apps very seamless. 
Now these APIs have to have very good response times; for example, if I am looking at the balance of my account through a phone banking app, the results need to come back quickly. Slow response times lead to a bad customer experience. It is essential that the data services behind these APIs are very efficient. This in turn translates to having a very good database design (the databases can be on-prem or in the cloud). A lot of times, when we use the applications or go to financial centers, we tend to take these response times for granted. Recently I had the opportunity to work on designing a solution for an online/mobile banking channel to display transaction/statement information.</div><div style="text-align: left;">The data was going to be accessed by the client applications via API/web service calls. The data resided on an Exadata Oracle platform.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">The information needed for providing transaction information was coming from a vendor and gets ingested into the Exadata database. In order to provide the information to the client, a process had to be run on the production database to aggregate the transaction information. Now the challenge was: while these processes are running, if a client tries to access his transaction information, how does one make sure there is no distortion or breaking of the service call? Information still needed to be provided to the customer and there could not be a time lag. In order to achieve this we had 2 options:</div><div style="text-align: left;"><br /></div><div style="text-align: left;">1. Perform a <b>SYNONYM SWAP</b> as part of the Oracle procedure that is aggregating the information. 
Basically, in this scenario, see the example available at this link: <a href="https://dba.stackexchange.com/questions/177959/how-do-i-swap-tables-atomically-in-oracle">https://dba.stackexchange.com/questions/177959/how-do-i-swap-tables-atomically-in-oracle</a></div><div style="text-align: left;">We initially tried this option; the data was reloaded every day, but we started to see service call failures at the time the synonym swap happened.</div><div style="text-align: left;">2. <b>Perform delta processing of records</b> every day and merge the changes into the main table, using batch sizes during the final merge so that records are ingested into the main table in small chunks, which should minimize any contention for resources. In this option we processed only changed/new records and did not perform any synonym swap. Though it took a little longer for the job to complete, there was no distortion of the service and the SLA was well within what the customer expected. In order to get the accounts that had changed, we used a table to maintain the list of tables involved in the processing and to capture the accounts that had changed in those tables.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">These were a couple of options we experimented with, and we went with Option 2. It is very critical to design your database according to the expectations of the online/mobile applications. 
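To make the two options concrete, here is a rough sketch of each; the table, synonym and column names are invented for illustration. Option 1 repoints the synonym the services query at whichever copy was just rebuilt, while Option 2 merges only the delta records into the main table:

```sql
-- Option 1: rebuild the standby copy, then repoint the synonym the API queries.
-- The brief moment of the swap is where we observed service call failures.
CREATE OR REPLACE SYNONYM txn_history FOR txn_history_standby;

-- Option 2: merge only changed/new records into the main table,
-- so the services keep reading a single table with no swap involved.
MERGE INTO txn_history t
USING txn_history_delta d
   ON (t.account_id = d.account_id AND t.txn_id = d.txn_id)
WHEN MATCHED THEN
  UPDATE SET t.amount = d.amount, t.description = d.description
WHEN NOT MATCHED THEN
  INSERT (account_id, txn_id, amount, description)
  VALUES (d.account_id, d.txn_id, d.amount, d.description);
```

In practice the MERGE would be driven in batches, as described above, to keep resource contention low while clients continue to read the table.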
We experimented with multiple options and narrowed them down to the 2 options mentioned above.</div><div style="text-align: left;">In case you happen to read this post on my blog and have any other suggestions, please leave a comment and I will definitely look into it.</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-21486400320145651062020-09-14T08:31:00.000-07:002020-09-14T08:31:57.890-07:00Snowflake - Cloud Database/Datawarehouse<p>With the advent of public clouds like AWS, Google Cloud and Azure, and the adoption of these public cloud services by various businesses, companies and organizations, the main talking points are how data can be stored in the cloud, security concerns and architecture. These are the topics of main interest when storing data in the cloud. In certain organizations the move to the cloud has been very quick; in certain sectors the adoption has been pretty slow, primarily due to security concerns. Now these challenges are being overcome steadily. In terms of data services, one of the cloud platforms that has been very popular for the last few years and is also getting ready for an IPO is Snowflake. The link for the company is <a href="http://www.snowflake.com">www.snowflake.com</a>. Snowflake is a global platform for all your data services, data lakes and data science applications. Snowflake is not a traditional relational database but supports basic SQL operations: DDL, DML, UDFs and stored procedures. Snowflake uses Amazon S3, and now Azure, as the public cloud platform for providing its data services over the cloud. Snowflake's architecture, in terms of the database, uses columnar storage to enable faster processing of queries. Data is loaded into Amazon S3 through files into user areas and then moved into the Snowflake schemas/databases for the enablement of queries. 
Please refer to the Snowflake company website for additional information on the architecture, blogs and other kits that are available for one to check out all the features. Snowflake takes advantage of the Amazon S3 storage power and uses its own columnar and other data warehouse-related features for computational purposes. One can also refer to YouTube for additional details on the Snowflake architecture. Here is a link that can be used to learn about the Snowflake architecture: <a href="https://www.youtube.com/watch?v=dxrEHqMFUWI&t=14s">https://www.youtube.com/watch?v=dxrEHqMFUWI&t=14s</a>.</p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0tag:blogger.com,1999:blog-2437651727370625818.post-72153748701013433012020-09-10T12:39:00.001-07:002020-09-10T12:39:42.969-07:00<p><b><span style="color: #2b00fe; font-family: georgia; font-size: large;"><u> AI, Machine Learning, Data Governance</u></span></b></p><p><span style="font-family: georgia;"><b>Artificial Intelligence and Machine Learning have continued to penetrate all walks of life, and technology has undergone a tremendous amount of change. It is being said that data is the new oil, which has actually propelled AI and ML to greater heights. In order to use AI and ML more effectively in business today, it is imperative that all the stakeholders, consumers and technologists understand the importance of data. There should be very good collaboration between all the parties involved to make good use of data and take it forward to use AI and ML effectively. For data to be used effectively in an organization, we need proper guardrails to source the data, clean the data, remove unwanted data, and store and provision data to various users. Here is where data governance comes in; there has to be an enterprise-wide appreciation for having such processes and standards. It should not come off as process-heavy or bureaucratic, but as something that is efficient and at the same time able to manage data effectively. 
As organizations grow, there is going to be both a vertical and a horizontal implementation of data governance, and the two need to be in sync. This in turn is very essential for AI and ML efforts, because it will make the outcomes more meaningful to the organization. In addition, better contexts will be defined, which will make the AI and ML projects more viable, reduce inefficiencies and provide cost benefits.</b></span></p><p><span style="font-family: georgia;"><b>One of the important steps in achieving the above is to have very good data cataloguing measures: persist all the logical and business entities, and the lineage of all the data being sourced, so that they are all in place. The data also needs to be classified as NPI or non-NPI depending on the business context. In today's world, the majority of the work mentioned above is manual, and a lot of time is spent trying to get SME inputs and approvals. This causes time delays and project cost increases; it can be alleviated by using the data discovery tools that are available today. There are quite a few tools available, but the one whose capabilities I have started to look into more is the tool from Atlan: </b></span><span style="font-family: georgia;"><a href="https://atlan.com/" style="font-weight: bold;">https://atlan.com/</a><b>. Atlan provides an excellent platform for performing data discovery, lineage, profiling, governance and exploration. In what I have seen with the tool and the demo provided to me, the whole data life cycle has been very nicely captured. The user interface is very intuitive, and the tool also helps the user navigate through the different screens without any technical input needed. The search is very Google-like in terms of looking up the different data assets that are available. I will be doing some more use cases and a deep dive into the tool in the next couple of weeks and will provide more updates.</b></span></p>BI_Buffhttp://www.blogger.com/profile/06254495164438608712noreply@blogger.com0