Sunday, September 27, 2020

Data Discovery Tools

In today's world, data is the new asset, or as some say, the new oil. Whether it is an asset or the new oil depends on how much valid information and insight is derived from the data. For a data project to be viable, and for the data to be useful to the business, it is extremely important to understand the data. This is where data discovery comes in, and in the past few years there have been significant developments in this domain. Earlier, doing data discovery involved a lot of grunt work with very manual processes, and updating metadata information was very time consuming. One of the data discovery products that I have been looking at and closely following is Atlan, which I had briefly mentioned in my earlier blog; the link is http://www.atlan.com. I signed up for an onboarding trial with Atlan, and the whole onboarding process was very smooth, with folks from Atlan guiding me through it. I was very excited to see what the product has to offer, given the pain points we have in our current process.

Once I logged in, I was presented with a Google-like search interface, with options for Discover, Glossary, Classification and Access on the left side of the home page. In the search bar, you type in the data asset that you want to search for. One critical step here is that Atlan has been connected to a data source on a public cloud provider like Amazon or Azure; in my case it was connected to a Snowflake database/warehouse. When you click the search button, all the data assets related to the search term are pulled up. The first thing I noticed is that it provides a snapshot of the row count and the number of columns.

When you click on a table, you are presented with a data preview window, with column information on the right; below that you have classification, along with owner and SME information. Seeing all of this information in one window provides a lot of efficiency and helps one start building some context around the data. In the column list there is also a description for each column, which can be edited and updated. As an analyst or business user, I find this feature extremely useful. Above the data preview window you are provided with Query/Lineage/Profile/Settings options, and each one of these has deeper functionality when you click on it. The interface flows very logically and is set up in such a way that all operations related to data discovery and analysis can be done in this tool. I will write a follow-up blog post as I explore the lineage aspect of the tool in more detail.

One of the key aspects of a data project, to ensure a solid foundation, is to have a very good metadata/glossary of the data points. This would contain business entities, logical entities and relationships, along with lineage. In Atlan, this is accomplished by using the Glossary option that is available on the left pane of the dashboard. As part of the Glossary, one can add Categories and Terms. The Categories can be used for setting up business value chains, business/logical entities, sourcing, API and provisioning, which in turn provide context around the data. The Terms are useful for identifying individual data elements and can also be linked back to the actual tables/columns; the link feature is available for Categories as well. Atlan also provides a way to bulk load glossary items based on a template that can be downloaded for Categories and Terms.

More coming as I dig deeper into some of the use cases... Keep Learning, Keep Growing.

Monday, September 21, 2020

Snowflake - Data Loading Strategies

Snowflake is a key player in the cloud database space, along with Redshift, which is an Amazon offering. Interestingly, Snowflake uses Amazon S3 for storage as part of its Amazon cloud offering, while Amazon continues to promote Redshift. It is going to be interesting to see how the relationship between Amazon and Snowflake pans out, and there is another competitor in the mix as well: the vendor Cloudera. More on these dynamics later; for now let us move forward with data loading strategies in Snowflake.

At a very high level, Snowflake supports the following locations for the files being loaded:
1. Local environment (files in a local folder) - In this case the files are first moved to a Snowflake stage and then loaded into a table in the Snowflake database.
2. Amazon S3 - Files are loaded from a user-supplied S3 bucket.
3. Microsoft Azure - Files are loaded from a user-defined Azure container.
4. Google Cloud Storage - Files are loaded from a user-supplied Cloud Storage bucket.

In addition to the above, the file formats that are supported are CSV, JSON, Avro, ORC and Parquet, with XML as a preview feature at this point. There are different ways of loading data into Snowflake; the method I would like to highlight in this blog post is bulk loading using the COPY command.
The bulk load using COPY steps are a little different for each of the file locations mentioned above.

In the situation where data has to be copied from a local file system, the data is first copied to a Snowflake stage using the PUT command and then moved into a Snowflake table. There are three types of internal stages available in Snowflake: 1. user stages, 2. table stages, 3. named internal stages. A user stage is useful when the files are copied to multiple tables but accessed by a single user. A table stage is used when all the files are copied to a single table but used by multiple users. A named internal stage provides the maximum flexibility in terms of data loading: based on privileges, the data can be loaded into any table, and this is recommended when doing regular data loads that involve multiple users and tables.
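As a quick illustration, here is a minimal sketch of how the three internal stage types are referenced in Snowflake SQL; the table name customer_txn and the stage name my_named_stage are just placeholders for this example.

    -- User stage: tied to your user, referenced with @~
    LIST @~;

    -- Table stage: tied to a specific table, referenced with @%<table_name>
    LIST @%customer_txn;

    -- Named internal stage: created explicitly and usable across users and tables
    CREATE STAGE IF NOT EXISTS my_named_stage;
    LIST @my_named_stage;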

Once you have decided on the type of stage that is needed, you create the stage, copy the files into it using the PUT command, and then use the COPY command to move the data into the Snowflake table. The steps could vary slightly based on the location of the files: for Amazon S3 you would use AWS tools to move the files to the stage area and then COPY into the Snowflake database, and for Google Cloud Storage and Microsoft Azure you would use the similar tools available in each cloud platform to move the files into the stage area in Snowflake. For all the detailed information and support, please refer to the Snowflake documentation.
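To make the local file system flow concrete, here is a minimal sketch of the stage/PUT/COPY sequence. It assumes a hypothetical transactions table, a local CSV file at /tmp/transactions.csv, and placeholder names for the stage and file format; note that PUT has to be run from a client such as SnowSQL rather than the web UI.

    -- Define a reusable file format for the incoming CSV files
    CREATE OR REPLACE FILE FORMAT my_csv_format
      TYPE = 'CSV'
      FIELD_DELIMITER = ','
      SKIP_HEADER = 1;

    -- Create a named internal stage that defaults to this file format
    CREATE OR REPLACE STAGE my_named_stage
      FILE_FORMAT = my_csv_format;

    -- Upload the local file into the stage (compressed on the way up)
    PUT file:///tmp/transactions.csv @my_named_stage AUTO_COMPRESS = TRUE;

    -- Bulk load the staged file into the target table
    COPY INTO transactions
      FROM @my_named_stage
      FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
      ON_ERROR = 'ABORT_STATEMENT';

For S3, Azure or Google Cloud Storage, the PUT step is replaced by placing the files in the bucket/container with the cloud provider's own tools and pointing an external stage at that location before running COPY.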


Loading data into a Snowflake database is the first step in exploring the features and the power of this cloud database offering, where one can test out its columnar database features.

Saturday, September 19, 2020

Online Transaction History - Database Design Strategies

In today's world of technology, one common occurrence in financial services is the concept of omni-channel. The basic premise is that customers can access information related to their accounts (checking/savings/credit/debit/mortgage) through various channels such as:

1. Financial Centers
2. Online Banking
3. Mobile/Phone Applications
4. Statements related to accounts (Mailed)
5. SMS/Email (where applicable)

When information related to accounts is presented via different channels like the above, it is critical to keep the customer experience consistent. Looking at the technologies used to solve this problem and create such experiences, APIs have made a tremendous amount of penetration. The API layer has succeeded in making customer requests from client applications and phone apps very seamless. These APIs have to have very good response times; for example, if I am looking at the balance of my account through a phone banking app, the results need to come back quickly. Slow response times lead to a bad customer experience. It is essential that the data services behind these APIs are very efficient, and that in turn translates to having a very good database design (the databases can be on premises or in the cloud). A lot of the time, when we use these applications or go to financial centers, we tend to take these response times for granted. Recently I had the opportunity to work on designing a solution for an online/mobile banking channel to display transaction/statement information.
The data was going to be accessed by the client applications via API/web service calls, and it resided on an Oracle Exadata platform.

The transaction information came from a vendor feed that gets ingested into the Exadata database. In order to provide the information to the client, a process had to be run on the production database to aggregate the transaction information. The challenge was that while these processes are running, if a client tries to access their transaction information, how does one make sure there is no distortion or breaking of the service call? Information still needed to be provided to the customer, and there could not be a time lag. In order to achieve this we had two options:

1. Perform a synonym swap as part of the Oracle procedure that aggregates the information. An example of this approach is available at: https://dba.stackexchange.com/questions/177959/how-do-i-swap-tables-atomically-in-oracle
We initially went with this option; the data was reloaded every day, but we started to see service call failures at the time the synonym swap happened.
2. Perform delta processing of records every day and merge the changes into the main table, using batch sizes during the final merge so that records are ingested into the main table in small chunks, which minimizes any contention for resources. In this option we processed only changed/new records and did not perform any synonym swap. Though the job took a little longer to complete, there was no distortion of the service and the SLA was well within what the customer expected. In order to identify the accounts that had changed, we used a control table to track the tables involved in the processing and capture the accounts that changed in them. This is the option we ended up using; a sketch of both approaches follows below.
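Here is a minimal sketch of both approaches in Oracle SQL. All object names (txn_summary, txn_summary_a/b, vendor_txn_stage, changed_accounts) and the aggregation logic are illustrative placeholders, not the actual production code.

    -- Option 1 (synonym swap): clients always query through the synonym TXN_SUMMARY;
    -- the nightly job rebuilds the inactive copy and then repoints the synonym.
    CREATE OR REPLACE SYNONYM txn_summary FOR txn_summary_a;

    TRUNCATE TABLE txn_summary_b;
    INSERT /*+ APPEND */ INTO txn_summary_b (account_id, txn_date, txn_amount)
      SELECT account_id, txn_date, SUM(amount)
      FROM   vendor_txn_stage
      GROUP  BY account_id, txn_date;
    COMMIT;

    -- The swap itself: a call that resolved the old synonym at this moment can fail.
    CREATE OR REPLACE SYNONYM txn_summary FOR txn_summary_b;

    -- Option 2 (delta + batched MERGE): only accounts flagged in the change-tracking
    -- table are reprocessed; :current_batch is bound per batch and the job commits
    -- after each batch to keep contention on the main table low.
    MERGE INTO txn_summary t
    USING (
            SELECT s.account_id, s.txn_date, SUM(s.amount) AS txn_amount
            FROM   vendor_txn_stage s
            WHERE  s.account_id IN (SELECT account_id
                                    FROM   changed_accounts
                                    WHERE  batch_id = :current_batch)
            GROUP  BY s.account_id, s.txn_date
          ) d
    ON (t.account_id = d.account_id AND t.txn_date = d.txn_date)
    WHEN MATCHED THEN
      UPDATE SET t.txn_amount = d.txn_amount
    WHEN NOT MATCHED THEN
      INSERT (account_id, txn_date, txn_amount)
      VALUES (d.account_id, d.txn_date, d.txn_amount);
    COMMIT;

With Option 2, the driving set of changed accounts comes from the control table, and committing after each small batch keeps the merge from holding locks on the main table for long stretches while the service continues to read from it.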

These were a couple of the options we experimented with, and we went with Option 2. It is very critical to design your database according to the expectations of the online/mobile applications. We experimented with multiple options and narrowed them down to the two mentioned above.
If you happen to read this post on my blog and have any other suggestions, please leave a comment and I will definitely look into them.


Monday, September 14, 2020

Snowflake - Cloud Database/Data Warehouse

With the advent of public clouds like AWS, Google Cloud and Azure, and the adoption of these public cloud services by various businesses, companies and organizations, some of the main talking points are how data can be stored in the cloud, security concerns and architecture. These are the topics of main interest when storing data in the cloud. In certain organizations the move to the cloud has been very quick, while in certain sectors the adoption has been pretty slow, primarily due to security concerns; these challenges are now being overcome steadily. In terms of data services, one of the cloud platforms that has been very popular for the last few years, and is also getting ready to go for an IPO, is Snowflake. The link for the company is www.snowflake.com. Snowflake is a global platform for all your data services, data lakes and data science applications. Snowflake is not a traditional relational database but supports basic SQL operations, DDL, DML, UDFs and stored procedures. Snowflake uses Amazon S3, and now Azure, as the public cloud platform for providing data services over the cloud. In terms of the database, Snowflake's architecture uses columnar storage to enable faster processing of queries. Data is loaded as files into user areas on Amazon S3 and then moved into the Snowflake schemas/databases so it can be queried. Please refer to the Snowflake company website for additional information on the architecture, blogs and other kits that are available for one to check out all the features. Snowflake takes advantage of the storage power of Amazon S3 and uses its own columnar and other data warehouse related features for computational purposes. One can also refer to YouTube for additional details on the Snowflake architecture; here is a link that can be used: https://www.youtube.com/watch?v=dxrEHqMFUWI&t=14s

Thursday, September 10, 2020

AI, Machine Learning, Data Governance

Artificial intelligence and machine learning have continued to penetrate all walks of life, and technology has undergone a tremendous amount of change. It is being said that data is the new oil, which has actually propelled AI and ML to greater heights. In order to use AI and ML more effectively in business today, it is imperative that all the stakeholders, consumers and technologists understand the importance of data. There should be very good collaboration between all the parties involved to make good use of data and take it forward to use AI and ML effectively. For data to be used effectively in an organization, we need proper guardrails to source the data, clean the data, remove unwanted data, and store and provision data to various users. This is where data governance comes in; there has to be an enterprise-wide appreciation for having such processes and standards. It should not come off as process heavy or bureaucratic, but as something that is efficient and at the same time able to manage data effectively. As organizations grow, there is going to be both a vertical and a horizontal implementation of data governance, and the two need to be in sync. This in turn is very essential for AI and ML efforts because it will make the outcomes more meaningful to the organization. In addition, better contexts will be defined, which will make the AI and ML projects more viable, reduce inefficiencies and provide cost benefits.

One of the important steps in achieving the above is to have very good data cataloguing measures in place, persisting all the logical and business entities and the lineage of all the data being sourced. The data also needs to be classified as NPI or non-NPI depending on the business context. In today's world the majority of the work mentioned above is manual, and a lot of time is spent trying to get SME inputs and approvals. This causes time delays and increases project cost, which can be alleviated by using the data discovery tools that are available today. There are quite a few tools available, but the one whose capabilities I have started to look into more is Atlan: https://atlan.com/. Atlan provides an excellent platform for performing data discovery, lineage, profiling, governance and exploration. From what I have seen with the tool and the demo provided to me, the whole data life cycle has been very nicely captured. The user interface is very intuitive, and the tool also helps the user navigate through the different screens without any technical inputs needed. The search is very Google-like in terms of looking up the different data assets that are available. I will be doing some more use cases and a deeper dive into the tool in the next couple of weeks and will provide more updates.