Thursday, April 22, 2021

Data Ingestion - Hive/Impala/HUE

With the emergence of big data and cloud technologies, there is a major push to move data from legacy systems to the cloud. The goals of this type of migration are better data access, more advanced analytics, better insights, and a better customer experience, along with more self-service capabilities. One of the initiatives I have been involved with is moving data from Oracle databases to Hadoop. There are a lot of pieces involved in moving data from an RDBMS to Hadoop. Here are some of the steps involved that I would like to share, along with the tools used.

1. Extraction of data from RDBMS to HDFS - What tools are available? Sqoop works for pulling data from an RDBMS into HDFS; one thing to watch out for is performance, which depends on how big the tables are. The data is extracted in the form of text files, so it is good to be clear about what data actually needs to land on HDFS. There are other tools like Zaloni (Arena) and Atlan, and some of these work well with public cloud providers.
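As a rough illustration (the connection string, table name, and target directory are placeholders, not details from this migration), a Sqoop import of one Oracle table into HDFS as text files might look like this:

    # Pull one table from Oracle into an HDFS directory as delimited text
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username etl_user -P \
      --table CUSTOMERS \
      --target-dir /data/raw/customers \
      --as-textfile \
      --num-mappers 4   # parallel extract tasks; tune for large tables

The --num-mappers setting is usually where the performance tuning mentioned above happens, since it controls how many parallel extract tasks run against the source database.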

2. Design the structures where the files will be stored - The approach I am using is to utilize Hive, where we have different schemas depending on the classification of the data. Create tables in these schemas and load the data from the text files created in Step 1. There are two main types of Hive tables - internal and external. Hive supports a SQL-like language that is very similar to the SQL used in an RDBMS, with a few variations. There is also a web-based UI available called HUE (Hadoop User Experience); with the right permissions, one can connect to Hive through it, view the schemas/tables that have been created, write SQL-style queries, and view the results. Hive is an Apache project, whereas Impala comes from Cloudera. The practical difference is that Hive owns the metastore (the table and schema definitions) and handles batch-style queries, whereas Impala typically shares that same metastore and is used mainly for fast, interactive queries.
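As a minimal sketch (the schema, table, column, and path names here are hypothetical), an external Hive table layered over the text files from Step 1 could be defined and queried like this, for example from HUE:

    -- Hypothetical staging schema for the raw text extracts
    CREATE DATABASE IF NOT EXISTS staging_raw;

    -- External table: Hive keeps the metadata, while the files stay in the
    -- HDFS directory written by the Step 1 extract; dropping the table
    -- does not delete the underlying files
    CREATE EXTERNAL TABLE IF NOT EXISTS staging_raw.customers (
        customer_id   BIGINT,
        customer_name STRING,
        created_dt    STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/raw/customers';

    -- Quick sanity check
    SELECT COUNT(*) FROM staging_raw.customers;

If the same table is then queried from Impala, an INVALIDATE METADATA (or REFRESH) is usually needed first so Impala picks up the new entry from the shared metastore.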

3. Typically, when you first load data into Hive, the files are plain text, and these text files are loaded into tables. When the data is ready to be made available for processing, I decided to create separate schemas where the data is still stored in tables, but in Parquet format. Parquet is a columnar format and allows for compression, which works well for analytics; it is also the format we use in our installations. A sketch of this conversion follows the link below. There are other big data formats that can be used:

Avro, Parquet, and ORC; see the link below for a comparison.

https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/
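Using the same hypothetical schema and table names as the sketch in Step 2, the conversion to a Parquet-backed table can be done entirely in Hive:

    -- Hypothetical curated schema holding Parquet copies of the staged data
    CREATE DATABASE IF NOT EXISTS curated;

    -- Create-table-as-select rewrites the text data into compressed,
    -- columnar Parquet files managed by Hive
    CREATE TABLE IF NOT EXISTS curated.customers
    STORED AS PARQUET
    AS
    SELECT * FROM staging_raw.customers;

Downstream analytics and provisioning (Step 4) would then read from the Parquet-backed curated schema rather than from the raw text staging tables.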

4. Once Steps 1 through 3 are completed, you are in a position to start provisioning the data; typically this data feeds into a data lake. The data lake can be on premise, or it can be in a private or public cloud, based on the business need.

Monday, April 5, 2021

Data Processing Unified - Omniscope

There are a lot of tools coming out in the market related to data management, analytics, and visualization. It is also becoming harder for data practitioners to choose tools, so it is natural that we are seeing consolidation into tools that handle data sourcing, management, transformation, and visualization together. One such tool that I got to see demonstrated is Omniscope. To see what the tool offers, please see here: https://visokio.com/. The tool's goal is to unify the following components of a data journey:

  1. Data Sourcing
  2. Data Blending - This is the step where you combine data from different data sources. It is very handy when, say, you are trying to combine customer information that lives on a different system from Borrower/Loan information. That is just one example; you will have plenty of scenarios where you have to perform this step (a rough SQL equivalent is sketched after this list).
  3. Data Transformation - This is the step where you take the source or the blended data and add additional attributes/calculations based on your end business goal.
  4. Data Lineage - The functionality that traces the journey of a data element as it flows from the source to different target systems. This also helps in understanding how a data element is used in different calculations/reports.
  5. Data Visualization - The layer that displays the data in different forms (like graphs, charts, plots).
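Conceptually, the blending in item 2 behaves like a join across sources. A rough SQL equivalent (the table and column names here are hypothetical, and Omniscope expresses this through its own visual workflow rather than hand-written SQL) would be:

    -- Combine customer records with loan records that come from a different source system
    SELECT c.customer_id,
           c.customer_name,
           l.loan_id,
           l.outstanding_balance
    FROM   customers c
    JOIN   loans     l
      ON   c.customer_id = l.customer_id;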

The tool helps data curators, scientists, and analysts maintain the pipeline of data coming through and control how it is consumed by users. There are plenty of visualizations available in this tool, and the color schemes used are very good. It also attempts to maintain lineage of the data used in a report, thereby providing an opportunity for end-to-end governance. In addition to all of this, there are features related to scheduling and automation. Please refer to the link above for a more in-depth look at the features of this tool.