With the emergence of all the big data, cloud technologies, there is a major push to get data moved from legacy systems to cloud. The reason for the migration is to provide better data access, allow for more advanced analytics, provide better insights and better customer experience. This type of migration is also to provide more self service capabilities. One of the initiative that i have been involved with is to move data from oracle databases to Hadoop. There are lot of pieces involved in moving data from RDBMS to hadoop. Here are some steps involved that i would like to share along with the tools involved.
1. Extraction of data from RDBMS to HDFS - What type of tools available: Sqoop works for pulling data from rdbms to HDFS, now one thing to watch out for is performance, how big are the tables. The data is extracted in the form of text files. So it is good to be clear of what type of data needs to be on HDFS. There are other tools like Zaloni(Arena),Atlan, some of these work well with Public Cloud providers.
2. Design the the structures where the files will be stored - The approach i am using is to utilize HIVE where in we have different schemas depending on the classification of data. Create tables in thee schemas and pull i the data using the text files created from Step 1. There are 2 main types of HIVE tables - Internal and External. HIVE supports SQL, it is very similar to SQL in RDBMS with few variations. There is UI available called HUE which is Hadoop User Experience. This is web based, so with the right permissions, one can connect to HIVE. Once you login, you can view the schemas/tables that have been created in HIVE. You can write SQL type queries and view results. HIVE is Apache based, where Impala is Cloudera based. The difference is that in HIVE you have metadata and query capability whereas in IMPALA you mainly work with queries
3. Typically when you first load data into HIVE, they are text files, basically in Text format. These text files are loaded into tables. Now when you are ready to make the data available for processing, i decided to create separate schemas where the data is still stored in tables but it is stored in Parquet format. This allows for compression , it is in columnar format. For analytics purposes this works well and other thing is for our installations we use parquet. There are other big data formats that can be used:
Avro, Parquet, ORC, see the link below for comparisons.
4. Once Step 1 through 3 is completed then you are in a position to start provisioning the data, typically this data can feed into a data lake. The data lake can be on premise or it could be a private or a public cloud based on the business need,