Yesterday I had the opportunity to attend the SQL Saturday Event in Charlotte. It was a packed day with lot of topics, all of them focused on the Business Intelligence Space. For me the day started of with a session on Upgrading to SSIS 2012 by Paul Rizza From Microsoft, it was a very informative session with lots of pointers towards what needs to be looked at in SSIS 2012, I am planning to write a more detailed post on this topic later. One of the important lectures I attended was on Hadoop vs SQL and how Hadoop can be utilised in Data Warehouses. As expected it was a packed session with folks looking forward to what is Hadoop and see what all the hype was about. Hadoop targets the unstructured data space and has mechanism to plow through the huge amounts of unstructured data that is available these days. Companies today are trying to make meaningful dissection of unstructured data and help the strategic direction of the company. The presentation was provided by Rob Kerr (CTO with BlueGranite) and it was full of valuable information. One of the differences bet ween SQL and Hadoop is that SQL follows the Schema Write methodology, meaning first the table structure or the schema is created, then data is inserted into the tables. In Hadoop the Schema Read methodology is followed meaning as the data is loaded the schema is determined at run time which allows it to tackle unstructured data. The mechanism that is followed to decipher unstructured data is that it uses the MapReduce methodology. During the Map() function the data is being assigned to different nodes by a central node, then after that Reduce() function kicks to parse the data, once the data parsing is done the data from the different nodes is sent back to the requester. The other difference between SQL and Hadoop is that SQL requires consistency and structure follows the ACID principle, Hadoop does not require the structure and focuses on data availability, Hadoop framework also processes information in a batch.
It is very much possible that both SQL and Hadoop can coexists in Datawarehouse environment. For example let us say one has a datawarehouse regarding call center data which houses all the structured information, at the same time there is lot of data captured in a call center to chats/online mechanisms. Instead of shedding away the unstructured data, one can use Hadoop to parse through the unstructured data, look for keywords that would matter to the business and store it. One can work on integrating Hadoop with the datawarehouse and get unstructured data to be fed into the Datawarehouse. To get more information about the presentation about Hadoop and SQL please use the following link:
There are other links in the above site which provide more information about Hadoop and SQL.
There a layer on top of Hadoop where in users can use sql to get information about unstructured data that is through a tool called Hive. Hive sits on top of Hadoop layer which can accept sql commands. Once the sql command is entered by the user the query is passed on to the MapReduce mechanism of Hadoop. Once hadoop receives the request and data is parsed out the result then returned back to the Hive Layer where data is presented back. Please use the following link to learn more about Hive:
Here are some examples from the above site:
Creating Hive tables and browsing through them
hive> CREATE TABLE pokes (foo INT, bar STRING);
Creates a table called pokes with two columns, the first being an integer and the other a string
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
Tools are also getting better with the Hadoop framework which is making it better to get meaningful data out of unstructured information.
To summarise after the session I felt that i gotten better information Hadoop and how the framework can coexists with SQL and utilise the information embedded in unstructured data.