Sunday, October 28, 2012

SQL vs Hadoop, Hive...

Yesterday I had the opportunity to attend the SQL Saturday event in Charlotte. It was a packed day with lots of topics, all of them focused on the Business Intelligence space. For me the day started off with a session on upgrading to SSIS 2012 by Paul Rizza from Microsoft; it was a very informative session with lots of pointers towards what needs to be looked at in SSIS 2012, and I am planning to write a more detailed post on this topic later.

One of the important lectures I attended was on Hadoop vs SQL and how Hadoop can be utilised in data warehouses. As expected it was a packed session, with folks looking forward to learning what Hadoop is and seeing what all the hype is about. Hadoop targets the unstructured data space and has mechanisms to plow through the huge amounts of unstructured data that are available these days. Companies today are trying to extract meaningful insight from unstructured data to help set the strategic direction of the company. The presentation was given by Rob Kerr (CTO at BlueGranite) and it was full of valuable information.

One of the differences between SQL and Hadoop is that SQL follows the schema-on-write methodology, meaning the table structure or schema is created first and then data is inserted into the tables. Hadoop follows the schema-on-read methodology, meaning the schema is determined at run time as the data is read, which allows it to tackle unstructured data. The mechanism used to decipher unstructured data is the MapReduce methodology. During the Map() function, the data is assigned to different nodes by a central node; after that the Reduce() function kicks in to process the data, and once the processing is done the results from the different nodes are sent back to the requester. The other difference between SQL and Hadoop is that SQL requires consistency and structure and follows the ACID principles, while Hadoop does not require structure and focuses on data availability; the Hadoop framework also processes information in batches.
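To make the schema-on-write vs schema-on-read distinction concrete, here is a minimal sketch; the table names, columns and HDFS path are made up for illustration:

-- Schema-on-write (SQL Server): the structure must exist before any data is loaded.
CREATE TABLE CallLog (CallId INT, Transcript VARCHAR(MAX));
INSERT INTO CallLog VALUES (1, 'customer asked about billing');

-- Schema-on-read (Hive): the raw files already sit in HDFS; the table definition
-- is only an interpretation applied at the time the data is queried.
CREATE EXTERNAL TABLE call_log (call_id INT, transcript STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/call_logs';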

It is very much possible for SQL and Hadoop to coexist in a data warehouse environment. For example, let us say one has a data warehouse of call center data which houses all the structured information; at the same time there is a lot of data captured in a call center through chats and other online mechanisms. Instead of discarding the unstructured data, one can use Hadoop to parse through it, look for keywords that matter to the business, and store them. One can work on integrating Hadoop with the data warehouse and get unstructured data fed into it. To get more information about the presentation on Hadoop and SQL please use the following link:
http://msbiacademy.com/Lesson.aspx?id=108
There are other links on the above site which provide more information about Hadoop and SQL.

There is a layer on top of Hadoop through which users can use SQL to get information out of unstructured data: a tool called Hive. Hive sits on top of the Hadoop layer and accepts SQL-like commands. Once a SQL command is entered by the user, the query is passed on to the MapReduce mechanism of Hadoop. Once Hadoop receives the request and the data is parsed out, the result is returned to the Hive layer, where the data is presented back to the user. Please use the following link to learn more about Hive:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
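Tying this back to the call center example above, a keyword count over chat transcripts might look roughly like this in Hive; the chat_logs table and its line column are hypothetical names:

hive> SELECT word, count(1) AS occurrences
    > FROM (SELECT explode(split(line, ' ')) AS word FROM chat_logs) w
    > GROUP BY word
    > ORDER BY occurrences DESC;

Hive compiles this into MapReduce behind the scenes: splitting each line into words is the map side, and the GROUP BY aggregation is the reduce side.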

Here are some examples from the above site:

Creating Hive tables and browsing through them

hive> CREATE TABLE pokes (foo INT, bar STRING);
Creates a table called pokes with two columns, the first being an integer and the other a string.
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
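Creates a table called invites with the same two columns plus a partition column called ds.

The same Getting Started guide goes on to load the sample files that ship with Hive and then query a single partition; roughly, it looks like this:

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
The first two commands load local files into the tables (the second into a specific partition); the SELECT then reads only that partition.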

Tooling around the Hadoop framework is also getting better, which is making it easier to get meaningful data out of unstructured information.

To summarise, after the session I felt that I had a better understanding of Hadoop, of how the framework can coexist with SQL, and of how to utilise the information embedded in unstructured data.

Thursday, October 18, 2012

SQL Server DB Compare Tool...

When working on database/data warehousing projects one gets to work in multiple environments: Development, QA, UAT and Production. One of the tasks that is done very frequently is comparing database objects. In a database/data warehousing project the database objects get constantly moved from one environment to another, and very often the schemas in the different environments need to be compared to see if there are any differences. I was referred to a tool by a SQL Server blog contributor/developer, Imran Mohammad. I decided to check the tool out as I am currently doing a lot of data sourcing projects in which I have to do a lot of database comparisons. The tool is called SQLDBDiff, and the link for it is given below. On the website there are two versions of the tool listed, freeware and shareware; the site lists the differences between the two versions, and I decided to download the freeware version.
http://www.sqldbtools.com/Downloads.aspx?ProductId=1
The freeware version comes as a .zip file; I extracted the files and ran the Compare Databases tool (the file is called SQLDBDiff). From the File menu one chooses Compare two databases, which launches a window where one can connect to the source and target databases, i.e. the databases whose tables and other objects need to be compared. Once these values are filled out, there is an Options button where one can check what types of database objects need to be compared.



There is a good set of preferences, ranging from tables to server properties. There is a Compare options tab where one can choose options such as whether system objects and statistics indexes need to be included in the comparison. Once these are all set, click the OK button to start the comparison. Once the comparison is complete, a set of tabbed windows appears with the list of objects that have been compared. One can double-click each row to see the details of the comparison, and there is a status column which tells the user the result of the comparison (whether it was the same (green) or not (red)). The freeware version should be good enough to get started; if one needs more advanced features, such as generating schema and data synchronization scripts, one can get the shareware version.

Friday, October 12, 2012

SSRS-Job Monitoring

When one is working in production support handling the monitoring of jobs, one of the key things required is notification of failures/errors. Once notified, the next important aspect is how quickly the errors can be resolved. In today's world of business trying to be lean/agile and execute quickly, turnaround times can be very important. Let us for example take a scenario where one is monitoring jobs (which can have multiple steps, ranging from SQL scripts to SSIS packages) and there are notifications when there are failures. In managed environments only DBAs have access to job failure details, so one would have to depend on the DBAs to get detailed information about failures; of course, building in error logic would help. The developers would then need to act on the failures and put in the appropriate fixes.

One of the ways turnaround times can be improved is by using SSRS reports built on the job tables present in the msdb database. The first SSRS report would be a summary report which provides summary information about the jobs. The data elements for this report would range from job name and frequency to the next run date and time. This report can be generated by a SELECT statement querying the sysjobs and sysjobschedules tables (a sketch of such a query appears after the table list below). One of the important steps in the design of this report is to have a drilldown on the JobName column, the purpose being to get details of the job along with step and failure information. The second report that needs to be built is the job detail report, which provides step information within the job along with the messages associated with each step; this gives the developer valuable information to troubleshoot the problem. The main tables that would be involved are:

msdb.dbo.sysjobs

msdb.dbo.sysjobhistory
msdb.dbo.sysjobsteps
msdb.dbo.sysjobschedules (this focusses on the next run date and time for each job)
msdb.dbo.sysschedules (this has schedules for all the jobs)
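A starting point for the summary report query might look like the sketch below; the exact columns surfaced will vary with reporting needs:

SELECT  j.name  AS JobName,
        s.name  AS ScheduleName,
        js.next_run_date,
        js.next_run_time
FROM    msdb.dbo.sysjobs j
JOIN    msdb.dbo.sysjobschedules js ON js.job_id = j.job_id
JOIN    msdb.dbo.sysschedules s     ON s.schedule_id = js.schedule_id
WHERE   j.enabled = 1;    -- only jobs that are currently enabled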

One of the key tables involved here is sysjobhistory; the message column in this table provides useful information, and the table connects to sysjobs and sysjobsteps based on job_id and step_id. sysjobhistory also contains the run date, run time and run duration; these would need to be formatted based on user needs. Once the query is built out and attached to the report, the detail report needs to be referenced in the hyperlink on the job name in the first report, and it would take the job name as its parameter (a rough sketch follows below). The reports can then be deployed to the SSRS server for the developers to use. One of my team members brought up this idea and it has benefitted our dev team immensely.
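Here is a sketch of the detail report query; the job name value is a made-up example, and run_date/run_time/run_duration come back as integers that still need formatting:

DECLARE @JobName SYSNAME = 'NightlyLoad';    -- report parameter (hypothetical job name)

SELECT  j.name        AS JobName,
        h.step_id,
        st.step_name,
        h.run_status,                        -- 0 = failed, 1 = succeeded
        h.run_date,
        h.run_time,
        h.run_duration,
        h.message                            -- the error/progress text developers need
FROM    msdb.dbo.sysjobhistory h
JOIN    msdb.dbo.sysjobs j        ON j.job_id = h.job_id
LEFT JOIN msdb.dbo.sysjobsteps st ON st.job_id = h.job_id
                                 AND st.step_id = h.step_id
WHERE   j.name = @JobName
ORDER BY h.run_date DESC, h.run_time DESC;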

Friday, October 5, 2012

Datawarehouse-Data Sourcing...

In BI/data warehousing, data is of paramount importance, and how the data is modelled and used for reporting is another important task, since how one uses the data is what really benefits the business. Since I mentioned data, it is critical that the right data, the data that actually needs to be consumed, is brought into a data warehouse. Redundant/unwanted data results in unnecessary usage of space and support, which can drive up the cost of doing business. Data sourcing is an important task in the life cycle of a good data warehouse/BI system. Data could be coming in from multiple sources as per the requirements of the business. It is very important to have a data sourcing strategy, since that will enable technology groups to build more meaningful data warehouse systems. I did not realise how important the paradigm of data sourcing is till I started working on a project. I would like to list the different aspects that need to be considered for data sourcing. The factors listed below are of course going to vary with the type of business one is supporting.

What type of data source is the data being sourced from (Databases/Flat Files/CSV/Spreadsheets/Mainframe/NoSQL Data Sources, to name a few...)

What type of mechanism is being used for data sourcing: is it going to be a pull or a push mechanism? Depending on the type of mechanism being used, the handshake process needs to be clearly established between the source and target systems.

One aspect which is very important is the frequency of the data feeds into the target systems. How well can the target system handle the volume of data coming in? Is there sufficient capacity to handle the load?

It is very important that the users who are going to consume the data be engaged in data sourcing activities, since it is ultimately for the business that the data is sourced. There needs to be very good analysis of requirements, which would enable the technology group to determine what attributes need to be fed into the system. This is the stage where a lot of data mapping exercises could potentially be performed.

It is important to determine what the needs of the users are: is it going to be real-time reporting or archival-type reporting? This is going to be closely linked with the frequency of the incoming data and the type of database architecture in place. As one can see, the different aspects of database and system architecture are closely related to each other.

Since we are dealing with data transport, one needs to factor in what happens when there are errors and how one recovers from them. There also needs to be a focus on data reconciliation and on how the source system would support it.

I have listed some of the key points related to data sourcing; there is always room for further discussion of this topic, which I will take up in another blog post.

Tuesday, October 2, 2012

SQL Server Database Testing...

One of the tasks that is vital once development is complete on a database project is testing. There are different types of testing on database development projects that involve stored procedures, functions, views and tables. Initially a lot of the testing was based on SQL scripts developed for different scenarios, and a lot of the database testing which I did was done primarily using SQL scripts. These days lots of development shops have adopted agile frameworks, wherein the testers and developers need to be closely coordinated. Recently, tools are being developed for database testing which provide a lot more options and flexibility. One of the tools that I came across was SQL Test from Red Gate Software. I have used tools developed by Red Gate before and have found them very useful and efficient. Here is the link for SQL Test: http://www.red-gate.com/products/sql-development/sql-test/. The main purpose of the SQL Test tool is to develop unit test scenarios and execute them in a user-friendly manner. As quoted by Red Gate: With SQL Test you can discover defects much earlier in the development cycle and make continuous integration, agile development, and test-driven development achievable goals. SQL Test is powered by tSQLt, an open-source unit testing framework for SQL Server. One of the features which I like is the ability to check the test cases into SQL Source Control, which is another offering from Red Gate. SQL Source Control can be connected to the following:

SVN
Vault
TFS
Perforce
Git
Mercurial
All others

Please use the link http://www.red-gate.com/products/sql-development/sql-source-control/ for further information. As database projects continue to grow in complexity, testing of database development projects has become all the more important so that the integrity and accuracy of the data is maintained.
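To give a flavour of the tSQLt style that SQL Test builds on, here is a minimal sketch of a unit test; the Orders table, the GetOrderCount function and the test names are invented for illustration:

EXEC tSQLt.NewTestClass 'OrderTests';
GO
CREATE PROCEDURE OrderTests.[test GetOrderCount returns zero when there are no orders]
AS
BEGIN
    -- FakeTable replaces dbo.Orders with an empty, constraint-free copy so the test runs in isolation
    EXEC tSQLt.FakeTable 'dbo.Orders';

    DECLARE @actual INT = dbo.GetOrderCount(42);   -- hypothetical function under test

    EXEC tSQLt.AssertEquals @Expected = 0, @Actual = @actual;
END;
GO
EXEC tSQLt.Run 'OrderTests';
Because the real table is faked out, the test neither depends on nor disturbs existing data, which is what makes this style practical for continuous integration.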