Tuesday, December 31, 2013

Happy New Year - 2014...

The year 2013 saw a lot of changes and new entrants in the areas of BI, Big Data, Cloud, and Mobile Analytics. These are exciting times, and I hope the new year 2014 offers more clarity and direction in this domain. Wishing everyone and all the blog readers a happy and prosperous New Year 2014. I hope to write more insightful articles in the coming year, and I am trying to motivate myself to provide better content. Have fun and be safe.

Wednesday, December 25, 2013

Big Data and Hadoop...

Looking ahead to 2014, there is a lot of discussion about which technologies will take off: Cloud, Mobile, Big Data, and Self-Service BI. It remains to be seen whether Big Data takes the leap in 2014 and whether we see more adoption of Big Data-related technologies. One of the most important aspects of Big Data is to understand what Big Data actually is and where the technologies are applicable. There is a lot of hype around Hadoop, so it is very important to understand the nature of the requirements and then decide whether Hadoop is relevant or not. Please use the following link for a good tutorial on Hadoop and the related tool set. HDFS is not a good fit in situations where there are:
1) Lots of small files,
2) Low-latency data access requirements,
3) Multiple writers.

Hope the readers will find the above video useful. There is also another series on Big Data published by SQL Server expert and friend Pinal Dave on his blog: http://blog.sqlauthority.com/2013/10/30/big-data-learning-basics-of-big-data-in-21-days-bookmark/. It is a 21-part series, well researched and presented. Pinal is one of my favorite authors; he was a big part of the reason why I decided to start a blog on SQL Server BI.

Friday, December 20, 2013

Power BI....

There are some interesting trends to look forward to in the BI space, with advances being made in Cloud BI offerings from different vendors. There is also a possibility, as discussed by leading BI experts, that traditional BI implementations could slow down in 2014, with more opportunities for newer types of BI implementations such as Self-Service BI. Keep in tune with the cloud-based offerings: Microsoft is offering a suite of products with the cloud version of Office, Office 365, and is providing a set of tools that interact with data in Office 365 and Excel. As per Microsoft, "It's easier than ever to transform complex data into meaningful insights and share them with your team. Power BI for Office 365 is a powerful new way to visualize and interact with data in Excel and Office 365. Search, discover, and with just a few clicks, analyze, shape, and transform your data into stunning interactive visualizations that uncover hidden insights. Work, collaborate, and share from anywhere, on any device with Power BI for Office 365".


Power BI offers a set of tools such as:
PowerPivot - to model and analyze data.
PowerView and PowerMap - to visualize data.

Power BI also provides the ability to access data anywhere and to share and collaborate across different teams. The Power BI team at Microsoft also maintains a blog at the site below:

It will be really interesting to see how companies will adopt the Power BI set of tools from Microsoft.

Monday, December 16, 2013

Bulk Copy/Bulk Insert - Data Sourcing

Recently I have been working on a project which involves data being sourced to SQL Server. As part of the handshake between the source and the target SQL Server system, a log file had to be read to get information about how many records were pushed to the different tables, as required by the project. To consume the log file text into a table in SQL Server, I was using the BULK INSERT command, which was working fine.

One fine day the data sourcing process failed, because the logic in the job could not identify the row counts present in the log file content. As I dug deeper, I found that the log file data was getting internally ordered by a timestamp value present at the beginning of each line; this internal sorting was happening because of the BULK INSERT command. The issue was that when the day crossed from 11:59 PM to 12:01 AM, the contents of the log file got jumbled up in the table, which distorted the source ordering.

As an alternative solution we tried using BCP to consume the log file (text file content) into the table. One of the issues crucial for us was to maintain the ordering present in the log file and not tamper with that sequencing. When we built the BCP solution, the way to be absolutely sure that we were reading the file in a forward fashion was to use the [-b batch_size] parameter of the bcp command. By default, BCP will consume the source log file as one big batch. Since we wanted to maintain the ordering of the log file, I set the batch size to 1, so the parameter -b 1 was added to the bcp command used to source the log file. Once this was set, the log file contents were read into the SQL Server table one line at a time, and the ordering of the log file at source was maintained. This helped my process find the counts from the log file and validate the data in the target.
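For context, the original load was a plain BULK INSERT along these lines. This is only an illustrative sketch; the table name, file path, and terminator options are placeholders, not the actual project settings:

```sql
-- Hypothetical sketch of the original approach: pull the whole log file
-- into a staging table in one operation. Names and options are placeholders.
BULK INSERT dbo.LogFileLines
FROM 'D:\logs\load_log.txt'
WITH (ROWTERMINATOR = '\n');
```

Note that BULK INSERT makes no guarantee about preserving the physical order of rows in the target table, which is what surfaced as the midnight ordering problem described above.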
It was kind of challenging to figure out why the data was being sorted, and also to use bulk data import tools to read the source data one record at a time; performance was not affected since the log file was not huge. It is also important to test as much as possible when one is trying to integrate with or source from another system.
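The bcp workaround can be sketched as a single command line. Again, the server, database, table, and file names here are placeholders for illustration; only the -c and -b 1 switches are the point:

```
rem Hypothetical example: load the log file one row per batch so that
rem the file's line order is preserved in the target table.
rem -S server, -T trusted connection, -c character mode, -b 1 batch size of 1.
bcp StagingDB.dbo.LogFileLines in "D:\logs\load_log.txt" -S MySqlServer -T -c -b 1
```

The trade-off is speed: committing one row per batch is much slower than a single large batch, which was acceptable here only because the log file was small.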

Tuesday, December 10, 2013

Data Integration-Data Readiness

Currently I have been working on a lot of Data Integration projects, which involve a lot of sourcing of data from various systems. While sourcing data, one of the key aspects is to find out whether the data is ready at the source. There are also situations where other systems could be using your system as a source. In both situations it is very important to come up with a Data Readiness strategy. It is extremely important to have a good sourcing strategy, since there are SLAs to be met and the business needs data more than ever before to make good decisions. What is involved in Data Readiness? The source systems need to come up with a plan to expose data readiness indicators once all the objects/tables have been updated. In my projects the Data Readiness is table driven: normally there is a DataReadiness table that tracks, for each load, attributes such as the last UpdateDatetime and a CompleteFlag indicator.
When the source tables are updated with data and all the data-related operations are complete, a process updates the DataReadiness table, setting the above attributes and the CompleteFlag indicator = 'Y'. The systems which need data from the source keep polling the DataReadiness table to check the completed status and also the UpdateDatetime. The reason to check the UpdateDatetime column is to make sure that the system pulling data from the source is getting the most recent update on the source system. The Data Readiness layer provides a level of abstraction, in the sense that systems requiring data need not check individual source tables. The DataReadiness table can be used to indicate when the jobs need to run to pull the actual data. In case target systems are lagging behind with respect to the data update, the Data Readiness layer can be used to catch up with the latest updates on the source systems.
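A minimal sketch of this pattern in T-SQL might look as follows. The table and column names here are illustrative assumptions, not the exact schema from my projects:

```sql
-- Illustrative sketch only: names and types are assumptions.
CREATE TABLE dbo.DataReadiness
(
    SourceTableName  varchar(128) NOT NULL,
    UpdateDatetime   datetime     NOT NULL,
    CompleteFlag     char(1)      NOT NULL DEFAULT 'N'  -- 'Y' once the load finishes
);

-- Source side: mark a table's load complete once all operations finish.
UPDATE dbo.DataReadiness
SET    CompleteFlag = 'Y', UpdateDatetime = GETDATE()
WHERE  SourceTableName = 'Orders';

-- Target side: poll for loads that are complete and newer than the last pull.
SELECT SourceTableName, UpdateDatetime
FROM   dbo.DataReadiness
WHERE  CompleteFlag = 'Y'
  AND  UpdateDatetime > @LastPullDatetime;
```

The polling query is the abstraction point: target systems only ever read DataReadiness, never the individual source tables, and the @LastPullDatetime comparison lets a lagging target catch up on any updates it missed.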