Sunday, October 20, 2013

SQL Saturday - ETL,CDC and Hadoop

Finally the day arrived and I was excited to attend the SQL Saturday Event in Charlotte on October 19th. The arrangements were very well done, really liked the idea of Speedpass to make the registration process. There were a good number of vendors from product companies to consulting firms. The schedule was pretty impressive in the sense that it was broad range of topics, My first session was about ETL and ELT by Paul Rizza. The topic covered a wide range of topics in SSIS with respect to the pitfalls that one need to watch out for, the tasks and logic in SSIS which can drag performance. The interesting property about how to indicate that the records are sorted at source and how this would help prevent SSIS from performing sort before a Merge Operation. The property sort at source is indicated in the source connection advnaced properties. There was an another interesting discussion about the RetainSameConnection property. This property works well except in cases of parallel execution where the same connection cannot be utilised. There were interesting tidbits about the SSISDB catalog what to watch out for when moving between servers. There was a good discussion about Raw file types which are storngly typed. One of the biggest advantages using the Raw file type source is that the files size are a lot smaller and the strongly typed nature of the files helps SQL Server not perform data type validation before doing a bulk insert. This was an neat feature for me. The performance issue with For Each Loop Container was discussed in detail, especially when one used a For Each Loop container and uses Execute SQL Task within, how many logging operations happen. One of the good replacements for a For Each loop container would be a Data flow task in SSIS. There was a lot of good discussion about memory usage by SSIS, also a look into the Lookup tasks especially with respect to Partial Cache and Partial No Cache. The pitfalls of using command destinations and user defined functions. The session was filled with interesting tidbits about SSIS.
The second session was about Hadoop, with hadoop gaining a lot traction, the session was fully packed. The presentation by Rick Hallihan and Bill Carroll  was very informative, the session began with an overview of Hadoop, MapReduce and the tools available in the Hadoop universe, then the topic move on to How Microsoft is adopting Hadoop through HD insight collaboration with Hortenworks. There was a word count example demonstrated in Hadoop and then the same repeated with a C# .Net example within visual Studio 2013. There are class libraries present in .NET which support the Hadoop Mapreduce functions. The main part of the session was to create Hadoop cluster within Windows Azure using HDInsight. The Hadoop cluster was set up in windows azure in 4 clicks, the cluster had 4 nodes. There were comparisons made between the SQL and Big data world, overall a very impressive session.
The third session i attended was about CDC (Change Data Capture) present in SQL Server 2008. The CDC feature is very useful for auditing, datawarehouse especially SCD build out. The change between CDC and CT (Change Tracking) is that with Change tracking feature only the row change information is captured and there is no history maintained with respect to Change Tracking feature. CDC will work on tables that have Primary keys only and the source database has to be a SQL Server Database. There was good example demonstrated where there was a delete of records made on a table which is enabled for CDC by mistake and how using the CDC tracking table one can recover the deleted records, in case CDC feature was not there then this would be a perfect storm. The session was presented by Jose Chinchilla (http://www.sqljoe.com).

No comments:

Post a Comment