With the change in leadership at the top, there has been a big push for cloud-related offerings from Microsoft. One of the components of the cloud suite is SQL Azure, the cloud-based offering of the SQL Server database system. With the advent of smartphones and the increased use of apps on phones, there is a growing need for cloud-based databases. A lot of folks are using different cloud-based database offerings; one of my colleagues has launched an app which uses a SQL Azure database. It is possible that folks could use SQL Server initially to develop the database and then copy the database over to SQL Azure. Here is a link on how to use SSIS to copy a SQL Server database to SQL Azure.
To quote the article: When migrating a SQL Server database to Microsoft Azure SQL Database, the SQL Server Import and Export Wizard is an easy way to create a SQL Server Integration Services package to transfer data. The package can then be modified to add more robust error handling and retry logic.
https://msdn.microsoft.com/en-us/library/azure/jj156152.aspx
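The "retry logic" the article recommends adding to a migration package is a generic pattern worth sketching. A minimal Python illustration of retrying a transient-failure-prone operation with exponential backoff (the function and parameter names here are illustrative, not part of any SSIS or Azure API):

```python
import random
import time

def with_retry(operation, max_attempts=4, base_delay=1.0):
    """Run an operation, retrying with exponential backoff on transient errors.

    A generic sketch of the retry logic one would bolt onto a data-transfer
    step; ConnectionError stands in for whatever transient fault the real
    destination raises.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of retries; surface the error to the caller
            # Exponential backoff with a little jitter between attempts.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)
```

In an actual SSIS package this shape would live in a script task or in the control flow's failure/retry path rather than in Python.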
Monday, June 22, 2015
Thursday, August 7, 2014
SSIS Parameters - SQL Server
Given the nature of today's ETL operations, there are a lot of complex SSIS packages being built to handle different aspects of data processing. One of the key concepts involved with multiple SSIS packages is the passing of parameters between the packages. My friend Pinal Dave, a SQL Server expert, has brought out an excellent post on this in his famous SQL Server blog. Here is the link to the article.
http://blog.sqlauthority.com/2014/08/07/sql-server-ssis-parameters-in-parent-child-etl-architectures-notes-from-the-field-040/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+SQLAuthority+%28Journey+to+SQL+Authority+with+Pinal+Dave%29
Enjoy reading the article; it is very informative.
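The essence of the parent-child architecture the article covers is that the parent package owns the configuration and passes values (connection strings, batch IDs, file paths) down to child packages. A rough, language-agnostic sketch of that idea in Python (the names here are purely illustrative, not SSIS API calls):

```python
def run_child_package(parameters):
    """Stand-in for a child SSIS package: it only sees what the parent passes in."""
    server = parameters["ServerName"]
    batch_id = parameters["BatchId"]
    return f"loaded batch {batch_id} against {server}"

def run_parent_package():
    """Stand-in for the parent package: it owns the configuration and
    forwards the relevant parameters to each child it executes."""
    config = {"ServerName": "etl-sql-01", "BatchId": 42}
    # In SSIS this would be an Execute Package Task with parameter bindings;
    # here it is just a function call carrying the same values.
    return run_child_package(config)
```

The payoff of the pattern is the same in both worlds: the child stays environment-agnostic and reusable because everything environment-specific flows in from above.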
Friday, August 1, 2014
Data Integration
One of the areas I have increasingly been working on these days is data integration. Businesses have a variety of database systems, and it is very rare to find a company running on a single database system or platform; there is an increasing need for different database systems and architectures to co-exist. The battle at times becomes very political when one database/BI platform is to be replaced with another in its entirety. This is where data integration comes in, and there are a lot of tools available in the market today in this space. The Gartner report for data integration tools is available now. Based on the report, Microsoft is in the Challengers quadrant; the tools that make up its data integration framework, per the study, are SSIS and BizTalk Server. I have used SSIS a lot, but only very rarely have I come across BizTalk Server in my projects. The reason Microsoft is in the Challengers space is the lack of breadth of functionality. Quoting the report: "Although Microsoft addresses core data integration requirements, non-bulk/batch data integration styles remain a gap in its offering relative to market demand. Microsoft's product strategy for aligning capabilities to data movement, transformation and orchestration (referred to as "information production") aims to broaden its market positioning." The leaders in the data integration space are 1. Informatica 2. IBM 3. Oracle. Oracle's main data integration tools are Oracle Data Integrator (ODI), Oracle Data Service Integrator, Oracle GoldenGate and Oracle Warehouse Builder (OWB); one of the weaknesses of the Oracle integration suite is the lack of available skill sets for these tools. Companies like Cisco (Composite Studio) and Actian fall in the Visionaries category, Cisco being a new entrant in the data information management market. Adeptia and Syncsort fall in the Niche Players category. For the complete report, please use the link below:
http://www.gartner.com/technology/reprints.do?id=1-1Y495V6&ct=140724&st=sb&mkt_tok=3RkMMJWWfF9wsRonv6TNe%252B%252FhmjTEU5z16e4qWqa3lMI%252F0ER3fOvrPUfGjI4DT8pgNK%252BTFAwTG5toziV8R7HNJc160s8QXBjm
The domain of data integration encompasses a wide variety of services which are increasing by the year.
Sunday, October 20, 2013
SQL Saturday - ETL, CDC and Hadoop
Finally the day arrived, and I was excited to attend the SQL Saturday event in Charlotte on October 19th. The arrangements were very well done; I really liked the idea of the SpeedPASS to speed up the registration process. There were a good number of vendors, from product companies to consulting firms, and the schedule was pretty impressive in that it covered a broad range of topics. My first session was about ETL and ELT by Paul Rizza. It covered a wide range of topics in SSIS with respect to the pitfalls one needs to watch out for and the tasks and logic in SSIS that can drag performance. One interesting property was how to indicate that the records are already sorted at the source, and how this helps prevent SSIS from performing a sort before a Merge operation; the sorted-at-source flag is set in the source connection's advanced properties. There was another interesting discussion about the RetainSameConnection property: it works well except in cases of parallel execution, where the same connection cannot be utilised. There were interesting tidbits about the SSISDB catalog and what to watch out for when moving between servers, and a good discussion about Raw file types, which are strongly typed. One of the biggest advantages of using the Raw file source is that the files are a lot smaller, and the strongly typed nature of the files means SQL Server does not have to perform data type validation before doing a bulk insert; this was a neat feature for me. The performance issue with the For Each Loop container was discussed in detail, especially how many logging operations happen when an Execute SQL Task is used inside a For Each Loop container; one good replacement for a For Each Loop container would be a Data Flow task. There was also a lot of good discussion about memory usage by SSIS, a look into the Lookup task, especially with respect to Partial Cache and No Cache modes, and the pitfalls of using command destinations and user-defined functions. The session was filled with interesting tidbits about SSIS.
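The sorted-at-source point generalises beyond SSIS: when both inputs are guaranteed sorted, a merge can stream rows without buffering and re-sorting them. A small Python sketch of the idea (heapq.merge trusts that its inputs are already sorted, much as SSIS's Merge transformation trusts the IsSorted flag):

```python
import heapq

def merge_sorted_sources(source_a, source_b):
    """Merge two already-sorted row streams into one sorted stream.

    Because the inputs are declared sorted, no extra sort (and no full
    in-memory buffer) is needed -- the same reason marking a source as
    sorted lets SSIS skip a Sort transformation before a Merge.
    """
    return list(heapq.merge(source_a, source_b))
```

If either input were not actually sorted, the output would be wrong rather than slow, which is exactly the caution that applies to setting the sorted flag on an SSIS source.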
The second session was about Hadoop; with Hadoop gaining a lot of traction, the session was fully packed. The presentation by Rick Hallihan and Bill Carroll was very informative. It began with an overview of Hadoop, MapReduce and the tools available in the Hadoop universe, then moved on to how Microsoft is adopting Hadoop through HDInsight in collaboration with Hortonworks. There was a word count example demonstrated in Hadoop, and then the same example repeated in C#/.NET within Visual Studio 2013; there are class libraries in .NET which support the Hadoop MapReduce functions. The main part of the session was creating a Hadoop cluster within Windows Azure using HDInsight: the cluster, with 4 nodes, was set up in Windows Azure in 4 clicks. There were comparisons made between the SQL and big data worlds; overall a very impressive session.
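For readers who have not seen it, the classic word count example maps each word to a count of 1 and then reduces by key. A minimal Python sketch of the same map/shuffle/reduce shape that was demonstrated in the session (single-process, so only an illustration of the data flow, not of Hadoop's distribution):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group the pairs by word and sum the counts."""
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

def word_count(lines):
    return reduce_phase(map_phase(lines))
```

In real MapReduce the map outputs are partitioned across nodes and each reducer sees one key group; here the grouping is simply a dictionary.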
The third session I attended was about CDC (Change Data Capture), introduced in SQL Server 2008. The CDC feature is very useful for auditing and data warehousing, especially SCD build-outs. The difference between CDC and CT (Change Tracking) is that Change Tracking captures only the fact that a row changed; no history of the changed data is maintained. CDC works only on tables that have primary keys, and the source database has to be a SQL Server database. There was a good example demonstrated where records were deleted by mistake from a table enabled for CDC, and how one can recover the deleted records using the CDC tracking table; had the CDC feature not been there, this would have been a perfect storm. The session was presented by Jose Chinchilla (http://www.sqljoe.com).
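The recovery demo can be mimicked in miniature: if every change to a table is journalled along with the deleted row's values, an accidental delete can be undone from the journal. A toy Python sketch of that CDC-style pattern (an illustration of the concept only, not SQL Server's actual cdc.* change tables):

```python
def delete_rows(table, change_log, predicate):
    """Delete matching rows, but journal each deleted row first,
    the way a CDC change table captures the old row image."""
    kept, removed = [], []
    for row in table:
        (removed if predicate(row) else kept).append(row)
    change_log.extend({"op": "delete", "row": row} for row in removed)
    return kept

def recover_deletes(table, change_log):
    """Replay the journalled deletes to restore the lost rows."""
    restored = [entry["row"] for entry in change_log if entry["op"] == "delete"]
    return table + restored
```

The same replay idea is what makes the real feature useful for both auditing and disaster recovery: the change table holds enough of the old row image to reconstruct it.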
Monday, May 13, 2013
SSIS vs. Stored Procedures/T-SQL
One of the interesting debates that comes up while discussing an SSIS implementation/strategy is whether to use SSIS as a shell and have the majority of the transformation and loading work done by SQL/stored procedures. The second option is to leverage the features of the SSIS package and reduce the amount of T-SQL/stored procedures. In order to arrive at an optimum solution there are certain key points that need to be discussed. The first would be the requirements for the SSIS solution, which include the following sub-categories.
1. What would be the frequency of data loads into the target system?
2. What type of transformation is required?
3. What is the requirement for the availability of data to the users?
4. What type of hardware is available at this point, and how does it play into the requirements?
It is very important to understand the capabilities of the tools available while building the SSIS solution; each piece of the ETL toolset has its own capabilities, positives and negatives, and it is very important to understand them.
SQL/stored procedures are very good at performing set-based operations and CRUD operations. SSIS is very good at parallel processing and string-based transformations, has a lot of functions for different types of operations based on data types, and is graphical in nature. It is also very important to perform a lot of testing during an SSIS implementation to get a good understanding of performance in the given server/hardware environment. With respect to transformations, get a good understanding of the complexity of the transformations required by the ETL solution. It would also be highly recommended to look at the metadata of the data sources and destinations and see if any design patterns can be arrived at. From a developer's viewpoint it is important to understand the level of code maintenance involved.
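The set-based point can be made concrete with a toy comparison: a set-based update describes the whole result in one statement, while row-by-row processing touches each record individually. A Python/sqlite3 sketch of the two styles (sqlite3 stands in for SQL Server here, and the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)",
                 [(10.0,), (20.0,), (30.0,)])

# Set-based: one statement updates every qualifying row at once --
# the kind of work stored procedures/T-SQL excel at.
conn.execute("UPDATE orders SET amount = amount * 1.1 WHERE amount >= 20")

# Row-by-row: fetch and process each record individually -- the shape of
# work (string cleanup, lookups, per-row routing) that suits a pipeline
# tool like SSIS better than a SQL cursor loop.
for (row_id, amount) in conn.execute("SELECT id, amount FROM orders").fetchall():
    pass  # per-row transformation logic would go here
```

The rule of thumb the debate usually lands on: push set arithmetic to the database engine, and keep the per-row, cross-source, or string-heavy work in the ETL tool.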
In case you want to get more insight into SSIS best practices, please look into this discussion on LinkedIn:
http://www.linkedin.com/groupAnswers?viewQuestionAndAnswers=&discussionID=236893667&gid=59185&commentID=137243006&goback=%2Egmr_59185&trk=NUS_DISC_Q-subject#commentID_137243006
There are very good points listed in the link above by different SQL experts; it is worth checking out.
Friday, September 9, 2011
Fuzzy Lookup...SSIS...
Recently I was working on a project where there was a requirement to check the quality of data before uploading it to SQL Server. There are a couple of tasks in the Data Flow component of SSIS which allow one to scrub the input data: the Fuzzy Lookup and Fuzzy Grouping tasks. The data source for the SSIS package was a text file containing a list of contacts that needed to be scrubbed and uploaded to the database; the contact file could have names that are misspelt or contain bad characters. As per the business, a certain tolerance and confidence level can be set, based on which the misspelt records are uploaded or rejected. This is where the Fuzzy Lookup task comes into play. I have enclosed a sample Fuzzy Lookup editor:
In this SSIS package, the records are sent through the Fuzzy Lookup task to find a matching contact in the database. The output of the Fuzzy Lookup task can then be directed to a Conditional Split task. There are certain columns built into the Fuzzy Lookup output that are available to the Conditional Split task: _Similarity and _Confidence. Based on the values of these two columns and the business rules, records can be inserted into the database or rejected.
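The idea behind the _Similarity score and the threshold-based split can be approximated with any string-similarity measure. A small Python sketch using difflib (a stand-in illustration only; SSIS's Fuzzy Lookup uses its own token-based matching algorithm, not difflib):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0.0-1.0 similarity score between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def split_by_similarity(candidates, reference, threshold=0.8):
    """Mimic a Conditional Split on a _Similarity-style column:
    route each candidate name to 'accept' or 'reject' based on its score
    against the reference value from the lookup table."""
    accept, reject = [], []
    for name in candidates:
        (accept if similarity(name, reference) >= threshold else reject).append(name)
    return accept, reject
```

The threshold plays the role of the business-defined tolerance: raise it and fewer misspelt records get through; lower it and more borderline matches are accepted.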
Thursday, January 13, 2011
SSIS - Access Mode...
I was working on a fairly simple SSIS package; as part of the package I had to move data from one SQL Server database to another. I used a Data Flow task to do a simple copy of the data using an OLE DB Source and Destination. When I executed the package, I found that the performance of the data copy was very slow, which was puzzling because there were only around 1000 rows to copy and the table was not very wide. Based on research from other blogs, there is a property called Access Mode on the OLE DB Destination data flow component. I set the access mode to OpenRowset Using FastLoad and re-ran the package; the data flow performed the copy much faster and the package completed in much less time.
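The difference fast load makes can be felt with any database API: inserting and committing each row individually pays per-row overhead, while a bulk/batched insert amortises it across the whole set. A Python/sqlite3 sketch of the two shapes (sqlite3 is only a stand-in for the OLE DB destination here):

```python
import sqlite3

def insert_row_by_row(conn, rows):
    """Slow shape: one statement and one commit per row
    (analogous to the default, non-fast-load access mode)."""
    for row in rows:
        conn.execute("INSERT INTO contacts (name) VALUES (?)", (row,))
        conn.commit()

def insert_fast_load(conn, rows):
    """Fast shape: one batched insert and a single commit
    (analogous in spirit to OpenRowset Using FastLoad)."""
    conn.executemany("INSERT INTO contacts (name) VALUES (?)",
                     [(row,) for row in rows])
    conn.commit()
```

Even at 1000 rows the per-row round trips add up, which is why flipping this one destination property changed the package's run time so noticeably.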
Sunday, September 19, 2010
SQL Saturday - Raleigh...
I had the opportunity to attend the SQL Saturday event held in Raleigh on Saturday, September 18th. It was a great learning experience listening to the speakers (all of them Microsoft MVPs) deliver interesting lectures on varied areas of SQL Server. My first session was with Rafael Salas (MVP) on why we should care about dimensional modelling; the lecture was an introduction to dimensional modelling, star schemas and Kimball methodologies. The second session I attended was by Tim Chapman (MVP) on dynamic SQL. The speaker did a great job with demos, which made the session very interactive. He showed how dynamic SQL performs and how to minimise SQL injection attacks while using it; one of his examples was a web app doing dynamic searches, and the dynamic SQL he showed was really neat and picked up very good execution plans. The next session was on SQL Server 2008 SSRS new features by Mike Davis (Pragmatic Works): lots of cool demos on maps, data bars and sparklines, report parts and shared datasets. During lunch there was a presentation by Red Gate on a new product of theirs called SQL Source Control, a very nice product that works with TFS and SVN source control systems. After lunch the first session was on identifying performance issues and fixing them; a very nice interactive session with a lot of questions from developers. Following this I attended Andy Leonard's (MVP) session on SSIS design patterns: a great lecture on how one should go about developing an SSIS framework for the enterprise, with neat demos on performing incremental loads in SSIS packages and a discussion of variable scope and auditing. The final session was on advanced T-SQL by Kevin Boles (MVP), solving problems using SQL with some very innovative approaches to problem solving.
Monday, May 4, 2009
SSIS 2008...
One of the challenges I was facing while developing SSIS packages was how to move the packages through the different environments. There is an Enable Configurations option in the SSIS designer/BIDS, and there are different kinds of configurations available. They are:
XML file Configuration
Parent Package Variable
Environment Variable
Registry Entry
SQL Server

The options I chose were the XML configuration and the SQL Server method. With the XML configuration option, in the configuration wizard one has to choose the location for the XML file; once that is done, in the next screen the properties for the different connection objects can be saved. This would typically contain the connection strings for the different databases. In the final screen one can see all the different properties that are going to be saved in the XML file. Once this is complete, the XML file contains the important properties needed for the package, and when the package is executed it uses the XML file to set the different connections. While moving across environments, one has to create an XML file for QA and production with the connection strings modified accordingly.
There are various other blogs across the web which describe the SQL Server method of configuration; in that case the configurations are stored in a table within a SQL Server database.
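The XML configuration pattern boils down to: keep the environment-specific values in a file alongside the package and resolve them at run time. A small Python sketch of reading such a per-environment file (the layout and element names here are invented for illustration; the real .dtsConfig schema differs):

```python
import xml.etree.ElementTree as ET

# Hypothetical QA-environment config file contents.
SAMPLE_CONFIG = """
<configuration>
    <connection name="SourceDB">Server=qa-sql-01;Database=Staging</connection>
    <connection name="TargetDB">Server=qa-sql-02;Database=Warehouse</connection>
</configuration>
"""

def load_connection_strings(xml_text):
    """Parse the config XML into a {name: connection_string} mapping,
    the way a package resolves its connections from a config file at startup."""
    root = ET.fromstring(xml_text)
    return {c.get("name"): c.text.strip() for c in root.findall("connection")}
```

Swapping environments then means swapping the file, not editing the package: the QA and production files carry the same connection names with different server values.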