In BI and data warehousing, data is of paramount importance, and how that data is modelled and used for reporting is another important task, since the way the data is used is what ultimately benefits the business. Since I mentioned data, it is critical that only the right data, the data that actually needs to be consumed, is brought into a data warehouse. Redundant or unwanted data takes up space and support effort unnecessarily, which drives up the cost of doing business. Data sourcing is an important task in the life cycle of a good data warehouse/BI system. Data can come in from multiple sources, depending on the requirements of the business. It is very important to have a data sourcing strategy, since that enables technology groups to build more meaningful data warehouse systems. I did not realise how important data sourcing is until I started working on a project. I would like to list the different aspects that need to be considered for data sourcing. The factors listed below will of course vary with the type of business one is supporting.
What type of source is the data coming from? (Databases, flat files, CSV files, spreadsheets, mainframes and NoSQL data sources, to name a few.)
What type of mechanism is being used for data sourcing: is it going to be a pull mechanism or a push mechanism? Depending on the mechanism, the handshake process between the source and target systems needs to be clearly established.
One very important aspect is the frequency of the data feeds into the target system. How well can the target system handle the volume of data coming in? Is there sufficient capacity to handle the load?
It is very important that the users who are going to consume the data are engaged in data sourcing activities, since the data is ultimately sourced for the business. There needs to be a thorough analysis of requirements, which enables the technology group to determine what attributes need to be fed into the system. This is the stage where a lot of the data mapping exercises are likely to be performed.
It is important to determine what the users need: is it going to be real-time reporting or archival reporting? This is closely linked to the frequency of the incoming data and the type of database architecture in place. As one can see, the different aspects of database and system architecture are closely related to each other.
Since we are dealing with data transport, one needs to factor in what happens when there are errors and how one recovers from them. There needs to be a focus on data reconciliation and on how the source system would support it.
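To make the handshake and reconciliation points a little more concrete, here is a minimal Python sketch of one possible approach: the source system ships a small manifest (row count plus checksum) alongside each feed, and the target verifies the feed against it before loading. The function names `build_manifest` and `reconcile`, the manifest fields and the sample CSV feed are all my own assumptions for illustration, not a standard or a specific product's interface.

```python
import csv
import hashlib
import io


def build_manifest(payload: bytes) -> dict:
    """Source side: describe the feed so the target can verify it.

    Hypothetical manifest layout: data row count plus a SHA-256 checksum.
    """
    rows = list(csv.reader(io.StringIO(payload.decode("utf-8"))))
    return {
        # Exclude the header row; an empty feed counts as zero rows.
        "row_count": max(len(rows) - 1, 0),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


def reconcile(payload: bytes, manifest: dict) -> list:
    """Target side: return a list of discrepancies (empty means accept the feed)."""
    problems = []
    actual = build_manifest(payload)
    if actual["sha256"] != manifest["sha256"]:
        problems.append("checksum mismatch")
    if actual["row_count"] != manifest["row_count"]:
        problems.append(
            "row count mismatch: expected "
            f"{manifest['row_count']}, got {actual['row_count']}"
        )
    return problems


# Sample feed as the source system would send it (made-up data).
feed = b"id,amount\n1,10.50\n2,99.00\n"
manifest = build_manifest(feed)

# Same feed, but with the last row lost in transport.
truncated = b"id,amount\n1,10.50\n"

print(reconcile(feed, manifest))       # intact feed: no discrepancies
print(reconcile(truncated, manifest))  # truncated feed: both checks fail
```

If the list of discrepancies is not empty, the target can reject the feed and ask the source to resend it, which is one simple way of recovering from a transport error. This only works if the source system agrees to produce the manifest, which is why source system support for reconciliation needs to be settled early.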
I have listed some of the key points related to data sourcing. There is always room for further discussion on this topic, which I will take up in another blog post.