Data warehousing is the practice of integrating related data from your data sources, capturing change history, grouping entities by subject area and retaining the result in a physical data store.
Two key philosophies have emerged. The original philosophy, championed by Ralph Kimball and Bill Inmon, adheres to the definition of data warehousing above and delivers a data warehouse that contains a ‘Single version of the truth’, i.e. any system which uses the data in the data warehouse will produce consistent results.
A more recent philosophy, championed by Dan Linstedt, holds that there is no such thing as a ‘Single version of the truth’, since business stakeholders have different perspectives, i.e. there are multiple versions of the truth. On this view, all that can be achieved is a ‘Single version of the Facts’: we should collect data from all of the data sources and store it, unintegrated, in an enhanced enterprise data hub. A data mart (within a data virtualisation layer) can then be produced for each business function, mirroring its perspective on what the business truth is.
In order to provide data in a form suitable for reporting, the following stages need to take place:
- Acquire – Data needs to be captured from the data sources.
- Clean – Data needs to be standardised and duplicates removed.
- Integrate – Data from multiple data sources needs to be integrated.
- Add Business Logic – Raw data captured from source needs additional business logic applied. For example, each business function will have its own ways of measuring its performance, so we need to create calculated measures so that the function’s performance can be reported on. Similarly, some of the attributes familiar to the business may need to be derived from the raw attributes captured from the data source, and data hierarchies added. (A sketch of all four stages follows this list.)
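To make these stages concrete, here is a minimal, hypothetical sketch that runs all four against an in-memory SQLite database. Every table and column name (raw_customers, stg_customers, stg_orders, dm_customer_revenue) is illustrative rather than part of any methodology discussed below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Acquire: land the source data as-is in a raw table.
conn.execute("CREATE TABLE raw_customers (id INTEGER, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO raw_customers VALUES (?, ?, ?)",
    [(1, " Jane Doe ", "uk"), (1, "Jane Doe", "UK"), (2, "Sam Lee", "US")],
)

# Clean: standardise values, then remove the duplicates this exposes.
conn.execute("""
    CREATE TABLE stg_customers AS
    SELECT DISTINCT id, TRIM(name) AS name, UPPER(country) AS country
    FROM raw_customers
""")

# Integrate: combine with a second source on the shared business key
# (a single join stands in for multi-source integration here).
conn.execute("CREATE TABLE stg_orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO stg_orders VALUES (?, ?)", [(1, 100.0), (2, 40.0)])

# Add business logic: derive a calculated measure per customer.
conn.execute("""
    CREATE TABLE dm_customer_revenue AS
    SELECT c.id, c.name, c.country, SUM(o.amount) AS total_revenue
    FROM stg_customers c
    LEFT JOIN stg_orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name, c.country
""")

for row in conn.execute("SELECT * FROM dm_customer_revenue ORDER BY id"):
    print(row)  # (1, 'Jane Doe', 'UK', 100.0) and (2, 'Sam Lee', 'US', 40.0)
```

Where each stage happens, and in what order, is precisely what the three philosophies below disagree about.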
Bill Inmon introduced the original concept of a data warehouse. Prior to this, reporting systems would often run against a copy of the production database or against the production database itself. Bill identified that we need a new data store (a data warehouse) which is modeled differently so that reports can be produced effectively.
Bill’s proposal was that all of the data from all of the data sources should be integrated into an enterprise data warehouse, and then data marts (within a data virtualisation layer) should be created from the enterprise data warehouse for each business function.
Bill believes that data should either be cleansed at source, or be cleansed after it has been acquired and before it is integrated into the data warehouse.
- Data modeling in Bill Inmon’s data warehouse is in 3rd Normal Form (3NF), so storage is efficient.
- Since data marts are created after the enterprise data warehouse stage, they can be created in virtual form (using data virtualisation technologies like Composite, or database views) or physically stored; a sketch of a view-based data mart follows this list. This is a notable advantage over Ralph Kimball’s methodology.
- Modeling the entire enterprise works fine if you start to produce an enterprise data warehouse while your company is still very small. However, for a large organisation, attempting to model the complete enterprise leads to data warehouse projects that run for years without delivering anything.
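As an illustrative sketch of the virtual form mentioned above, the following (with hypothetical table names) creates a data mart as a plain database view over two 3NF tables, so no data is physically copied:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two 3NF tables: each fact is stored once and linked by a foreign key.
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE sale (
        sale_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        amount REAL
    )
""")
conn.execute("INSERT INTO customer VALUES (1, 'Jane Doe')")
conn.executemany("INSERT INTO sale VALUES (?, 1, ?)", [(1, 100.0), (2, 40.0)])

# The 'data mart' is just a view: because nothing is copied, it can be
# dropped and re-created cheaply when a business function's needs change.
conn.execute("""
    CREATE VIEW sales_mart AS
    SELECT c.name AS customer, COUNT(*) AS sales, SUM(s.amount) AS revenue
    FROM sale s JOIN customer c ON c.customer_id = s.customer_id
    GROUP BY c.name
""")

print(conn.execute("SELECT * FROM sales_mart").fetchall())  # [('Jane Doe', 2, 140.0)]
```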
Ralph Kimball came up with the dimensional modeling (aka star schema) technique, which is very easy to grasp from a business user’s perspective. A data mart modeled as a star schema consists of a central fact table containing a business function’s calculated measures, sliced and diced by the dimensions surrounding it. Ralph Kimball proposed that an enterprise data warehouse could be created by building data marts for each business functional area, while ensuring that each dimension is conformed (i.e. made identical) so that it can be used by all the data marts that make use of it. The methodology involves producing a data mart for a single business function first, then adding a second data mart for another business function, gradually building up a constellation of data marts that represents the entire enterprise.
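As a minimal, illustrative star schema (the table and column names are hypothetical), here is a fact table surrounded by two dimensions, with a query that slices and dices the measures by dimension attributes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        units_sold  INTEGER,
        revenue     REAL
    );
    INSERT INTO dim_date    VALUES (20240101, 2024, 1), (20240201, 2024, 2);
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO fact_sales  VALUES (20240101, 1, 3, 30.0), (20240201, 2, 1, 50.0);
""")

# Slice and dice: aggregate the fact table's measures by any combination
# of attributes taken from the surrounding dimensions.
query = """
    SELECT d.year, p.category, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_date    d ON d.date_key    = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, p.category
"""
for row in conn.execute(query):
    print(row)
```

Conformance is then a discipline rather than code: if another mart also needs dim_date or dim_product, it must use the identical definition, which is what lets measures from different star schemas be compared.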
Like Bill, Ralph believes that data should either be cleansed at source, or be cleansed after it has been acquired and before it is integrated into the data warehouse.
- Since the data warehouse consists of a set of data marts, there is no need for an additional data virtualisation layer (which Bill Inmon and Dan Linstedt acknowledge is required to meet the needs of a business function).
- Since the methodology allows a single data mart to be created for a single business function initially, for large organisations Kimball projects deliver results to the business much more quickly than Bill Inmon’s methodology, which requires that the entire organisation be modeled first.
- Since the data warehouse is built incrementally, and dimensions need to be conformed so that calculated measures from different star schemas can be joined, dimensions need to be re-modeled as each new business function is analysed. This remodeling effort can be expensive.
- Since the business logic is baked into the data marts (within the enterprise data warehouse), the methodology is inflexible to changes in business logic.
Dan Linstedt’s methodology is based on his concept of an enterprise data warehouse being, in effect, an enhanced enterprise data hub with data lineage and change history added, and with data retained for longer than would normally be the case for an enterprise data hub. Integration of data from the data sources takes place when a data mart (within the data virtualisation layer) is created for each business function.
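As a minimal sketch of what storing change history and lineage can look like, the following uses a hub/satellite split in the spirit of Dan’s modeling approach, though the table and column names here are illustrative and far from a full specification:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_customer (
        customer_key INTEGER PRIMARY KEY,
        business_key TEXT UNIQUE            -- e.g. the CRM customer number
    );
    CREATE TABLE sat_customer (
        customer_key  INTEGER REFERENCES hub_customer(customer_key),
        load_ts       TEXT,                 -- when this version was captured
        record_source TEXT,                 -- which source system it came from
        name          TEXT,
        country       TEXT
    );
""")

def load(business_key, source, name, country):
    """Insert the business key once; append every attribute version."""
    conn.execute(
        "INSERT OR IGNORE INTO hub_customer (business_key) VALUES (?)",
        (business_key,),
    )
    key = conn.execute(
        "SELECT customer_key FROM hub_customer WHERE business_key = ?",
        (business_key,),
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO sat_customer VALUES (?, ?, ?, ?, ?)",
        (key, datetime.now(timezone.utc).isoformat(), source, name, country),
    )

load("C001", "crm", "Jane Doe", "UK")
load("C001", "crm", "Jane Doe", "US")   # a later change: the old row is kept

print(conn.execute("SELECT * FROM sat_customer").fetchall())
```

Because the satellite is append-only, the second load does not overwrite the first: both versions survive, each stamped with when it arrived (change history) and where it came from (lineage).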
Dan believes that all data should be cleansed at source.
- Since data integration doesn’t take place until the data virtualisation layer, it is comparatively quick to create what Dan calls a data warehouse, which I’d argue is in fact an enhanced enterprise data hub.
- Dan’s enhanced enterprise data hub stores all of the data captured from the data sources and tracks data lineage and change history, so it captures an unadulterated history of what happened at source. This differs from Bill’s and Ralph’s methodologies, which integrate data from the data sources, thus altering the original form of the data, before persisting it in a data warehouse.
- Since data integration takes place in each data mart for each business function, and the dimensions within each data mart are not conformed, it is not possible to produce consistent reports across the entire organisation.
- If data cannot be cleansed at source, Dan’s methodology has nowhere else within it where cleansing can take place. This is a significant disadvantage, since additional storage is required for duplicated records, and poor-quality data will distort reported results.