Data warehousing philosophies & methodologies

Data warehousing refers to the need to integrate related data from your data sources, capture change history, group entities by subject area and then retain the result in a physical data store.

Philosophies

Two key philosophies have emerged. The original philosophy, championed by Ralph Kimball and Bill Inmon, adheres to the definition of data warehousing above and delivers a data warehouse that contains a ‘Single version of the truth’, i.e. any system which uses the data in the data warehouse will produce consistent results.

A more recent philosophy, championed by Dan Linstedt, holds that there is no such thing as a ‘Single version of the truth’, since business stakeholders have different perspectives, i.e. multiple versions of the truth. On this view, all that can be achieved is a ‘Single version of the Facts’: data should be collected from all of the data sources and stored in unintegrated form in an enhanced enterprise data hub. A data mart (within a data virtualisation layer) can then be produced for each business function, mirroring that function’s perspective on what the business truth is.

Methodologies

In order to provide data in a form suitable for reporting, the following stages need to take place:

  • Acquire – Data needs to be captured from the data sources
  • Clean – Data needs to be standardised and duplicates removed.
  • Integrate – Data from multiple data sources needs to be integrated
  • Add Business Logic – Raw data captured from source needs to have additional business logic applied. For example, each business function has its own ways of measuring its performance, so we need to create calculated measures so that the function’s performance can be reported on. In a similar fashion, some of the attributes familiar to the business may need to be derived from the raw attributes captured from the data source, and data hierarchies added. (A rough sketch of these four stages follows this list.)
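
As a rough illustration only (the file names, column names and the revenue measure below are all invented), the four stages might look something like this in a small Python/pandas script:

    import pandas as pd

    # Acquire: pull raw extracts from the (hypothetical) source systems.
    crm = pd.read_csv("crm_customers.csv")         # e.g. customer_id, name, country
    billing = pd.read_csv("billing_invoices.csv")  # e.g. customer_id, invoiced_amount

    # Clean: standardise values and remove duplicate records.
    crm["country"] = crm["country"].str.strip().str.upper()
    crm = crm.drop_duplicates(subset="customer_id")
    billing = billing.drop_duplicates()

    # Integrate: bring the two sources together on a shared business key.
    integrated = crm.merge(billing, on="customer_id", how="left")

    # Add business logic: derive the calculated measure the business reports on,
    # here a hypothetical revenue-by-country figure.
    revenue_by_country = (
        integrated.groupby("country", as_index=False)["invoiced_amount"].sum()
    )
    print(revenue_by_country)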

Bill Inmon

Bill Inmon's System Architecture

Bill Inmon introduced the original concept of a data warehouse. Prior to this, reporting systems would often run against a copy of the production database or against the production database itself. Bill identified that we need a new data store (a data warehouse) which is modeled differently so that reports can be produced effectively.

Bill’s proposal was that all of the data from all of the data sources should be integrated into an enterprise data warehouse, and that data marts (within a data virtualisation layer) should then be created from the enterprise data warehouse for each business function.

Bill believes that data should be cleansed either at source, or after it has been acquired and before it is integrated into the data warehouse.

Advantages:

  • Data in Bill Inmon’s data warehouse is modeled in 3rd Normal Form, so storage is efficient
  • Since data marts are created after the enterprise data warehouse stage, they can be created in virtual form (using data virtualisation technologies like Composite, or database views) or physically stored. This is a notable advantage over Ralph Kimball’s methodology (a minimal sketch of a view-based data mart follows this list)
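
A minimal sketch of that second point, using Python’s sqlite3 with invented table and column names: here the sales data mart is nothing more than a database view over the normalised warehouse tables.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # A tiny 3NF-style slice of a hypothetical enterprise data warehouse.
    conn.executescript("""
    CREATE TABLE customer (customer_id   INTEGER PRIMARY KEY,
                           customer_name TEXT,
                           country       TEXT);
    CREATE TABLE orders   (order_id     INTEGER PRIMARY KEY,
                           customer_id  INTEGER REFERENCES customer(customer_id),
                           order_date   TEXT,
                           order_amount REAL);

    -- A virtual data mart for the sales function: no extra storage, just a
    -- view that reshapes the warehouse for reporting.
    CREATE VIEW sales_mart AS
    SELECT c.country, o.order_date, SUM(o.order_amount) AS total_sales
    FROM   orders o
    JOIN   customer c ON c.customer_id = o.customer_id
    GROUP BY c.country, o.order_date;
    """)

Whether a mart like this remains a view or is physically instantiated then becomes an implementation choice rather than a modeling one, which is the flexibility described above.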

Disadvantages:

  • Modeling the entire enterprise works fine if you start to produce an enterprise data warehouse when your company is still very small. However, for a large organisation, attempting to model the complete enterprise leads to data warehouse projects running for years without delivering anything.

Ralph Kimball

Ralph Kimball’s System Architecture

Ralph Kimball came up with the dimensional modeling (aka star schema) technique, which is very easy to grasp from a business user’s perspective. A data mart modeled as a star schema consists of a central fact table containing a business function’s calculated measures, sliced and diced by the dimensions surrounding it. Ralph Kimball proposed that an enterprise data warehouse could be built by creating data marts for each business functional area, while ensuring that each dimension was conformed (i.e. made identical) so that it could be used by all data marts that make use of it. The methodology involves producing a data mart for a single business function first, then adding a second data mart for another business function, and gradually building up a constellation of data marts which represents the entire enterprise.
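
As a minimal sketch (table and column names invented for illustration), a single star schema for a sales function might look like this in Python/sqlite3:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Dimensions: the descriptive attributes the business slices and dices by.
    CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, calendar_date TEXT, month TEXT, year INTEGER);
    CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_code TEXT, product_name TEXT, colour TEXT);
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, country TEXT);

    -- Fact table: one row per sale, holding the measures plus foreign keys
    -- to the surrounding dimensions.
    CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date(date_key),
        product_key  INTEGER REFERENCES dim_product(product_key),
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        quantity     INTEGER,
        sales_amount REAL
    );
    """)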

Like Bill, Ralph believes that data should be cleansed either at source, or after it has been acquired and before it is integrated into the data warehouse.

Advantages

  • Since the data warehouse consists of a set of data marts, there is no need for an additional data virtualisation layer (a layer which both Bill Inmon and Dan Linstedt acknowledge is required to meet the needs of each business function).
  • Since the methodology allows a single data mart to be created for a single business function initially, for large organisations Kimball projects deliver results to the business much more quickly than Bill Inmon’s methodology, which requires that the entire organisation be modeled first.

Disadvantages

  • Since the data warehouse is built incrementally, and dimensions need to be conformed so that calculated measures from different star schemas can be combined, dimensions need to be remodeled as each new business function is analysed. This remodeling effort can be expensive (see the drill-across sketch after this list).
  • Since the business logic is baked into the data marts (within the enterprise data warehouse), the methodology is inflexible to changes in business logic.
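
To make the conformance requirement concrete, here is a minimal drill-across sketch (invented names again): lining sales up against stock per product only works because both fact tables reference the same, conformed product dimension.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- The conformed dimension, shared across both star schemas.
    CREATE TABLE dim_product    (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE fact_sales     (product_key INTEGER REFERENCES dim_product(product_key), sales_amount  REAL);
    CREATE TABLE fact_inventory (product_key INTEGER REFERENCES dim_product(product_key), units_on_hand INTEGER);
    """)

    # Drill-across query: one row per product, combining measures from the two
    # star schemas via the shared dimension.
    rows = conn.execute("""
    SELECT p.product_name,
           (SELECT SUM(s.sales_amount)  FROM fact_sales s     WHERE s.product_key = p.product_key) AS total_sales,
           (SELECT SUM(i.units_on_hand) FROM fact_inventory i WHERE i.product_key = p.product_key) AS units_on_hand
    FROM dim_product p
    """).fetchall()
    print(rows)

If a later business function needs extra product attributes, dim_product has to be remodeled for every star schema that shares it, which is the remodeling cost described above.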

Dan Linstedt

Dan Linstedt System Architecture

Dan Linstedt’s methodology is based on his concept of an enterprise data warehouse effectively being an enhanced enterprise data hub with data lineage and change history added, and data retained for longer than would normally be the case for an enterprise data hub. Integration of data from the data sources takes place when a data mart (within the data virtualisation layer) is created for each business function.
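
As a rough sketch of what that raw, unintegrated store can look like (table and column names invented for illustration): a hub carries the business key plus lineage metadata, and each source system gets its own satellite holding that source’s attributes in their original form, with change history accumulating by load date.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Hub: one row per business key, plus lineage metadata.
    CREATE TABLE hub_product (
        product_hk    TEXT PRIMARY KEY,   -- hash of the business key
        product_code  TEXT,               -- the business key itself
        record_source TEXT,
        load_date     TEXT
    );

    -- One satellite per source system: attributes kept in their source form,
    -- change history accumulating by load_date.
    CREATE TABLE sat_product_crm (
        product_hk   TEXT REFERENCES hub_product(product_hk),
        load_date    TEXT,
        product_name TEXT,
        colour       TEXT,
        PRIMARY KEY (product_hk, load_date)
    );
    CREATE TABLE sat_product_erp (
        product_hk   TEXT REFERENCES hub_product(product_hk),
        load_date    TEXT,
        description  TEXT,
        colour_code  INTEGER,
        PRIMARY KEY (product_hk, load_date)
    );
    """)

Integration across the satellites is deferred to the data marts in the virtualisation layer, which is where each business function’s interpretation is applied.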

Dan believes that all data should be cleansed at source.

Advantages

  • Since data integration doesn’t take place until the data virtualisation layer, it is comparatively quick to create what Dan calls a data warehouse, which I’d argue is in fact an enhanced enterprise data hub.
  • Dan’s enhanced enterprise data hub stores all of the data captured from the data sources and tracks data lineage and change history, so it captures an unadulterated history of what happened at source. This differs from Bill’s and Ralph’s methodologies, which integrate data from the data sources, thus altering its original form, prior to persisting it in a data warehouse.

Disadvantages

  • Since data integration takes place in each data mart for each business function, and the dimensions within each data mart are not conformed, it is not possible to produce consistent reports across the entire organisation.
  • If data cannot be cleansed at source, Dan’s methodology has nowhere within it where cleansing can otherwise take place. This is a significant disadvantage, since additional storage is required for duplicated records and poor-quality data will distort reported results.

6 thoughts on “Data warehousing philosophies & methodologies”

  1. While this is an interesting post, and thank you for taking the time to discuss it, there are a few corrections I wish to make.

    1) Bill Inmon has changed his mind over the years, and now endorses the Data Vault system of Business Intelligence, and agrees with me in my approach to answering the enterprise problems. So much so, that he’s offered quotes along the way: “Data Vault is the optimal choice for data warehouses.”
    2) Bill Inmon & I recently completed / co-authored a new book, in which Data Vault is a part of a Big Data approach. The book is: A Primer for the Data Scientist, Big Data and Data Vault 2.0 – additional endorsement for the Data Vault from the man himself.
    3) The DV produces a single version of the facts FOR A POINT IN TIME. Unfortunately, the definition offered here, did not include this component. This is critical, why? because facts change over time, as does interpretation.
    4) It is an incorrect assumption to state that “it is impossible to produce consistent reports from the data warehouse” This is not correct at all. It is also not correct to say that “everything is virtualized downstream of the Data Vault warehouse”. Unfortunately the author of this article does not take in to account things like Point In Time, Bridge Tables, and the fact that we can & often DO produce conformed dimensions, whether it’s through views / virtualized, OR instantiated table structures. Just because we leave the calculations / business rules to downstream, DOESN’T mean we don’t conform data.
    5) The second “disadvantage” is also incorrect. The author of this article states that my methodology does not account for it anywhere. Sorry, but this is also incorrect. In my class: Data Vault 2.0 Boot Camp & Private Certification, I teach the following: IF you need it, you can execute cleansing logic on the way out of the Raw Data Vault – on the way to something we call a Business Data Vault. A BDV is NOT a full copy of the Raw Data, but rather an intermediate layer (like conformed dimensions if you like) where the results of cleansing can be physicalized and then shared downstream.

    Unfortunately, I haven’t had time to produce public knowledge about all these concepts, thus leading to gaps in understanding (as seen here). However, I just completed another new book (co-authored by Michael Olschimke) available on Amazon.com as a pre-order: “Building a Scalable Data Warehouse with Data Vault 2.0” – 700+ pages, soup to nuts compendium of how, why, and where all of this works. It will walk you through everything (in detail with examples) of the components which I’ve stated here.

    Now, the biggest reason why we “separate” sourcing from Interpretation – which again was ignored as a benefit of the DV2 approach:
    1) Avoid re-engineering, we can adjust and stream in real time, any and all new data directly in to the Raw DV model / architecture quickly, and efficiently.
    2) Business Rules / Interpretation change, and to be agile, we must be able to change them quickly AND apply it back in history
    3) Source systems have GAPS, and to cleanse the data upstream of the EDW / DV erases or covers those gaps, causing problems with auditability. In reality, ALL cleansing and fixing (Gap closing) should be done in the source systems themselves, this is where business realize the most ROI.

    Hope this helps clear the air,
    Dan Linstedt
    http://DanLinstedt.com


    1. Hi Dan,

      In terms of data cleansing, you don’t always have control over the data sources. For example, I used to work at a card processing company, where a lot of the data was submitted by the merchant, with the acquiring bank in between us and the merchant. There are 1000s of merchants.

      Even within a company, it’s often the case that application development is driven to implement a system quickly to get to market first, or that validation checks are reduced because web analytics has identified a drop-off in users buying products when too many rules are enforced.

      These are good examples of why you need somewhere in your methodology that allows data to be cleansed when you first see it, rather than at its original source.


      1. Hello,

        I agree with your statements, not always having control over the sources, particularly when the sources are acquired through M&A or the sources are actually “external”. Although ultimately, this responsibility of alignment rests with the CxO level of the company to address. It’s still a business GAP issue, and a data alignment issue, at the end of the day I can prove that the business is losing significant revenue by not fixing and aligning the sources.

        That said, we (IT) still have to address this gap somewhere, so – in light of that, we do address it in the methodology, and again – I apologize for not having the time (before now) to actually articulate the full methodology in public view – we address this on the way out of the Raw Data Vault (EDW), and on the way to the marts. Cleansing can & often does happen, as does consolidation and conformity. We just separate the sourcing, timing, latency and availability issues from the quality, interpretation and interpolation rules.

        In fact, I currently work with a couple large banks (Commonwealth Bank and a few others) that are going through this right now – they are a multi-year effort, and are successfully implementing Data Vault 2.0 System of Business intelligence.

        Thank-you again for taking the time to blog on your thoughts, I appreciate your insights.

        Cheers,
        Dan Linstedt


  2. almost forgot… Enterprise Data Hub? I disagree. Look at Bill Inmon’s definition of Data Warehouse: Time Variant, Non-Volatile, Integrated, etc… The Data Vault model results in meeting ALL of these requirements. Yes, it is integrated by Business Key, but it is also time-variant and non-volatile, resulting in meeting the full definition offered by Bill Inmon for what a Data Warehouse is and should be.

    Hope this helps,
    Dan Linstedt


    1. It’s partially integrated.

      To look at a practical example of why this is a problem: if, say, we chose product_code to be our enterprise-wide business key for product, then you’d be able to select all of the attributes about a product held in disparate data sources which shared the same product_code, but each source system would have different product attributes. You wouldn’t be able to produce a report where you sum up measures based on a particular attribute, e.g. product_colour, because attributes still have different structure, content and meaning in each satellite system. Since most reports require you to filter or aggregate on attributes rather than solely on business key information, integrating just the business key is of limited benefit.

      I think Inmon’s original vision of an integrated data warehouse was that all attributes which are common between data sources and need to be reported on should be integrated, not just the business keys (see the toy sketch below).
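
      As a toy sketch of the product_code example (all names invented):

          import sqlite3

          conn = sqlite3.connect(":memory:")
          conn.executescript("""
          CREATE TABLE hub_product     (product_code TEXT PRIMARY KEY);
          CREATE TABLE sat_product_web (product_code TEXT, colour TEXT);         -- e.g. Red
          CREATE TABLE sat_product_pos (product_code TEXT, colour_code INTEGER); -- e.g. 3 meaning red
          """)
          # The business keys line up across sources, but a report that filters or
          # aggregates by colour still needs source-specific mapping rules applied
          # downstream before the results are comparable.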


      1. True, Inmons original vision – that is correct. However, Inmon has changed his mind about what a Data Warehouse is, and should be. He now agrees with Data Vault 2.0

        With regards to your statements:
        “. You wouldn’t be able to produce a report where you were able to sum up measures, for example based on a particular attribute e.g. product_colour because attributes still have different structure, content and meaning in each satellite system. ”

        I disagree, there is nothing stopping you from properly integrating these measures / attributes down stream of the Raw Data Vault. Nothing stopping you at all, in fact, this is what all consumers of Raw Data Vault do today – they do not have any problems building the Business Vault, or even marts (as defined by conformity and Dr Kimball). So I am not quite convinced you can make this claim.

        Furthermore, Business need to know, and need to see these gaps – between and across these metrics. Sometimes it is intentional, other times it is not intentional. More often than not it reflects broken source systems, broken applications, broken synchronization logic in the sources – (GAP analysis), leading to poor data quality. These things cause money loss at the business level, and they need to be fixed. Because the Raw DV stores data this way, we can produce visible metrics that show How often, when, and what percentage of data has this pattern. It is up to the business to decide WHY, and if it’s important enough to fix, or if it’s by design.

        Now, to say: different content and different meaning, that would mean you have not properly constructed your Hub keys according to Grain. Why? because if the meaning changes, and /or the grain changes then you need to separate the Hub keys, thus separating the descriptive or Satellite data as well.

        To go one further: if you say that these things have a different meaning or content – then practically speaking, these attributes are not summable by this very definition, which means if we were to sum them, we would be producing incorrect and non-auditable answers for the business.

        Just my two cents,
        Dan Linstedt
