Data vault – A flexible approach to data modeling

1 of the primary reasons given for the rise of NoSQL databases is that they store object data in the formats that web developers use with self-describing schemas requiring no data modeling e.g. JSON, XML, Key-Value pair, and which allow flexible changes to the schema. The main problem with traditional data modeling is that over time, new columns will need to be integrated in to an existing data model from a new system or from additions to an existing system. The solution to this problem is often to add new nullable columns to the end of an existing table to accommodate the new attributes without effecting existing records. The problem with that solution is that:-

  1. The columns often shouldn’t be nullable, so a compromise in the data model has been made
  2. Adding more and more columns to a table causes performance and storage issues, since you’re having to store and load larger and increasingly sparse records
  3. Any data integration scripts that were written to populate the original table need to be adapted in order to add the new columns and re-tested. This is a time consuming process.

Another idea is to add an additional (satellite) table to store the new columns. This resolves the problems identified above. This idea has essentially been taken a stage further in a modeling technique known as Data Vault modeling. The idea is that rather than attempt to store all of the attributes (describing columns) in the original table, why not just store the business key (unique id for a record that the business is familiar with) plus an artificial primary (surrogate) key. This is referred to as a hub table. All of the attributes are then placed in a satellite table from the start. The satellite table has the surrogate key in it, so that you can associate a record in the satellite table with the business key in the hub table.  Any relationships between the hub tables are also placed in to what are called link tables, in order to make the model completely flexible to change.

To provide a simple example, where we have some customers who’ve made some purchases of product from your website. We’ll simplify it by assuming they all pay in the same currency. Using traditional modeling techniques, we might come up with a model as shown below:-

traditional simplified customer purchase data model Where as, using a data vault modeling technique, you’d create a model like this:-

Simplified data vault modelThe advantage in this modeling technique comes when this web business decides that they need to start taking purchases in additional currencies. In the traditional model, you need to re-work your model and data integration scripts, and re-test any affected parts of your system whereas in the data vault model, you simply have to add new tables and your existing data model is unaffected, so no re-testing required. I’ll grey out unaffected parts of the data model in both examples, to illustrate the point Simplified data vault model after change Adding the currency table isn’t a problem because it’s new, but adding the key to the purchase table means either defaulting existing records to point to an entry in the currency table for your existing purchases, or you have to make the column nullable. In either case, you have to re-write your data integration script which populate the purchase table and re-test. However, using the data vault modelling technique, we get this model after adding currency:- Simplified data vault model after change There are just additions to the data model, the original tables are unaffected so no re-work and re-test, and the currency_sk can be correctly set to be not null. But what happens, if we acquire a new company – BigDataParis, and they have a new customer table with columns called cust_name, cust_birth_date, cust_registered date? We ask our data analyst to investigate and he tells us that cust_name is simply a merger of cust_first_name and cust_last_name and a pre-processor can be created to split the information. Cust_birth_date is the same as our existing cust_date_of_birth but is european format rather our UK format. Cust_registered_name is new, and the BigDataParis reports make extensive use of it. Our solution to this will be to have a pre-processor do the split of cust_name and reformat the birth date, so that we have the same input to our existing customer workflow. For the cust_registered_name which is unique to BigDataParis, we will create a new workflow to populate a new table BigDataParis_cust_attribs, which will be linked to our customer hub table via the customer_sk surrogate key. Pictorially, our original customer workflow looked like this:-

Original BigDataLondon customer work flow We want to keep that workflow intact so that we don’t have to re-test, so we simply add a pre-processor in front like this:-

BigDataParis customer acquisition workflow That leaves us to simply add a new workflow to deal with the BigDataParis.cust_registered_date attribute:-

New BigDataParis_customer_attribs workflow Et voila, we’ve added BigDataParis customer information to existing database without needing to alter our existing table or workflow. Note: There is a lot more to data vault modeling, but this article is highlighting the flexibility characteristic that such a model provides to a company that is rapidly evolving.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s