One of the primary reasons given for the rise of NoSQL databases is that they store object data in the formats that web developers already use (e.g. JSON, XML, key-value pairs): self-describing schemas that require no up-front data modeling and allow flexible changes to the schema. The main problem with traditional data modeling is that, over time, new columns need to be integrated into an existing data model, either from a new system or from additions to an existing one. The usual solution is to add new nullable columns to the end of an existing table, accommodating the new attributes without affecting existing records. The problems with that solution are:-
- The columns often shouldn’t be nullable, so a compromise has been made in the data model
- Adding more and more columns to a table causes performance and storage issues, since you have to store and load larger and increasingly sparse records
- Any data integration scripts that were written to populate the original table need to be adapted to add the new columns and then re-tested, which is a time-consuming process.
An alternative is to add a separate (satellite) table to store the new columns, which resolves the problems identified above. This idea has essentially been taken a stage further in a modeling technique known as Data Vault modeling. Rather than attempting to store all of the attributes (describing columns) in the original table, you store only the business key (the unique identifier for a record that the business is familiar with) plus an artificial primary (surrogate) key. This is referred to as a hub table. All of the attributes are placed in a satellite table from the start. The satellite table also carries the surrogate key, so that a record in the satellite table can be associated with the business key in the hub table. Any relationships between hub tables are placed into what are called link tables, making the model completely flexible to change.
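To make the hub/satellite split concrete, here is a minimal sketch using SQLite. The table and column names (hub_customer, sat_customer_details, and so on) are illustrative assumptions, not part of any standard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hub: just the surrogate key, the business key, and a load date.
conn.execute("""
    CREATE TABLE hub_customer (
        customer_sk INTEGER PRIMARY KEY,   -- surrogate key
        customer_id TEXT NOT NULL UNIQUE,  -- business key
        load_date   TEXT NOT NULL
    )
""")

# Satellite: all the descriptive attributes, keyed back to the hub.
conn.execute("""
    CREATE TABLE sat_customer_details (
        customer_sk        INTEGER NOT NULL REFERENCES hub_customer(customer_sk),
        cust_first_name    TEXT NOT NULL,
        cust_last_name     TEXT NOT NULL,
        cust_date_of_birth TEXT NOT NULL,
        load_date          TEXT NOT NULL,
        PRIMARY KEY (customer_sk, load_date)
    )
""")

conn.execute("INSERT INTO hub_customer VALUES (1, 'CUST-001', '2024-01-01')")
conn.execute("""INSERT INTO sat_customer_details
                VALUES (1, 'Ada', 'Lovelace', '1815-12-10', '2024-01-01')""")

# The surrogate key ties a satellite record back to its business key.
row = conn.execute("""
    SELECT h.customer_id, s.cust_first_name, s.cust_last_name
    FROM hub_customer h
    JOIN sat_customer_details s ON s.customer_sk = h.customer_sk
""").fetchone()
print(row)  # ('CUST-001', 'Ada', 'Lovelace')
```

Note that the hub contains nothing that ever needs to change; every descriptive attribute lives in a satellite from day one.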
To provide a simple example, suppose we have some customers who have made purchases of products from our website. We’ll simplify by assuming they all pay in the same currency. Using traditional modeling techniques, we might come up with a model as shown below:-
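For reference, the traditional model might look something like the following SQLite sketch. The exact columns are assumptions for illustration; the key point is that the purchase table references customer and product directly and holds its attributes inline:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id        INTEGER PRIMARY KEY,
    cust_first_name    TEXT NOT NULL,
    cust_last_name     TEXT NOT NULL,
    cust_date_of_birth TEXT NOT NULL
);
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    price        REAL NOT NULL
);
-- Single currency assumed, so the amount is stored directly on the purchase.
CREATE TABLE purchase (
    purchase_id   INTEGER PRIMARY KEY,
    customer_id   INTEGER NOT NULL REFERENCES customer(customer_id),
    product_id    INTEGER NOT NULL REFERENCES product(product_id),
    purchase_date TEXT NOT NULL,
    amount        REAL NOT NULL
);
""")

tables = {r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
print(sorted(tables))  # ['customer', 'product', 'purchase']
```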
The advantage of this modeling technique comes when the web business decides that it needs to start taking purchases in additional currencies. In the traditional model, you need to re-work your model and data integration scripts, and re-test any affected parts of your system, whereas in the Data Vault model you simply add new tables; your existing data model is unaffected, so no re-testing is required. I’ll grey out the unaffected parts of the data model in both examples, to illustrate the point.

In the traditional model, adding the currency table isn’t a problem because it’s new, but adding the currency key to the purchase table means either defaulting existing records to point to an entry in the currency table for your existing purchases, or making the column nullable. In either case, you have to re-write the data integration scripts which populate the purchase table and re-test them. Using the Data Vault modeling technique, however, we get this model after adding currency:- there are only additions to the data model, the original tables are unaffected (so no re-work and no re-testing), and the currency_sk can correctly be set to NOT NULL.

But what happens if we acquire a new company, BigDataParis, and they have a new customer table with columns called cust_name, cust_birth_date and cust_registered_date? We ask our data analyst to investigate, and he tells us that cust_name is simply a merger of cust_first_name and cust_last_name, and a pre-processor can be created to split the information. Cust_birth_date is the same as our existing cust_date_of_birth, but in European format rather than our UK format. Cust_registered_date is new, and the BigDataParis reports make extensive use of it. Our solution will be to have a pre-processor split cust_name and reformat the birth date, so that we have the same input to our existing customer workflow.
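The currency change in the Data Vault model can be sketched as pure additions. In this SQLite sketch, only currency_sk comes from the article; the other table and column names (hub_currency, link_purchase_currency) are illustrative assumptions. Nothing existing is ALTERed, and currency_sk can be NOT NULL because no pre-existing rows need a default:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- New hub for the currency business key.
CREATE TABLE hub_currency (
    currency_sk   INTEGER PRIMARY KEY,   -- surrogate key
    currency_code TEXT NOT NULL UNIQUE,  -- business key, e.g. 'EUR'
    load_date     TEXT NOT NULL
);
-- New link tying each purchase to the currency it was made in.
CREATE TABLE link_purchase_currency (
    purchase_sk INTEGER NOT NULL,  -- surrogate key from the purchase hub
    currency_sk INTEGER NOT NULL REFERENCES hub_currency(currency_sk),
    load_date   TEXT NOT NULL,
    PRIMARY KEY (purchase_sk, currency_sk)
);
""")

conn.execute("INSERT INTO hub_currency VALUES (1, 'EUR', '2024-06-01')")
conn.execute("INSERT INTO link_purchase_currency VALUES (42, 1, '2024-06-01')")

code = conn.execute("""
    SELECT c.currency_code
    FROM link_purchase_currency l
    JOIN hub_currency c ON c.currency_sk = l.currency_sk
    WHERE l.purchase_sk = 42
""").fetchone()[0]
print(code)  # EUR
```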
For cust_registered_date, which is unique to BigDataParis, we will create a new workflow to populate a new table, BigDataParis_cust_attribs, which will be linked to our customer hub table via the customer_sk surrogate key. Pictorially, our original customer workflow looked like this:-
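A sketch of the pre-processor and the new satellite table might look like this. BigDataParis_cust_attribs and customer_sk come from the article; the name-split rule (first space separates first and last name) and the source date format (assumed here to be DD.MM.YYYY) are assumptions for illustration:

```python
import sqlite3

def preprocess(record):
    """Reshape a BigDataParis customer record to match our existing workflow."""
    first, _, last = record["cust_name"].partition(" ")
    day, month, year = record["cust_birth_date"].split(".")
    return {
        "cust_first_name": first,
        "cust_last_name": last,
        "cust_date_of_birth": f"{day}/{month}/{year}",  # reformatted to UK style
        "cust_registered_date": record["cust_registered_date"],
    }

conn = sqlite3.connect(":memory:")
# New satellite holding the attribute unique to BigDataParis, linked to the
# customer hub via the customer_sk surrogate key.
conn.execute("""
    CREATE TABLE BigDataParis_cust_attribs (
        customer_sk          INTEGER NOT NULL,
        cust_registered_date TEXT NOT NULL,
        load_date            TEXT NOT NULL,
        PRIMARY KEY (customer_sk, load_date)
    )
""")

rec = preprocess({"cust_name": "Marie Curie",
                  "cust_birth_date": "07.11.1867",
                  "cust_registered_date": "2024-03-01"})
print(rec["cust_first_name"], rec["cust_date_of_birth"])  # Marie 07/11/1867
```

The pre-processor feeds the existing customer workflow unchanged, while the new satellite is populated by its own separate workflow.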
Et voilà, we’ve added the BigDataParis customer information to our existing database without needing to alter our existing tables or workflow. Note: there is a lot more to Data Vault modeling, but this article highlights the flexibility that such a model provides to a company that is rapidly evolving.