Canonical/Reference data modeling
This refers to an enterprise-level data model which an information architect creates in order to best capture the entities held in the organisation and the relationships between them. The purpose of such a model is to ensure that delivery data architects don't reinvent the wheel or miss vital entities or relationships. When evaluating external vendors' products, it is also useful to compare the vendor's data model against the reference data model.

Reference data model patterns are available for certain industries. For banking, for example, the somewhat abstract Information FrameWork (IFW) produced by IBM is quite prevalent. The advantage of IFW is that, by supplying a framework and abstract concepts, it can be used to create any kind of reference data model. The disadvantage is that it does not provide a reference data model itself; it provides an information architect with a framework for producing one, and two information architects using the same framework will produce two different reference data models. The ISO 20022 business model is a more pragmatic reference data model for payment processing, and it underpins the structure of the XML messages used for inter-bank payments. The disadvantage of ISO 20022 is that it mainly covers payment processing and needs to be extended substantially to cover the rest of the bank.
To illustrate examples of data modeling, I'll present a situation where a customer buys a book at a store. From your business analyst, you're told that:-
1. You have multiple stores
2. Each store has many customers
3. A customer can shop at multiple stores
4. Each store has many books
5. A customer can make multiple purchases
Conceptual data model
A conceptual data model has entities and relationships marked on it but no information about primary keys, foreign keys or attributes. The diagram below shows an example of what the conceptual data model for the book store situation might look like. Note that many-to-many relationships are not resolved at this point.
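A conceptual model like this can be sketched in code as nothing more than a set of entities and a list of relationships. This is a minimal, illustrative representation (not from any modeling tool); the entity names come from the book store example above.

```python
# Conceptual model for the book store example: entities and relationships
# only, no keys or attributes. The many-to-many between customer and store
# is deliberately left unresolved at this stage.
entities = {"store", "customer", "book", "purchase"}

# (entity, cardinality, entity), as stated by the business analyst.
relationships = [
    ("customer", "many-to-many", "store"),   # a customer shops at many stores
    ("store", "one-to-many", "book"),        # each store holds many books
    ("customer", "one-to-many", "purchase"), # a customer makes many purchases
]

for left, cardinality, right in relationships:
    print(left, cardinality, right)
```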
Logical data model
A logical data model applies detail to the conceptual data model, adding the primary and foreign keys and attributes and resolving many-to-many relationships. Any consuming systems (reports, data extracts, applications) should be able to trace their fields back to an attribute in the logical data model.
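As a hypothetical sketch of what the logical model adds, the entities below carry surrogate primary keys and foreign keys, and the customer/store many-to-many is resolved through the purchase entity. The attribute names (customer_sk and so on) are illustrative, not taken from a real model.

```python
# Logical model sketch: surrogate keys, foreign keys, and an associative
# entity (Purchase) resolving the customer/store many-to-many.
from dataclasses import dataclass

@dataclass
class Store:
    store_sk: int        # primary key (surrogate)
    store_name: str

@dataclass
class Customer:
    customer_sk: int     # primary key (surrogate)
    customer_name: str

@dataclass
class Book:
    book_sk: int         # primary key (surrogate)
    store_sk: int        # foreign key to Store
    title: str

@dataclass
class Purchase:
    """Associative entity resolving the customer/store many-to-many."""
    purchase_sk: int     # primary key (surrogate)
    customer_sk: int     # foreign key to Customer
    store_sk: int        # foreign key to Store
    book_sk: int         # foreign key to Book
```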
Physical data model
A physical data model takes the logical data model and applies it to a particular database system, e.g. Oracle, SQL Server, MySQL or DB2. Changes that are made when converting a logical data model to a physical data model include:-
1. Data types and column lengths are finalised.
2. Tables are assigned to tablespaces.
3. Logical entities will be converted to database tables and attributes to columns.
4. Logical entities may be de-normalised for performance reasons.
Once you've finished creating your physical data model in your design tool of choice, e.g. ERwin, ER/Studio or InfoSphere Data Architect, you can generate a data definition language (DDL) script which can be run to create the objects (tables and indexes) inside your database.
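A generated DDL script for the book store example might look something like the one below. The table and column names are assumptions for illustration, and SQLite stands in for the target database so the script is self-contained and runnable.

```python
# Run an illustrative DDL script for the book store physical model.
import sqlite3

DDL = """
CREATE TABLE store (
    store_sk     INTEGER PRIMARY KEY,
    store_name   VARCHAR(100) NOT NULL
);
CREATE TABLE customer (
    customer_sk   INTEGER PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL
);
CREATE TABLE book (
    book_sk   INTEGER PRIMARY KEY,
    store_sk  INTEGER NOT NULL REFERENCES store (store_sk),
    title     VARCHAR(200) NOT NULL
);
CREATE TABLE purchase (
    purchase_sk INTEGER PRIMARY KEY,
    customer_sk INTEGER NOT NULL REFERENCES customer (customer_sk),
    store_sk    INTEGER NOT NULL REFERENCES store (store_sk),
    book_sk     INTEGER NOT NULL REFERENCES book (book_sk)
);
CREATE INDEX ix_purchase_customer ON purchase (customer_sk);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['book', 'customer', 'purchase', 'store']
```

In a real project the data types, tablespace assignments and any de-normalisation would be specific to the chosen database system, as described in the list above.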
Data modeling techniques
Normalisation
This was the original way to model relational databases, created by Edgar F. Codd. As an aide-memoire, the normalisation rules are encapsulated in the phrase "the key, the whole key and nothing but the key, so help me Codd".
For an entity (data object) to be considered to be in 1st normal form (1NF), it must have a primary key (an attribute which uniquely identifies a record) on which the attributes in a particular record depend. For example, you might assign customer_sk to be a surrogate primary key for the customer entity.
To be in 2nd normal form (2NF), the attributes in the entity must depend on the whole of the key and not just parts of it, i.e. if you have a composite primary key (one made up of more than one attribute), the other attributes in a particular record have to depend on all of the key attributes and not just some of them.
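To make the "whole key" rule concrete, here is a small, hypothetical sketch: a stock entity keyed on the composite key (store_sk, book_sk), where store_name depends on store_sk alone, i.e. on only part of the key. The entity and attribute names are assumptions for illustration.

```python
# 2NF violation: stock is keyed on (store_sk, book_sk), but store_name
# depends only on store_sk, so it repeats for every book in the store.
stock_not_2nf = {
    (1, 100): {"quantity": 5, "store_name": "London Books"},
    (1, 101): {"quantity": 2, "store_name": "London Books"},  # repeated
}

# The 2NF fix: move store_name into a store entity keyed on store_sk alone.
stock = {
    (1, 100): {"quantity": 5},
    (1, 101): {"quantity": 2},
}
store = {
    1: {"store_name": "London Books"},
}
```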
Finally, for an entity to be in 3rd normal form (3NF), every attribute in a record has to depend on the primary key only, i.e. no non-key attribute should depend on another non-key attribute. In our example, we create a new entity, store_country, so that the implied dependency between country_name and store_name in the store entity is broken. (See
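The 3NF fix described above can be sketched as follows: country_name is moved out of store into a new store_country entity, so the country name lives in exactly one place. SQLite is used here so the sketch is runnable; the names are taken from the example, with surrogate keys assumed.

```python
# 3NF fix: split country_name out of store into store_country.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_country (
    country_sk   INTEGER PRIMARY KEY,
    country_name VARCHAR(60) NOT NULL
);
CREATE TABLE store (
    store_sk   INTEGER PRIMARY KEY,
    store_name VARCHAR(100) NOT NULL,
    country_sk INTEGER NOT NULL REFERENCES store_country (country_sk)
);
""")
conn.execute("INSERT INTO store_country VALUES (1, 'United Kingdom')")
conn.execute("INSERT INTO store VALUES (10, 'London Books', 1)")

# country_name is now reached via the foreign key, not stored per store.
row = conn.execute("""
    SELECT s.store_name, c.country_name
    FROM store s JOIN store_country c ON s.country_sk = c.country_sk
""").fetchone()
print(row)  # ('London Books', 'United Kingdom')
```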
Dimensional modeling
Dimensional modeling was created by Ralph Kimball, who promoted the idea that it is better to build an enterprise data warehouse by focusing on one particular business capability first, developing a data mart consisting of a central fact table joined to one or more dimensions (known as a star schema). Once the first data mart is complete, you then focus on the second. Where both data marts share dimensions, you need to conform the attributes within them so that they meet the needs of both business functions. This is one of the drawbacks of the approach: there is quite a bit of re-work if the initial analysis didn't reveal the need for the extra attributes.

The main advantage of a star schema over a 3NF approach is that there is only a single join between the central fact table and each dimension. Joining data is very CPU-intensive, so reducing the number of joins improves performance. However, since dimensions are de-normalised to reduce joins, there is a trade-off with storage costs, I/O and memory allocation. Another problem is that if you add a new dimension and join it to an existing fact table, the existing records in the fact table either have to be assigned a default value for the new key, or the new key must be made nullable. A final problem is that the dimensional model loses relationships. In the example below, you would be able to report on purchases of books by customers in stores, but you wouldn't be able to query books located in stores for which there had been no purchases.
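A star schema for the book store example could be sketched as below: one fact table joined to each dimension by a single key. The table and column names are assumptions, and SQLite stands in for the warehouse database. The last comment illustrates the "lost relationship" problem described above.

```python
# Star schema sketch: fact_purchase joined to each dimension by one key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store    (store_key    INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_book     (book_key     INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE fact_purchase (
    customer_key INTEGER REFERENCES dim_customer (customer_key),
    store_key    INTEGER REFERENCES dim_store (store_key),
    book_key     INTEGER REFERENCES dim_book (book_key),
    amount       REAL
);
""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO dim_store VALUES (1, 'London Books')")
conn.execute("INSERT INTO dim_book VALUES (1, 'SQL Basics')")
conn.execute("INSERT INTO dim_book VALUES (2, 'Unsold Title')")
conn.execute("INSERT INTO fact_purchase VALUES (1, 1, 1, 9.99)")

# One join per dimension: purchases by customer, store and book.
sales = conn.execute("""
    SELECT c.name, s.name, b.title, f.amount
    FROM fact_purchase f
    JOIN dim_customer c ON f.customer_key = c.customer_key
    JOIN dim_store    s ON f.store_key    = s.store_key
    JOIN dim_book     b ON f.book_key     = b.book_key
""").fetchall()
print(sales)  # [('Alice', 'London Books', 'SQL Basics', 9.99)]

# The book-in-store relationship is lost: 'Unsold Title' has no fact row,
# so "which books sit in which stores" cannot be answered from this model.
```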