Adding middleware to your information management system architecture


Middleware refers to components which facilitate movement of data from 1 or more sources to 1 or more targets with or without transformation. There are a lot of options in this area:-

Application to application data integration

In this option, an application will either expose services (via an API), which other applications can call to retrieve data from the source application. This approach is common in service orientated architecture.

The advantage of this approach is that an application is split in to services which are dedicated to a particular task with the improved performance and scalability that that provides.

Disadvantages of this approach are:-

1. You only get access to the data that the service makes available to you. There is likely to be much more data held privately.

2. Consuming services are making direct requests, which may cause performance issues to the source service.

This approach is most suitable for small scale data integration. For example, where you wish to filter a large set of records by supplying query criteria to the source service which returns a small result set.

Common services layer styles include:-


Enterprise Service Bus


In a complex service-orientated environment, application to application service calls can become inefficient and difficult to manager. Amongst other things, anEnterprise Service Bus acts as a service agnostic router for service calls, allows service calls to be monitored, can add a security layer, transforms data, passes messages

Example solutions components which provide an Enterprise Service Bus capability include:-

Apache ServiceMixMicrosoft BizTalk Server, IBM WebSphere Enterprise Service Bus, Tibco ActiveMatrix BusinessWorks

Application to message queue integration


In this option, an application can post a message to a message queue, once it’s completed a particular task.

This option is very useful in a workflow engine, where other applications are dependent on the completion of a task before they can start.

An advantage of this approach is that source application is buffered from the target applications that are consuming the data.

This shares the disadvantage that you can only get access to data that the application has posted to the message queue. Other data that the application may hold privately is unavailable.

Example solutions components which provide message queueing (peer to peer and publish-subscribe models) include:-

Apache KafkaIBM Websphere MQ, Microsoft Message Queuing (MSMQ)

Datastore to datastore integration

Datastore to datastore integration is more suitable for “Big Data” problems, since it moves and transforms data in bulk.

A datastore can refer to:-

1. A message on a queue
2. A file
3. A table in a database

Enterprise Data Hub (EDH)


An enterprise data hub differs from a message queue in so much as it persists data to a data store and holds it for much greater longer periods of time. An EDH will typically source it’s data in a variety of formats from data sources and have multiple consumers of the data. An EDH differs from a data warehouse in that it can hold a variety of formats, typically in files, and typically in the same form as it’s held at source.

Examples of an EDH include:-
Apache Hadoop, Informatica Data Integration Hub

Database mirroring


This refers to replication of data from a primary database to 1 or more homogenous standby database in read only mode. It’s purpose is for business continuity/disaster recovery.

It’s advantages over more generalised database replication software is that it guarantees that the data in the standby database is an exact mirror of the primary at all times (some solutions also guarantee that code is mirrored).
It’s disadvantages are:-
1. You can’t apply any transformations between the source and target systems.

For the Hadoop Distributed File System (HDFS), the every data block can be replicated to x other data nodes, where the data nodes can be located locally or remotely, so high availability/disaster recovery can be achieved within the core solution, by ensuring that a quorum amount of data nodes are available in an offsite location. See the How does a file get distributed in a hadoop cluster article for more details

Example solutions components which provide database mirroring include:-

Apache Hadoop Distributed File System (HDFS),Oracle Data Guard, IBM DB2 High Availability Disaster Recovery (HADR), Microsoft SQL*Server AlwaysOn

Database replication


This option works by having:-
i) A capture process read information from the database logs, converts the information to a database agnostic format and puts on a message queue.
ii) An apply process translates the information to sql understandable to the target datastore and applys the information to the target datastore

It’s purpose is to replicate data to target data stores where the data can be used without affecting performance of the data source.

The advantages of adding a message queue between the capture and apply processes are
1. It can buffer the capture and apply process i.e. you could temporarily apply data to a target datastore at a slower rate than data is being captured so long as the message queue size limit isn’t exceeded. This is a useful situation when you have to deal with traffic that experiences spikes in demand.
2. It allows multiple subscribers to read the data on the message queue, so you can have multiple heterogenous target data stores.

Disadvantages include:-
1. Due to this option allowing replication to heterogenous databases, compromises are made on what can be replicated. For example, large objects (LOBs) and database code e.g. udfs and stored procedures are typically not replicated.
2. You usually have to adjust parameters on the source database so additional information is captured in the redo logs to allow data replication to work
3. Data replication is asynchronous, so couldn’t be relied upon to provide a solid backup/disaster recovery option.

Some lightweight transformation is generally a feature of this software but since the purpose of database replication is to move data from source to target as quickly as possible, this is limited. An example of light transformation is the addition of source_system and apply_timestamp to target tables which is useful for data lineage purposes and understanding as to when the data store was populated.

Example solutions components which provide database replication include:-

Oracle GoldenGate, IBM Infosphere Data Replication, Microsoft SQL*Server Transactional Replication

Extract, Transform, Load (ETL)

flat files joiner

There are variations on this e.g. Extract,Load & Transform (ELT), but the general idea is that you can source data from multiple data stores and apply transformation operators e.g. sort, merge, join, union, add surrogate key etc. before applying the result set to a target data store.

ETL/ELT is used where bulk data from multiple data sources needs to be integrated, transformed and pushed to one or more data targets.

Example solutions components which provide ETL and/or ELT capabilities include:-

Pentaho Data IntegrationInformatica PowerCenter, Ab Initio, IBM Infosphere Datastage, Oracle Data Integrator (ODI)Microsoft SQL Server Integration Services (SSIS)


Approaches to storing and making available Big Data to consumers

There are 2 main approaches underway for dealing with the problem that an organisation has in needing to collate, cleanse, standardise and integrate data generated within and outside of the enterprise.

Enterprise Data Hub (aka a Data Lake or Data Integration Hub)

In this approach, the idea is to simply pull data from a variety of data sources in to a central data hub, clean the data and store it in a slightly purified raw form. Data is then integrated on a use case by use case basis.

Advantage of this approach

i) Relatively quick to implement, with just pre-processing/cleaning of data required before storage
ii) Can deliver data in near real time to downstream systems
iii) Allows integration on a use case by use case basis.

Disadvantages of this approach

i) For each use case, data integration scripts need to be created
ii) Since data integration isn’t centralised, each data integration solution may produce different results.
iii) The enterprise data hub can rapidly become an “integration hairball“.
iv) There is a tendency to not pre-process data using this method, creating a data swamp rather than a data lake.

Enterprise Data Warehouse (EDW)

EDW can mean different things to different people but here I’ll use it in it’s broadest sense where it holds transaction level data (often called an operational data store) as well as aggregated data and links transaction data to standardised master & reference data.

I’ll split this down in to 2 well established sub-processes

Bill Inmon Approach

In this approach, data is typically sourced from an enterprise data hub, but data from multiple sources and multiple business functions is integrated in to a single data model, from which business function specific data models can be generated (aka data marts).

Advantages of this approach

i) Once the EDW is complete, then all of the use cases can be met from an integrated data model which means they’ll produce consistent results.
ii) EDW is able to provide integrated data from across the enterprise
iii) No danger of an “integration hairball”

Disadvantages of this approach

i) For a small company, producing an EDW is relatively straightforward, but for large organisations especially those that are evolving rapidly, producing an integrated data model which covers the entire enterprise is a well known “money pit”.

ii) Even if the EDW is completed, for rapidly evolving organisations, an integrated data model is often time consuming to adjust, leading to a substantial lag time between when data starts to be produced at source and when development is complete to add the new data so that it’s available.

Ralph Kimball approach

Ralph Kimball proposed that an information management team should go for quick wins, by picking out those parts of the business which would yield the most dividends to the organisation first e.g. sales and creating a data mart for that business function first, and then moving on to the next business function, evolving an EDW by joining up multiple data marts.

Advantages of this approach

i) Relatively quick to implement (although slower than an enterprise data hub)
ii) Provides standardised data for a particular business function

Disadvantages of this approach

i) In order to report on data across multiple business functions, the dimensions (e.g. customer, geography, calendar date etc.) need to be conformed (made identical). This means re-working dimensions that were created for the 1st business function to add new attributes and data identified as required by subsequent data functions
ii) In order to provide cross-business function data, data has to be joined from multiple fact tables. This is relatively poorly performing in query times.

Data Modeling explained

Canonical/Reference data modeling

This refers to an enterprise-level data model which an information architect would create in order to best capture entities and relationships between data held in the organisation. The purpose of such a model is to ensure that delivery data architects don’t re-produce the wheel or miss vital entities or relationships. When evaluating external vendor’s products, it is also useful to compare the vendor’s data models against the reference data model. Reference data model patterns are available for certain industries. For example, for banking, the slightly abstract information framework (IFW) produced by IBM is quite prevalent. The advantage of IFW is that by supplying a framework and abstract concepts, the framework could be used to create any kind of a reference data model. The disadvantage is that it’s not providing a reference data model, it’s providing an information architect with a framework to produce such a model. 2 information architects using the same framework will produce 2 different reference data models. ISO20022 business model is a more pragmatic reference data model for payment processing which supports the structure of XML messages which are used for inter-bank payments. The disadvantage of ISO20022 is that it mainly covers payment processing and needs to be extended substantially to cover the rest of the bank.

Business Requirements

To illustrate examples of data modeling, I’ll present a situation where a customer buys a book at a store.From your business analyst, you’re told that:-

1. You have multiple stores

2. Each store has many customers

3, A customer can shop at multiple stores

3. Each store has many books

4. A customer can make multiple purchases

Conceptual data model

This refers to a data model which has entities and relationships marked on it but no information about primary keys, foreign keys or attributes. The diagram below shows an example of how the conceptual data model for the book store situation might look like. Note: That many to many relationships are not resolved at this point.

Conceptual Data Model

Logical data model

This applies detail to the conceptual data model, adding the primary & foreign keys and attributes and resolving many to many relationships. Any consuming systems (reports, data extracts, applications) should be able to trace their fields back to an attribute in a logical data model.

Logical Data Model

Physical data model

This takes the logical data model and applies it to a particular database system e.g. Oracle, SQL*Server, MySQL, DB2. Changes that are made when converting a logical data model to a physical data model include:-

1. Data types and column lengths are finalised.

2. Tables are assigned to tablespaces.

3. Logical entities will be converted to database tables and attributes to columns.

4. Logical entities may be de-normalised for performance reasons.

Once you’ve completed creating your physical data model in your design tool of choice e.g. ERWin, ER/Studio, Infosphere Data Architect etc., you can generate a data definition language (DDL) script which can be run to create the objects (tables and indexes) inside your database.

Data modeling techniques

3rd Normal Form

This was the original way in which to model relational databases, created by Edgar F. Codd. As an aide-memoire , normalisation rules are encapsulated in the phrase “The key, the whole key and nothing but the key, so help me Codd”.

For an entity (data object) to be considered to be in 1st normal form (1NF), it must have a primary key (an attribute which uniquely identifies a record) which the attributes in a particular record are dependent upon. For example, you might assign customer_sk to be a surrogate primary key for customers.

1st Normal Form

To be in 2nd normal form, the attributes in the entity must be dependent on the whole of the key and not just parts of it. i.e. If you have a composite (made up of more than 1 attribute) primary key,  the other attributes in a particular record have to be dependent on all of the keys and not just part of them.

2nd normal form

Finally, for an entity to be in 3rd normal form, an attribute in a record has to be uniquely identified by the primary key only i.e. no additional attribute which does not form part of the primary key should need to be included to ensure uniqueness. In our example, we create a new entity store_country, so that the implied relationship between country_name and store_name in the store entity is broken. (See


Dimensional modeling was created by Ralph Kimball, who promoted the idea that it is better to build an enterprise data warehouse by focusing on 1 particular business capability first and developing a data mart consisting of a central fact table with joins to 1 or more dimensions (known as a star schema). Once the 1st data mart is complete, you then focus on the 2nd data mart. Where both data marts share dimensions, you need to conform the attributes within them so as to meets the needs of both business functions. This is 1 of the drawbacks of this approach, in that there is quite a bit of re-work if the analysis didn’t reveal the need for extra attributes initially. The main advantage of a star schema over a 3NF approach is that there is only a single join between the central fact table and each dimension. Joining data is very cpu-intensive, so reducing the number of joins increases performance. Since dimensions are de-normalised to improve cpu performance, there is a trade-off with storage costs and i/o and memory allocation, however. Another problem that arises is that if you add a new dimension and join it to an existing fact table, the existing records in the fact table either have to have a default value assigned for the new key, or the new key made nullable. A final problem is that the dimensional model loses relationships. In the example below, you would be able to report on purchases of books by customers in stores, but you wouldn’t be able to query books located in stores for which there had been no purchases.

Dimensional Model

Analytics Job Roles which use Big Data

A veritable proliferation of job titles over the years in the analytics space, which this post will try to explain:-

Business Intelligence (BI)

After reporting, this is probably the most common term in the analytics space. It refers to people who write reports and dashboards which graphically demonstrate key performance indicators – KPIs (aka measures) for particular business units. Reports can be diced and sliced by dimensions which are grouped in to hierarchies and which have attributes. BI specialists will typically use tools such as Cognos, OBIEE, SSRS/AS, Microstrategy & Business Objects (amongst) others to produce reports & dashboards. The other main area of BI is data discovery/insights, where data can be pulled in to a tool and manipulated and graphed immediately in order that patterns/trends can be discovered more quickly than it takes to develop a report or dashboard. Tableau and Qlik are the best known tools for doing this.

Business Analysts

Capture the BI requirements from the business, produce wireframes of reports and dashboards, and produce slide decks covering KPIs, Dimensions, Hierarchies, known data quality issues for approval by data governance bodies.

Data Analysts

Profile data held at source to assist a data architect in developing a data model and a data engineer in cleaning the data prior to analysis.


Old job title for people with Maths PHDs who use statistics to identify patterns and control risk.

Data Scientists

New job title for guys typically with Maths PHDs who use statistics to model patterns in the real world. For example, they might produce a model which calculates weightings to be applied to customer’s credit risk or potential fraudulent behaviour. Incoming data is then compared against the model and final scores computed.
Data scientists will typically produce models and visual output using programming languages with statistical functions such as R, Python, Scala and Matlab.

Machine Learning

Also known as artifical intelligence. Neural networks used to be an earlier form of this area. It’s a subgroup of data science, where the model can be trained by using initial seed data, rather than having rules pre-programmed.

Data Mining

Refers to algorithms which detect hitherto unknown patterns in data sets.

Quant Analysts

Data scientists working on trading platforms for investment banks


Data scientists working in the insurance industry to price premiums and produce claims risk models.

Predictive Modeling

Determination of likelihood of future data based on the known past. Usually comes with confidence ranges which get wider with time. Works well for deterministic problems but not problems with complex variables e.g. stock markets or global warming.

Data Engineers

Term for roles which involve getting data from source to target without loss, cleaning, classifying and preparing the data so that analytics can work successfully.

How do jobs get scheduled and managed in Hadoop?

In a previous article, I’ve described how does a file get distributed in a hadoop cluster.

In this article, I’ll describe how jobs get scheduled in Hadoop so as to use this data.

One of the key benefits that Hadoop offers is that rather than data being fetched from a distributed file cluster to a central server and a job being executed against that data, Hadoop sends the job to where the data is. The diagram below explains how this works:-

Hadoop architecture

In order that a particular application does not absorb too many resouces (cpu, memory, i/o), the concept of a resource container has been created, in which a particular application can utilise only the resources that a container allocates to it.

A node can refer to a single server or a server which has been logically partitioned so that each partition appears to a client to be a separate server.

With Hadoop:-

1. The client application submits a job to the resource manager.

2. The resource manager takes the job from the job queue and allocates it to an application master. It also manages and monitors resource allocations to each application master and container on the data nodes.

3. The application master divides the job in to tasks and allocates it to each data node.

4. On each data node, a Node manager manages the containers in which the tasks run.

5. The application master will ask the resource manager to allocate more resource to particular containers, if necessary.

6. The application master will keep the resource manager informed as to the status of the jobs allocated to it, and the resource manager will keep the client application informed.

The advantage of this architecture include:-

1. A client application does not have to worry about how hadoop works internally, it communicates only with the resource manager.

2. Resources are allocated based on the needs of each task and managed appropriately so that 1 task does not grab all of the resources of a particular node.

Links to useful Big Data development websites

Enterprise Data Hub

Get Hadoop & Spark on your desktop (via a virtual machine) – Hortonworks, MapR, Cloudera

Get Hadoop & Spark on the cloud – Amazon Elastic MapReduce 

Data Integration

Stream weblogs in to a Hadoop Distributed File System (HDFS) – Apache Flume

Process streams of data – Apache Storm

Implement a publish-subscribe message queue system – Apache Kafka

Push/pull data to/from Hadoop to a relational database – Apache Sqoop

Extract data from any website –

Integrate publicly available 3rd party data via a web service – Public APIs


Data Discovery

Query your big data using standard SQL in real time – Apache Drill

Discover insights from your data – Tableau, Qlik


Plot graphs online from your data –

Statistical analysis (aka Data Science)

Data analysis with some statistical calculations – Python

Serious statistical analysis – R programming

Reporting & Dashboards

Enterprise scale reporting & dashboards – Cognos, Business Objects, Microstrategy, OBIEE, SSRS