Adding middleware to your information management system architecture


Middleware refers to components that move data from one or more sources to one or more targets, with or without transformation. There are many options in this area:-

Application to application data integration

In this option, an application exposes services (via an API) which other applications can call to retrieve data from the source application. This approach is common in service-oriented architecture.

The advantage of this approach is that an application is split into services, each dedicated to a particular task, with the improved performance and scalability that this provides.

Disadvantages of this approach are:-

1. You only get access to the data that the service makes available to you. There is likely to be much more data held privately.

2. Consuming services make direct requests, which may cause performance issues for the source service.

This approach is most suitable for small-scale data integration. For example, where you wish to filter a large set of records by supplying query criteria to the source service, which returns a small result set.
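The filtering pattern above can be sketched as follows. This is a minimal, illustrative simulation: the service function, dataset and criteria names are invented for the example, and a real integration would make the call over an API rather than in-process.

```python
# Minimal sketch of application-to-application integration, simulating a
# hypothetical source service that exposes a filtered "orders" lookup.
# The function and field names are illustrative, not a real API.

def source_service_find_orders(records, status=None, min_total=0):
    """Simulates the source application's service: applies the caller's
    query criteria server-side and returns only the small matching set."""
    return [r for r in records
            if (status is None or r["status"] == status)
            and r["total"] >= min_total]

# A large record set held privately by the source application ...
orders = [
    {"id": 1, "status": "open",   "total": 120.0},
    {"id": 2, "status": "closed", "total": 40.0},
    {"id": 3, "status": "open",   "total": 15.0},
]

# ... of which the consumer only ever sees the filtered result.
result = source_service_find_orders(orders, status="open", min_total=100)
print(result)  # only order 1 matches both criteria
```

Note that the consumer never sees the full `orders` set, which mirrors the first disadvantage above: only what the service chooses to expose is available.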

Common services layer styles include:-


Enterprise Service Bus


In a complex service-oriented environment, application-to-application service calls can become inefficient and difficult to manage. Amongst other things, an Enterprise Service Bus acts as a service-agnostic router for service calls, allows service calls to be monitored, can add a security layer, transforms data and passes messages.
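Two of those responsibilities, content-based routing and in-transit transformation, can be sketched in a few lines. This is a toy model, not how any real ESB product is implemented; the route names and transformation are invented for illustration.

```python
# A toy content-based router, sketching one ESB responsibility: routing
# service calls by message type, with a transformation step applied in
# transit. Names are illustrative; real ESBs do far more (monitoring,
# security, protocol bridging).

def to_upper_name(msg):
    # Example transformation applied while the message is in transit
    return {**msg, "name": msg["name"].upper()}

routes = {
    "billing":  lambda msg: f"billing handled {msg['name']}",
    "shipping": lambda msg: f"shipping handled {msg['name']}",
}

def bus_dispatch(msg):
    """Route a message to the right service, transforming it on the way.
    Callers never address a target service directly."""
    msg = to_upper_name(msg)
    handler = routes[msg["type"]]   # service-agnostic routing by content
    return handler(msg)

print(bus_dispatch({"type": "billing", "name": "acme"}))  # billing handled ACME
```

The key design point is that the caller addresses the bus, not the target service, so routes can be changed, monitored or secured in one place.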

Example solution components which provide an Enterprise Service Bus capability include:-

Apache ServiceMix, Microsoft BizTalk Server, IBM WebSphere Enterprise Service Bus, Tibco ActiveMatrix BusinessWorks

Application to message queue integration


In this option, an application posts a message to a message queue once it has completed a particular task.

This option is very useful in a workflow engine, where other applications are dependent on the completion of a task before they can start.

An advantage of this approach is that the source application is buffered from the target applications that consume the data.

This approach shares the disadvantage that you can only access data that the application has posted to the message queue. Other data that the application may hold privately is unavailable.
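The producer/consumer decoupling can be sketched with Python's standard-library queue standing in for a real broker; the task names are invented for the example.

```python
# Sketch of application-to-message-queue integration, using Python's
# standard-library queue as a stand-in for a real broker such as Kafka
# or IBM MQ. The producer is buffered from its consumers: it posts a
# "task complete" message and moves on without waiting for them.
import queue

task_queue = queue.Queue()

def producer(task_id):
    # Source application posts a message once its task completes
    task_queue.put({"task": task_id, "status": "complete"})

def consumer():
    # Dependent application starts only once a completion message arrives
    msg = task_queue.get()
    return f"starting downstream work for task {msg['task']}"

producer("nightly-load")
print(consumer())  # starting downstream work for task nightly-load
```

In a real workflow engine the producer and consumer would be separate processes, with the broker providing the durability and delivery guarantees that an in-process queue cannot.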

Example solution components which provide message queueing (peer-to-peer and publish-subscribe models) include:-

Apache Kafka, IBM WebSphere MQ, Microsoft Message Queuing (MSMQ)

Datastore to datastore integration

Datastore to datastore integration is more suitable for “Big Data” problems, since it moves and transforms data in bulk.

A datastore can refer to:-

1. A message on a queue
2. A file
3. A table in a database

Enterprise Data Hub (EDH)


An enterprise data hub differs from a message queue insofar as it persists data to a data store and holds it for much longer periods of time. An EDH will typically source its data in a variety of formats from multiple data sources and have multiple consumers of the data. An EDH differs from a data warehouse in that it can hold a variety of formats, typically in files, and typically in the same form as the data is held at source.
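These properties, raw persistence in the source's own format plus multiple consumers of the same copy, can be sketched with a toy in-memory hub; dataset names and payloads are invented for the example.

```python
# A toy enterprise data hub: it persists data in whatever raw format the
# source supplied (CSV, JSON, ...) and serves the same copy to every
# consumer. A real EDH (e.g. one built on Hadoop) would persist to files.
hub = {}  # dataset name -> (format, raw payload)

def land(name, fmt, payload):
    # Data is held as-is, in its source form, with no transformation
    hub[name] = (fmt, payload)

def fetch(name):
    # Every consumer reads the same raw copy
    return hub[name]

land("customers", "csv", "id,name\n1,acme\n")
land("events", "json", '[{"id": 1, "kind": "login"}]')

fmt, raw = fetch("customers")
print(fmt, raw)
```

This contrasts with a warehouse, where landing the data would involve conforming it to a single modelled schema.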

Examples of an EDH include:-
Apache Hadoop, Informatica Data Integration Hub

Database mirroring


This refers to replication of data from a primary database to one or more homogeneous standby databases in read-only mode. Its purpose is business continuity/disaster recovery.

Its advantage over more generalised database replication software is that it guarantees that the data in the standby database is an exact mirror of the primary at all times (some solutions also guarantee that code is mirrored).
Its disadvantages are:-
1. You can’t apply any transformations between the source and target systems.

In the Hadoop Distributed File System (HDFS), every data block can be replicated to a configurable number of other data nodes, which can be located locally or remotely, so high availability/disaster recovery can be achieved within the core solution by ensuring that a quorum of data nodes is available in an offsite location. See the How does a file get distributed in a hadoop cluster article for more details.

Example solution components which provide database mirroring include:-

Apache Hadoop Distributed File System (HDFS), Oracle Data Guard, IBM DB2 High Availability Disaster Recovery (HADR), Microsoft SQL Server AlwaysOn

Database replication


This option works by having:-
i) A capture process that reads information from the database logs, converts it to a database-agnostic format and puts it on a message queue.
ii) An apply process that translates the information into SQL understandable by the target datastore and applies it to the target datastore.

Its purpose is to replicate data to target data stores where the data can be used without affecting the performance of the data source.

The advantages of adding a message queue between the capture and apply processes are:-
1. It buffers the capture and apply processes, i.e. you can temporarily apply data to a target datastore at a slower rate than data is being captured, so long as the message queue size limit isn’t exceeded. This is useful when you have to deal with traffic that spikes in demand.
2. It allows multiple subscribers to read the data on the message queue, so you can have multiple heterogeneous target data stores.

Disadvantages include:-
1. Because this option allows replication to heterogeneous databases, compromises are made on what can be replicated. For example, large objects (LOBs) and database code, e.g. user-defined functions (UDFs) and stored procedures, are typically not replicated.
2. You usually have to adjust parameters on the source database so that additional information is captured in the redo logs to allow data replication to work.
3. Data replication is asynchronous, so it can’t be relied upon to provide a solid backup/disaster recovery option.

Some lightweight transformation is generally a feature of this software, but since the purpose of database replication is to move data from source to target as quickly as possible, this is limited. An example of light transformation is the addition of source_system and apply_timestamp columns to target tables, which is useful for data lineage purposes and for understanding when the data store was populated.
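The capture/queue/apply pipeline, including the lineage columns just described, can be sketched as follows. The log format, table and column values are invented for illustration; real tools read the actual redo/transaction logs rather than a Python list.

```python
# Sketch of log-based replication: a capture process turns database log
# entries into a database-agnostic format and queues them; an apply
# process renders them as SQL for the target, adding source_system and
# apply_timestamp for lineage. SQLite stands in for the target database.
import queue
import sqlite3

change_queue = queue.Queue()

def capture(log_entries):
    # Convert each log entry to an agnostic change record and enqueue it
    for entry in log_entries:
        change_queue.put({"op": "insert", "table": entry["table"],
                          "row": entry["row"], "source_system": "crm"})

def apply_changes(target_db):
    # Translate each change record into SQL the target understands
    while not change_queue.empty():
        ch = change_queue.get()
        target_db.execute(
            f"INSERT INTO {ch['table']} VALUES (?, ?, ?, datetime('now'))",
            (ch["row"]["id"], ch["row"]["name"], ch["source_system"]))

target_db = sqlite3.connect(":memory:")
target_db.execute("CREATE TABLE customers "
                  "(id INTEGER, name TEXT, source_system TEXT, apply_timestamp TEXT)")

capture([{"table": "customers", "row": {"id": 1, "name": "acme"}}])
apply_changes(target_db)
print(target_db.execute("SELECT id, name, source_system FROM customers").fetchall())
# [(1, 'acme', 'crm')]
```

Because the queue sits between the two processes, `apply_changes` can run later or slower than `capture`, which is exactly the buffering advantage listed above.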

Example solution components which provide database replication include:-

Oracle GoldenGate, IBM InfoSphere Data Replication, Microsoft SQL Server Transactional Replication

Extract, Transform, Load (ETL)

[Figure: flat files joiner]

There are variations on this, e.g. Extract, Load & Transform (ELT), but the general idea is that you can source data from multiple data stores and apply transformation operators, e.g. sort, merge, join, union, add surrogate key, before applying the result set to a target data store.

ETL/ELT is used where bulk data from multiple data sources needs to be integrated, transformed and pushed to one or more data targets.
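A minimal ETL flow over two sources can be sketched as follows; the dataset and column names are invented for the example, and a real tool would operate on bulk files or tables rather than in-memory lists.

```python
# Minimal ETL sketch: extract from two illustrative sources, apply
# transformations (join, sort, surrogate key), and load the result
# into a target. Lists stand in for source and target data stores.

# Extract: two source datasets
customers = [{"cust_id": 10, "name": "acme"}, {"cust_id": 20, "name": "globex"}]
orders = [{"cust_id": 20, "total": 75.0}, {"cust_id": 10, "total": 120.0}]

# Transform: join on cust_id, sort by name, then assign a surrogate key
by_id = {c["cust_id"]: c for c in customers}
joined = sorted(
    ({"name": by_id[o["cust_id"]]["name"], "total": o["total"]} for o in orders),
    key=lambda r: r["name"])
target = [{"sk": i + 1, **row} for i, row in enumerate(joined)]

# Load: in a real flow this would be written to the target data store
print(target)
# [{'sk': 1, 'name': 'acme', 'total': 120.0}, {'sk': 2, 'name': 'globex', 'total': 75.0}]
```

ETL tools express the same join/sort/key steps as reusable graphical operators and run them against bulk data in parallel.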

Example solution components which provide ETL and/or ELT capabilities include:-

Pentaho Data Integration, Informatica PowerCenter, Ab Initio, IBM InfoSphere DataStage, Oracle Data Integrator (ODI), Microsoft SQL Server Integration Services (SSIS)

