Middleware refers to components that facilitate the movement of data from one or more sources to one or more targets, with or without transformation. There are many options in this area:-
Application to application data integration
In this option, an application exposes services (via an API) that other applications can call to retrieve data from the source application. This approach is common in service-oriented architecture.
The advantage of this approach is that an application is split into services, each dedicated to a particular task, which improves performance and scalability.
Disadvantages of this approach are:-
1. You only get access to the data that the service makes available to you. There is likely to be much more data held privately.
2. Consuming services make direct requests, which may cause performance issues for the source service.
This approach is most suitable for small-scale data integration, for example where you filter a large set of records by supplying query criteria to the source service, which returns a small result set.
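As a minimal sketch of this pattern, the Python below models a source application that keeps its records private and exposes a single query method. The names Customer, CustomerService and find_by_country are hypothetical and chosen purely for illustration; a real service would sit behind a network API rather than a direct method call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str
    country: str


class CustomerService:
    """Hypothetical service owned by the source application."""

    def __init__(self, customers):
        # Private data: consumers never read this list directly.
        self._customers = list(customers)

    def find_by_country(self, country, limit=100):
        """Public API: filter at the source and return a small result set."""
        matches = [c for c in self._customers if c.country == country]
        return matches[:limit]


service = CustomerService([
    Customer(1, "Asha", "IN"),
    Customer(2, "Bob", "GB"),
    Customer(3, "Chen", "GB"),
])

# The consumer supplies query criteria and only receives the matching rows.
gb_customers = service.find_by_country("GB")
print([c.name for c in gb_customers])
```

The consumer only ever sees what find_by_country returns, which illustrates both the small result set and the first disadvantage listed above.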
Common services layer styles include:-
Enterprise Service Bus
In a complex service-oriented environment, application to application service calls can become inefficient and difficult to manage. Amongst other things, an Enterprise Service Bus acts as a service-agnostic router for service calls, allows service calls to be monitored, can add a security layer, transforms data and passes messages.
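The routing, monitoring and transformation roles can be sketched in a few lines of Python. This is a toy illustration only, not how any particular Enterprise Service Bus product is implemented; ServiceBus, register, call and price_service are hypothetical names.

```python
class ServiceBus:
    """Toy service-agnostic router: routes, logs and transforms calls."""

    def __init__(self):
        self._routes = {}    # service name -> (handler, transform)
        self.call_log = []   # simple monitoring: names of services called

    def register(self, service_name, handler, transform=None):
        self._routes[service_name] = (handler, transform)

    def call(self, service_name, payload):
        if service_name not in self._routes:
            raise KeyError(f"no service registered as {service_name!r}")
        handler, transform = self._routes[service_name]
        self.call_log.append(service_name)
        if transform is not None:
            # Reshape the caller's payload into the form the service expects.
            payload = transform(payload)
        return handler(payload)


def price_service(request):
    # Hypothetical target service that expects a lower-case "sku" key.
    return {"sku": request["sku"], "price": 9.99}


bus = ServiceBus()
bus.register(
    "pricing",
    price_service,
    transform=lambda p: {"sku": p["SKU"].lower()},
)

# The caller knows only the service name, not where or how it is implemented.
reply = bus.call("pricing", {"SKU": "AB-1"})
print(reply, bus.call_log)
```

A security layer could be added at the same point in call, before the handler is invoked.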
Example solution components which provide an Enterprise Service Bus capability include:-
Application to message queue integration
In this option, an application posts a message to a message queue once it has completed a particular task.
This option is very useful in a workflow engine, where other applications are dependent on the completion of a task before they can start.
An advantage of this approach is that the source application is buffered from the target applications that are consuming the data.
This option shares the disadvantage of application to application integration: you can only get access to data that the application has posted to the message queue. Other data that the application may hold privately is unavailable.
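The workflow case described above can be sketched with Python's standard queue and threading modules, using an in-process queue as a stand-in for a real message broker. The function names and the message fields (task, status) are assumptions made for the example.

```python
import queue
import threading

task_events = queue.Queue()  # in-process stand-in for a message queue
results = []


def source_application():
    # ... the real work would happen here ...
    # On completion, post a message rather than calling consumers directly.
    task_events.put({"task": "nightly_load", "status": "complete"})


def dependent_application():
    # Blocks until the source application has posted its completion message.
    event = task_events.get()
    results.append(f"starting after {event['task']}")
    task_events.task_done()


consumer = threading.Thread(target=dependent_application)
consumer.start()
source_application()
consumer.join()
print(results)
```

The source application never references the consumer, which is the buffering advantage; equally, the consumer sees only the fields that were posted in the message.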
Example solution components which provide message queueing (point-to-point and publish-subscribe models) include:-
Datastore to datastore integration
Datastore to datastore integration is more suitable for “Big Data” problems, since it moves and transforms data in bulk.
A datastore can refer to:-
1. A message on a queue
2. A file
3. A table in a database
Enterprise Data Hub (EDH)
An enterprise data hub differs from a message queue in that it persists data to a data store and holds it for much longer periods of time. An EDH will typically source its data in a variety of formats from data sources and have multiple consumers of the data. An EDH differs from a data warehouse in that it can hold a variety of formats, typically in files, and typically in the same form as the data is held at source.
Database mirroring
This refers to replication of data from a primary database to one or more homogeneous standby databases in read-only mode. Its purpose is business continuity/disaster recovery.
Its advantage over more generalised database replication software is that it guarantees that the data in the standby database is an exact mirror of the primary at all times (some solutions also guarantee that code is mirrored).
Its disadvantages are:-
1. You can’t apply any transformations between the source and target systems.
For the Hadoop Distributed File System (HDFS), every data block can be replicated to x other data nodes, where the data nodes can be located locally or remotely, so high availability/disaster recovery can be achieved within the core solution by ensuring that a quorum of data nodes is available in an offsite location. See the “How does a file get distributed in a Hadoop cluster” article for more details.
Example solution components which provide database mirroring include:-
Database replication
This option works by having:-
i) A capture process that reads information from the database logs, converts it to a database-agnostic format and puts it on a message queue.
ii) An apply process that translates the information into SQL understandable to the target datastore and applies it to the target datastore.
Its purpose is to replicate data to target data stores where the data can be used without affecting performance of the data source.
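The capture and apply steps can be sketched as follows. This is an illustration under stated assumptions rather than the design of any replication product: source_log is a hypothetical list standing in for entries read from a database log, JSON is used as the database-agnostic format, and an in-memory SQLite database plays the target datastore.

```python
import json
import queue
import sqlite3

# Hypothetical change records standing in for entries read from a database log.
source_log = [
    ("INSERT", "customer", {"id": 1, "name": "Asha"}),
    ("INSERT", "customer", {"id": 2, "name": "Bob"}),
    ("UPDATE", "customer", {"id": 2, "name": "Robert"}),
]

change_queue = queue.Queue()  # in-process stand-in for the message queue


def capture(log_entries):
    """Convert each log entry to a database-agnostic message and queue it."""
    for op, table, row in log_entries:
        change_queue.put(json.dumps({"op": op, "table": table, "row": row}))


def apply_changes(conn):
    """Translate queued messages into SQL for the target and apply them.

    For brevity this sketch only handles the customer table.
    """
    while not change_queue.empty():
        change = json.loads(change_queue.get())
        row = change["row"]
        if change["op"] == "INSERT":
            conn.execute(
                "INSERT INTO customer (id, name) VALUES (?, ?)",
                (row["id"], row["name"]),
            )
        elif change["op"] == "UPDATE":
            conn.execute(
                "UPDATE customer SET name = ? WHERE id = ?",
                (row["name"], row["id"]),
            )
    conn.commit()


target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")

capture(source_log)
apply_changes(target)
rows = target.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
print(rows)
```

Because capture and apply_changes only share the queue, the apply side can run later or more slowly than the capture side without the source being aware of it.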
The advantages of adding a message queue between the capture and apply processes are:-
1. It buffers the capture process from the apply process, i.e. you could temporarily apply data to a target datastore at a slower rate than data is being captured, so long as the message queue size limit isn’t exceeded. This is useful when you have to deal with traffic that experiences spikes in demand.
2. It allows multiple subscribers to read the data on the message queue, so you can have multiple heterogeneous target data stores.
Disadvantages of this approach are:-
1. Because this option allows replication to heterogeneous databases, compromises are made on what can be replicated. For example, large objects (LOBs) and database code, e.g. UDFs and stored procedures, are typically not replicated.
2. You usually have to adjust parameters on the source database so that additional information is captured in the redo logs to allow data replication to work.
3. Data replication is asynchronous, so it can’t be relied upon to provide a solid backup/disaster recovery option.
Some lightweight transformation is generally a feature of this software, but since the purpose of database replication is to move data from source to target as quickly as possible, it is limited. An example of light transformation is the addition of source_system and apply_timestamp columns to target tables, which is useful for data lineage purposes and for understanding when the data store was populated.
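A light transformation of this kind might look like the sketch below. The add_lineage helper and the "crm" source name are hypothetical; only the source_system and apply_timestamp column names come from the text above.

```python
from datetime import datetime, timezone


def add_lineage(row, source_system, now=None):
    """Return a copy of the row with lineage columns added at apply time."""
    applied_at = now or datetime.now(timezone.utc)
    return {
        **row,
        "source_system": source_system,
        "apply_timestamp": applied_at.isoformat(),
    }


# A fixed timestamp is passed in here only to make the example repeatable.
fixed_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
enriched = add_lineage({"id": 1, "name": "Asha"}, "crm", now=fixed_time)
print(enriched)
```

The original row is left untouched and no business columns are changed, which keeps the transformation cheap enough to run inside the apply process.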
Example solution components which provide database replication include:-
Extract, Transform, Load (ETL)
There are variations on this, e.g. Extract, Load, Transform (ELT), but the general idea is that you can source data from multiple data stores and apply transformation operators, e.g. sort, merge, join, union, add surrogate key etc., before applying the result set to a target data store.
ETL/ELT is used where bulk data from multiple data sources needs to be integrated, transformed and pushed to one or more data targets.
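As a small sketch of the extract, transform and load steps, the Python below joins two in-memory sources, sorts the result, adds a surrogate key and loads it into an in-memory SQLite table. The customers and orders data, the order_fact table and the function names are all assumptions made for the example; real ETL tools operate on far larger volumes and many more operators.

```python
import sqlite3

# Extract: two hypothetical sources, already read into memory.
customers = [
    {"customer_id": 10, "name": "Asha"},
    {"customer_id": 20, "name": "Bob"},
]
orders = [
    {"order_id": "B-2", "customer_id": 20, "amount": 5.0},
    {"order_id": "A-1", "customer_id": 10, "amount": 12.5},
]


def transform(customers, orders):
    """Join orders to customers, sort by order_id, add a surrogate key."""
    names = {c["customer_id"]: c["name"] for c in customers}
    joined = [
        {**o, "name": names[o["customer_id"]]}
        for o in orders
        if o["customer_id"] in names
    ]
    joined.sort(key=lambda r: r["order_id"])
    return [{"order_key": key, **row} for key, row in enumerate(joined, start=1)]


def load(conn, rows):
    """Create the target table and insert the transformed rows."""
    conn.execute(
        "CREATE TABLE order_fact ("
        "order_key INTEGER PRIMARY KEY, order_id TEXT, name TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO order_fact VALUES (:order_key, :order_id, :name, :amount)",
        rows,
    )
    conn.commit()


target = sqlite3.connect(":memory:")
load(target, transform(customers, orders))
loaded = target.execute(
    "SELECT order_key, order_id, name, amount FROM order_fact ORDER BY order_key"
).fetchall()
print(loaded)
```

In an ELT variation the raw rows would be loaded into the target first and the join, sort and key assignment would then be run there, typically as SQL.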
Example solution components which provide ETL and/or ELT capabilities include:-