Prior to considering where you might include a big data solution within your organisation, it is worth taking a look at what a modern information management system architecture looks like.
Over the years, information management (the management of data and business information) has evolved in various steps from running reports directly against transaction processing systems, to the architecture outlined in the diagram below:-
The systems architecture has been split into 4 sections:-
1. Data sources are collections of data which are either already stored in a structured format in a relational database, or held in external files in semi-structured (e.g. email), structured (e.g. web logs) or unstructured (e.g. video) form.
The main data sources will usually be the transactional data sources. Examples of what I term a transactional data source would be a database that holds information entered via a website, or a backend processing system which generates invoices.
If you’re very lucky, there will be a pre-existing master data store within your organisation which holds a golden source of master data, to which you can link the reference data (e.g. customer, product, geography, dates, addresses) that exists in your transactional data sources. More usually, either a master data store doesn’t exist or extra master data has to be generated in order to meet the needs of a reporting system. I will write in detail about the purpose of a master data store and associated master data management application in another article. For consistent reporting, this component is absolutely vital to the success of an information management project.
2. Data consumers are the set of data analytics, applications, reports, data extracts and data discovery tools which consume the data which resides in the enterprise data hub and enterprise data warehouse.
Data analytics tools are used by data scientists to produce models emulating patterns of consumer behaviour and to write algorithms to match new consumers to these models, in order to target products to that consumer. Popular programming languages which help in the development of data analytics are R, MATLAB and Scala.
An example of an application might be a customer query system which retrieves information about a particular customer.
A report would be a static report which provides information to an end user either on screen or in PDF, CSV or Excel format (and other formats dependent on the capabilities of the reporting tool). An example of a static report might be a report showing today’s sales. Popular enterprise reporting tools include Cognos, MicroStrategy, Business Objects, SSRS and OBIEE.
A data extract is data that is to be sent to another system for further processing. Examples of these might be data that needs to be fed to an invoice system or a service support system.
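As a rough illustration of a data extract, the sketch below pulls one day's sales from a store and writes them out as CSV text for a downstream system. The `sales` table, its columns and the sample rows are hypothetical, and an in-memory SQLite database simply stands in for whichever data store the extract is really taken from.

```python
import csv
import io
import sqlite3


def build_extract(conn, sale_date):
    """Return one day's sales as CSV text for a downstream system."""
    rows = conn.execute(
        "SELECT sale_id, customer_id, amount FROM sales "
        "WHERE sale_date = ? ORDER BY sale_id",
        (sale_date,),
    ).fetchall()
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sale_id", "customer_id", "amount"])  # header row
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data standing in for the real data store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales "
    "(sale_id INTEGER, customer_id TEXT, amount REAL, sale_date TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "C001", 120.0, "2024-01-15"),
        (2, "C002", 75.5, "2024-01-15"),
        (3, "C001", 40.0, "2024-01-16"),
    ],
)

extract = build_extract(conn, "2024-01-15")
print(extract)
```

In practice the CSV text would be written to a file or message queue agreed with the receiving system rather than printed.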
- A data archive. Nowadays, data is usually held in 3 separate areas referred to as hot (online active), warm (online rarely used) and cold (offline very rarely used) storage. Hot data is kept on fast disks and cached and tuned for performance as much as possible. Warm data can be held on slower, cheaper storage but is still online accessible. Cold data is typically held on tape. The reason for storing data in 3 separate areas is that hot storage is comparatively expensive, and you don’t want your hot data slowed down by large volumes of warm and cold data which are rarely or very rarely accessed.
- Technical metadata management. As data moves from source to being consumed by a target system, it is transformed along the way. It is common for an end user to enquire as to the source of a particular field in a particular report and what transformation steps were applied to the data. A technical metadata management system provides this information.
- A business glossary is very important for an end user to understand what a particular field in a consuming system refers to. Although you may imagine that something such as a date would be uniform, you can have calendar dates (which can vary by country), fiscal (tax year) dates, accounting dates etc. You may also have to identify national holidays. The business glossary includes the names of the fields that appear in the consuming system, along with descriptions and any business logic (e.g. formulas or derivations) that created each field. A simple example of business logic would be gross profit = revenue - costs. Costs would be another example of business logic. Essentially, business logic consists of the attributes and measures which an end user would understand and which have been derived from the raw data provided by the data sources.
- User authentication is required in order to secure access to your information management system. This, along with data authorisation, allows or disallows access to certain data elements.
- An enterprise data hub is simply a store which allows structured, unstructured and semi-structured data from the data sources to be stored in raw form and made accessible to other systems as quickly as technically possible. Consuming systems that may read data in this unadulterated state might include basic monitoring applications used by operations to check that the data sources are live.
- An enterprise data warehouse contains transactional and aggregated data in an integrated form which allows homogeneous (similar) data from multiple data sources to be reported in a consistent manner. The transactional-level data held within the enterprise data warehouse is often separated out and referred to as an operational data store, since transactional-level data is typically used for operational, day-to-day reporting purposes. An example of an operational report might be a report which lists all of the details of the sales completed in the last day. The transactional data which is aggregated on a daily, weekly, monthly, quarterly or yearly basis is typically used for management reporting. For example, if you wanted to produce a report which showed revenue growth over the last 5 years, this would be best served from aggregated data.
- The data virtualisation layer can serve many purposes. It can join transactional data in the enterprise data warehouse with master data held in the master data source, it can incorporate additional transactional data held in other data sources which has not been included in the enterprise data warehouse, and it can apply business logic in one place, so that consistent business logic is provided to all consuming systems.
- The data masking layer allows rows or columns of data to be redacted (partially or fully overwritten/scrambled) so that sensitive data is not seen by users who are not authorised to see it.
- A data services layer solves 2 problems. Firstly, it provides a standard API to consuming systems which abstracts the data access from the way that it is stored in the physical data store. This allows a data architect to change the structure of the underlying data store without affecting the consuming application. Secondly, by providing a distinct set of data services only, it provides for more secure access to the data than allowing a consuming system general access to the physical data store.
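To make the data masking and data services layers a little more concrete, here is a minimal sketch of how the two might work together. Everything in it is hypothetical: the `_CUSTOMER_TABLE` dictionary stands in for the physical data store, and `get_customer`, `mask_card` and the `can_view_card` flag are illustrative names rather than part of any particular product.

```python
from dataclasses import dataclass

# Hypothetical physical store. The column names are deliberately terse to
# show that consumers never need to know them.
_CUSTOMER_TABLE = {
    "C001": {"cust_nm": "Ann Smith", "card_no": "4111111111111111", "cntry": "GB"},
}


def mask_card(card_number):
    """Redact all but the last four characters of a card number."""
    return "*" * (len(card_number) - 4) + card_number[-4:]


@dataclass(frozen=True)
class Customer:
    """The stable shape that consuming systems see."""
    customer_id: str
    name: str
    card_number: str
    country: str


def get_customer(customer_id, can_view_card=False):
    """Data service: the only route a consumer has to customer data."""
    row = _CUSTOMER_TABLE[customer_id]
    # Masking is applied here unless the caller is authorised to see the value.
    card = row["card_no"] if can_view_card else mask_card(row["card_no"])
    return Customer(customer_id, row["cust_nm"], card, row["cntry"])


masked = get_customer("C001")
unmasked = get_customer("C001", can_view_card=True)
print(masked.card_number)
print(unmasked.card_number)
```

Because consumers only ever call `get_customer`, the columns in the underlying store could be renamed or moved without any consuming application changing, and an unauthorised caller never receives the unmasked value.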
In my next article, I will discuss where a Big Data solution might fit into the overall information management system architecture.