In order for data to have any value, we have to ensure that it’s of good quality and standardised.
The process of doing this is known as data governance.
Why do we need this?
Imagine you found a startup and build a simple shopping cart website, where a customer can purchase one or more books and then pay for the basket. Your data architect creates a data model that meets your needs, and this is physically implemented in your database. At this point you have a single source of customer data.
You then create a customer support function, and your web developers haven’t got time to create a database for customer support, so they hire in another database team, who create a 2nd data model consisting of a customer and support calls. At this point, you have 2 separate data models. Both teams are happy with their work and able to produce reports demonstrating how good a job they’ve been doing.
The CIO then gets rung up by a customer who tells him that he’s purchased 150 books from the company over the last 6 months, and that every time he rings customer support, it takes them ages to find him on the system and he’s going to stop placing orders if this continues. The CIO asks the data analyst to investigate what’s going on. He looks at both systems and sees that customer attributes and details don’t tally up.
During the analysis exercise, the data analyst also informs the CIO that data security hasn’t been set up by the team who built the ecommerce website, and that customer details could be obtained relatively easily if a hacker gets into the system. Since data access isn’t being audited, there’s currently no way of ascertaining whether data has been hacked or not.
So how would you go about resolving this?
Pictorially, these are the areas of the modern information management systems architecture that ensure good data governance:-
The steps are:-
Analyse the data in your source systems
- Document the names and definitions of the attributes as they exist on both source systems
- Have the data analyst profile the data in both systems, i.e. investigate and report on the state of the data in detail. This means looking for unexpected values, for example attributes that are empty when they should be filled, or values that fall outside the expected set.
- Write down the data quality issues that the profiling exercise has uncovered and establish workarounds, e.g. create automated cleansing rules, establish manual fixing procedures or simply note down that there is a known issue.
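As an illustration of what profiling a single attribute might involve, here is a minimal Python sketch. The function name `profile_column`, the sample rows and the list of allowed titles are all hypothetical and only serve to show the idea; a real profiling exercise would normally run against the source databases themselves.

```python
from collections import Counter


def profile_column(rows, column, allowed_values=None):
    """Summarise one attribute: row count, empty count, distinct and unexpected values."""
    values = [row.get(column) for row in rows]
    # Treat None and blank strings as empty.
    non_empty = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(non_empty)
    unexpected = {}
    if allowed_values is not None:
        unexpected = {v: n for v, n in counts.items() if v not in allowed_values}
    return {
        "column": column,
        "rows": len(values),
        "empty": len(values) - len(non_empty),
        "distinct": len(counts),
        "unexpected": unexpected,
    }


# Hypothetical sample rows standing in for one source system.
customers = [
    {"customer_id": 1, "email": "a@example.com", "title": "Mr"},
    {"customer_id": 2, "email": "", "title": "Mrs"},
    {"customer_id": 3, "email": None, "title": "Mx."},
]

# Assumed list of valid titles, for illustration only.
report = profile_column(customers, "title", allowed_values={"Mr", "Mrs", "Ms", "Dr"})
print(report)
```

Running the same function over the equivalent attribute in both systems gives the analyst a like-for-like view of where the two sources disagree.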
Analyse the data in your target reports and applications
- From existing reports and applications that use the 2 source systems, establish a complete list of fields contained within them, together with business definitions.
- Trace the fields back to their source
- Work out which fields are synonyms of each other and document this
Create a governed business data slide deck
- Group attributes into subject areas, e.g. customer, book, payment, support calls (based on the business capabilities that were identified in the how to develop an information strategy article)
- Create slide decks for each subject area showing in business terms the data items that are in the subject area, any hierarchies and any known data quality issues. Highlight use cases for the data to make it gel with the business.
Engage with senior business stakeholders and form a data governance council
- Firstly, you need to identify potential candidates to be data stewards. In a small company, this will be relatively straightforward. For example, product might sit with marketing, customer could sit with legal or service operations, billing would sit with finance, and performance reviews and employees with HR. For a larger organisation, things can be much trickier, especially if you’re dealing with a global organisation where you can’t easily contact potential candidates. The best way that I’ve found for establishing candidates is to talk to Architecture or Change Management, and find out from them who the main programme sponsors have been for large projects in particular areas. If you have a business architect in your organisation, they should be your first port of call. Note: There will always be some business data that has to have IT as an owner. For example, calendar dates don’t have a natural owner (although financial calendar dates would).
- Appoint data stewards (senior business stakeholders responsible for each subject area)
Form a Data Steward Forum
- Agree draft definitions for each of these attributes (descriptive fields) and measures (fields that are summed) and get approval from the data steward forum.
Create a reference data model
- Create a reference data model that correctly models the relationships between each of the items in the subject areas
Create a business glossary
- Populate a business glossary with attributes and measures (aka Key Performance Indicators (KPIs)). For each one, define the name and description, classify the content and provide synonyms where business units require different terminology to be used.
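To make the idea of a glossary entry with synonyms concrete, here is a small Python sketch. The class names `GlossaryEntry` and `BusinessGlossary` and the sample terms are hypothetical; commercial and open source glossary tools will have their own models, so treat this only as a picture of the concept.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryEntry:
    name: str
    description: str
    kind: str          # "attribute" (descriptive) or "measure" (summed)
    subject_area: str
    synonyms: set = field(default_factory=set)


class BusinessGlossary:
    """Resolves an agreed name or any of its synonyms to one definition."""

    def __init__(self):
        self._by_term = {}

    def add(self, entry):
        for term in {entry.name, *entry.synonyms}:
            key = term.strip().lower()
            # Refuse to let two different entries claim the same term.
            if key in self._by_term and self._by_term[key] is not entry:
                raise ValueError(f"term already defined: {term}")
            self._by_term[key] = entry

    def lookup(self, term):
        return self._by_term.get(term.strip().lower())


glossary = BusinessGlossary()
glossary.add(GlossaryEntry(
    name="Customer Name",
    description="Full name of the person who placed the order.",
    kind="attribute",
    subject_area="customer",
    synonyms={"Client Name"},  # hypothetical term used by customer support
))

print(glossary.lookup("client name").name)
```

Because both terms resolve to the same entry, the ecommerce and support teams can keep their own labels while sharing one definition.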
Create a master data store and master data management system
- Create a master data store and build it based on the reference data model
- Identify which source data is the golden copy of each data item and build workflows to push this data into the master data store.
- Build a master data management application that allows the master data to be manually fixed.
- Create automated data cleansing to fix known problems that occur at source.
- Create alerts whenever dirty data needs to be fixed.
- Create mapping tables to map the records in the source system to records in the master data store.
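The mapping tables mentioned in the last step can be pictured as a cross-reference from each source record to a master record. The sketch below is a simplified in-memory version; the class `CrossReference`, the system names `webshop` and `support`, and the id format are assumptions made for illustration, and in practice this would be a table in the master data store.

```python
class CrossReference:
    """Maps (source_system, source_id) pairs to a master id."""

    def __init__(self):
        self._to_master = {}
        self._next_id = 1

    def new_master_id(self):
        # Hypothetical id format: "M" followed by a zero-padded sequence number.
        master_id = f"M{self._next_id:06d}"
        self._next_id += 1
        return master_id

    def link(self, source_system, source_id, master_id):
        self._to_master[(source_system, source_id)] = master_id

    def master_id_for(self, source_system, source_id):
        return self._to_master.get((source_system, source_id))

    def sources_for(self, master_id):
        return sorted(k for k, v in self._to_master.items() if v == master_id)


xref = CrossReference()
customer = xref.new_master_id()
# The same person as known to the two source systems (hypothetical ids).
xref.link("webshop", "1042", customer)
xref.link("support", "C-77", customer)

print(xref.master_id_for("support", "C-77"))
print(xref.sources_for(customer))
```

With a mapping like this in place, a support agent who knows the master id can find the customer’s record in either system.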
Include links to your master data in all existing and new reports and applications
- Update your existing reports to include links not only to the business data you have in your source systems, but also to the business data held in the master data store.
- Ideally, hyperlink fields in the reports to entries in your business glossary, so that it is clear which attribute or measure a particular field label refers to.
Set up Audit & Reconciliation Controls
Create audit (calculating checksum at source) and reconciliation (calculating checksum at target and reconciling with source) controls to ensure no data is lost between each stage of data integration.
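One possible way to implement such a control is to compute a row count and a checksum over the same columns at each stage and compare them. The sketch below uses SHA-256 from Python’s standard `hashlib` module; the function names and the choice of columns are assumptions, and many integration tools provide their own equivalents.

```python
import hashlib


def row_fingerprint(row, columns):
    """Hash the chosen columns of one row in a fixed column order."""
    joined = "\x1f".join("" if row.get(c) is None else str(row.get(c)) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def batch_control(rows, columns):
    """Audit control: row count plus a checksum that ignores row order."""
    fingerprints = sorted(row_fingerprint(r, columns) for r in rows)
    digest = hashlib.sha256("".join(fingerprints).encode("ascii")).hexdigest()
    return {"row_count": len(fingerprints), "checksum": digest}


def reconcile(source_control, target_control):
    """Reconciliation control: True only if count and checksum both match."""
    return source_control == target_control


# Hypothetical batch as read at source and as loaded at target.
columns = ["order_id", "customer_id", "amount"]
source_rows = [
    {"order_id": 1, "customer_id": 7, "amount": "12.50"},
    {"order_id": 2, "customer_id": 7, "amount": "8.00"},
]
target_rows = list(reversed(source_rows))  # same data, different order

print(reconcile(batch_control(source_rows, columns), batch_control(target_rows, columns)))
```

Sorting the per-row fingerprints before hashing means a change in row order does not trigger a false alarm, while a dropped or altered row does.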
Set up user and data security
You need to prevent unauthorised access to systems. This can be done either with a username and password, ideally via a centralised user authentication system, or more robustly by adding public/private key authentication, so that only holders of the private key can gain access. Any masking rules that you wish to create to dynamically scramble sensitive data for certain groups of users under defined circumstances should be created in the data masking layer.
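As a rough illustration of a dynamic masking rule, the Python sketch below scrambles an email address unless the user belongs to a group that is allowed to see it. The rule, the group name `customer_support_lead` and the function names are hypothetical; real masking layers are usually configured in the database or in a dedicated product rather than in application code.

```python
def mask_email(value):
    """Keep the first character and the domain, star out the rest."""
    local, sep, domain = value.partition("@")
    if not sep:
        # Not a recognisable email address: hide everything.
        return "*" * len(value)
    return local[:1] + "*" * (len(local) - 1) + "@" + domain


# Hypothetical configuration: which fields are masked, and who is exempt.
MASKING_RULES = {"email": mask_email}
UNMASKED_GROUPS = {"customer_support_lead"}


def apply_masking(row, user_groups):
    """Return a copy of the row, masked unless the user is in an exempt group."""
    if UNMASKED_GROUPS & set(user_groups):
        return dict(row)
    return {
        key: MASKING_RULES[key](value) if key in MASKING_RULES and value else value
        for key, value in row.items()
    }


row = {"name": "Jane", "email": "jane@example.com"}
print(apply_masking(row, ["analyst"]))
print(apply_masking(row, ["customer_support_lead"]))
```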
Turn on database auditing so that all access to data is logged, in case an external hacker or inside man overcomes the other security controls.
Establish a business process whereby users can apply for access to data
You’ll need a process where a user requests access to a system (user authentication), and to sets of data and particular reports (user authorisation).
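A very small sketch of the authorisation half of such a process is shown below. It assumes, purely for illustration, that the data steward for a subject area is the person who approves requests for that area; the class names and user names are hypothetical, and a real process would normally run through a ticketing or identity management system.

```python
from dataclasses import dataclass


@dataclass
class AccessRequest:
    user: str
    subject_area: str
    status: str = "pending"


class AccessRegister:
    """Records which users have been granted access to which subject areas."""

    def __init__(self, stewards):
        self._stewards = dict(stewards)  # subject area -> data steward
        self._grants = set()

    def approve(self, request, approver):
        # Assumption: only the steward of the subject area may approve.
        if self._stewards.get(request.subject_area) != approver:
            raise PermissionError(f"{approver} is not the steward for {request.subject_area}")
        request.status = "approved"
        self._grants.add((request.user, request.subject_area))

    def is_authorised(self, user, subject_area):
        return (user, subject_area) in self._grants


register = AccessRegister({"customer": "legal_lead", "payment": "finance_lead"})
request = AccessRequest(user="analyst_1", subject_area="customer")
register.approve(request, approver="legal_lead")

print(register.is_authorised("analyst_1", "customer"))
print(register.is_authorised("analyst_1", "payment"))
```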
What have we achieved?
Once you’ve done this, all reports and applications will be linked to a common view of business data, and your business data will have definitions, so you have a consistent, standardised view of business data. This means that the next time the customer rings up, he can be traced using the unique id assigned to his record in the master data store, details of which can be supplied to the customer beforehand.
Additionally, you’ve put in authentication and authorisation to guard against a security breach, audit & reconciliation controls to ensure that no data is lost, and database auditing to ensure that all access to your data is logged.
Is there an Apache project covering data governance?
Currently, the Apache community has been slow to fill data governance gaps in the Hadoop/Spark ecosystems. However, an incubator project, Apache Atlas, has now been set up to create an open source system which will provide audit, metadata, data security and data life cycle management.