Creating a Business Glossary

Why do you need a business glossary?

Most companies tend to buy systems that meet particular functional needs, e.g. sales, marketing, HR and finance. Users of these systems then learn the terminology peculiar to that particular system. Problems arise when there is a need to report on data stored across multiple systems: there may be many systems from which you wish to pull information, each with its own dialect.

Identifiable business benefits include:-

  • Clear reporting – field labels in reports are typically abbreviated, and the same term can appear under a multitude of labels. By linking a field label to a business glossary term, it becomes clear what the report is telling you.
  • A clear understanding of each other’s terminology allows different functional areas to work together more effectively – it reduces the ‘Tower of Babel’ problem.
  • It helps to identify subject matter experts within the business – the authors of the terms are likely to be those with the best knowledge as to how systems and reports work.
  • It allows information about business rules, e.g. conditional logic, KPIs and format enforcement, to be maintained centrally rather than hidden within systems.
  • It removes system-specific terminology, which becomes problematic when the organisation needs to change systems.

Is a business glossary simply a dictionary?

A business glossary is more like a thesaurus than a dictionary. A dictionary provides multiple meanings for the same word, e.g. volume may refer to the loudness of a sound or to the capacity of a liquid or space, whereas a thesaurus provides multiple words with the same meaning, e.g. volume, bulk and cubic measure are different terms for the capacity of a liquid or space.

Who should create a business glossary?

Bear in mind that people don’t work in isolation but fit into a particular community, within which a language develops that serves a particular purpose. With that in mind, you need to identify those communities within your organisation. Typically, they’ll cluster around specific functional areas, e.g. marketing, sales, HR and finance, and different geographies. Within each community, there will be people who already face the task of understanding terms across multiple systems, e.g. those involved in management reporting. These will be your natural author community.

What features should your business glossary have?

The following features will lend themselves to providing a glossary which is practically useful to your organisation:-

  • Cost effective – It’s really important that the cost of the business glossary doesn’t outweigh the benefits that you expect to get from it.
  • Usable – This is again a top priority. If you want subject matter experts to compile a business glossary for your organisation, then you’ve got to choose a product that they actually want to use. It shouldn’t feel like a chore to add information to the glossary.
  • User & group roles – you want to allow as many people as possible to add candidate terms & definitions to the glossary, but you still need the concept of an editor to approve those terms, to ensure quality.
  • Multiple users – many users need to be able to search for, add and approve content of the glossary
  • Auditing – you’ll need to ensure that any changes to content are audited. Since you’ll have links to your business glossary from systems and reports, you don’t want the content to be changed in an adverse way without knowing who did it.
  • Workflow mechanism – you’ll need an approval process for the content
  • Terms & definitions – You need to be able to create terms and definitions to produce rich content.
  • Related terms – need to be able to relate terms produced by different functional areas to each other, so that the aim of improving understanding can be achieved.
  • Taxonomies (classification schemes) – This covers two areas: you need to be able to group your terms into structures so that you can drill up and down the structure in a logical way, and it’s also useful for the business glossary to hold information about how to classify data in your systems. For example, the Dewey Decimal Classification system is a well-known way of labelling books so that a book can be easily found in a library.
  • Code sets (lists of values) – The business glossary can hold pre-defined lists for certain data items, e.g. ISO provides lists of countries & currencies. By using pre-defined code sets, it becomes easier to integrate your data with other systems’ data, both externally & internally.
  • Import/export metadata – although having an online, searchable business glossary is a great step forward for a business, there is typically a need to export the content so that it can be processed elsewhere, and to import content from industry-standard material, glossaries provided by systems, or glossaries previously produced within the organisation.
  • Search engine – It’s often difficult to know exactly the term that you’re searching for, so having a Google-style search capability makes it easier to discover relevant content.
  • Business rules – Conditional logic, key performance indicators/measures and validation rules (e.g. an email should be of the format <name>@<company url>) are important for ensuring that data is standardised and consistent, which improves data quality and integration.
  • Supportable – The IT team will need to be able to back up the glossary and support the system without extensive additional training being required.
  • Hyperlink from a reporting tool/system – A key benefit is to be able to provide hyperlinks from system or report fields to terms within the business glossary. This eliminates the ambiguity of meaning that uncontrolled, abbreviated field labelling typically causes.
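To make the term, related-term, code-set and business-rule features concrete, here is a minimal sketch of what a single glossary entry might look like in code. All field names and the example term are illustrative assumptions, not taken from any particular glossary product:

```python
from dataclasses import dataclass, field
import re

@dataclass
class GlossaryTerm:
    name: str                      # e.g. "Work Email"
    definition: str                # rich-text definition
    status: str = "candidate"      # workflow state: candidate -> approved
    related_terms: list = field(default_factory=list)  # links to terms in other functional areas
    code_set: list = field(default_factory=list)       # pre-defined list of values, e.g. ISO codes
    validation_pattern: str = ""   # business rule, e.g. an email format

    def validate(self, value: str) -> bool:
        """Check a data value against this term's format rule (always True if no rule)."""
        if not self.validation_pattern:
            return True
        return bool(re.fullmatch(self.validation_pattern, value))

# An illustrative term carrying an email-format business rule
email = GlossaryTerm(
    name="Work Email",
    definition="The email address an employee uses for company business.",
    validation_pattern=r"[\w.+-]+@[\w-]+\.[\w.]+",
)
print(email.validate("jane.doe@example.com"))  # True
print(email.validate("not-an-email"))          # False
```

A real product would add auditing and an approval workflow around records like this; the sketch only shows how terms, rules and code sets hang together.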

Other features that you might wish to consider include:-

  • Linking to fields in your data modeling tool. It can be useful to link data elements in your data model to business terms when those data elements will be made visible to business users.
  • Linking to validation rules from a data quality/master data management tool – this means that you don’t need to replicate, within data quality systems, validation rules that have already been defined in your business glossary.
  • Available in the cloud – The business glossary needs to be shareable by multiple users. Having it on a network drive achieves that, but by putting it on a public or private cloud, the business glossary will be available from multiple devices, and from home and work, providing the ability to research terms at will.

Where can I get a business glossary?

The simplest business glossary tool could be created using Microsoft Access with export capabilities to Excel. Bear in mind, however, that such an approach would meet few of the features listed above without extensive development. Cost is more than just the initial purchase cost; it also includes the development and support (including maintenance & training) costs.

Business Glossary tools which have more of the necessary features (thereby reducing development costs) include:-


London Big Data Events – 29th June to 6th July 2015

Jun 28th 6:30pm
The career roadmap of a SQL professional and Betfair’s journey to Big Data Cloud
O2 Workshop
229 Tottenham Court Road, W1T 7QG, London

Jun 28th 6:30pm
Fraud Detection using Neo4j
Neo Technology
5 – 11 Lavington Street, SE1 0NZ, London

Jul 1st 6:30pm
8th Spark Meetup: Share and analyse genomic data at scale with Spark […]
Campus, powered by Google
4-5 Bonhill Street, London

Information Management Roles & Responsibilities

As Information Management has become more complex over the years, associated roles & responsibilities have been created to fill the gaps. The job titles may vary from organisation to organisation, but the generic roles should stay the same. The list below is confined to the development of information, and excludes change management, general management and test roles. The article’s purpose is to assist people working with information management specialists in understanding the purpose of each person’s role.

Data Scientist

  • Creates algorithms using the empirical method to detect patterns e.g. clustering and correlation within data and attributes a meaning to those patterns

Data Engineer

  • Ensures that data is of sufficient quality that a data scientist can derive accurate results from the data

Data Analyst

  • Examines the data held in data sources via data profiling (gathering information about structure, content & meaning)

Database Administrator

  • Creates a database
  • Creates the database objects (tables, views, indexes, synonyms etc.)
  • Tunes any performance issues by changing database parameters, gathering statistics and tuning the SQL
  • Creates database users and groups and grants access
  • Deploys code to a particular environment

Database Developer

  • Creates code (user functions and stored procedures) which runs within a particular database

Data Modeler

Data Architect

  • Defines naming standards and data domains for a particular database project
  • Directs a team that creates the ETL mapping scripts which moves data from a source to a target
  • Directs the team of data analysts who profile the data
  • Decides the philosophy and methodology to be adopted for the data project
  • Designs the logical and physical data models

Extract, Transform & Load (ETL) Lead

  • Manages a team of ETL developers who code the ETL mapping scripts
  • Creates control flows to ensure that ETL scripts run in a set order and run either in parallel or sequentially

Business Intelligence Architect

  • Decides which business intelligence (reporting and analytics) tools to use on a particular information management project
  • Manages the team that implements the user and group security policies to restrict access to particular reports
  • Manages the team of report developers
  • Creates a schema within a business intelligence tool which converts data sourced from a database into information with a structure, content and meaning understandable to the business, which can then be used in a number of reports, applications and data extracts
  • Links attributes and measures to the business glossary

Solution Architect

Draws together the work of the following specialist architects so that a complete solution can be delivered:-

  1. Business architect – defines business capabilities and the target operating model
  2. Enterprise data/information architect – defines enterprise-wide data standards and policies, maintains reference data models, classifies data in terms relevant to the business
  3. Application/systems architect – creates the basic building blocks (system components) and how they interface with each other, so that an application or system can be created.
  4. Technology/infrastructure architect – selects the hardware and software which should be used to meet a particular function across the enterprise. Defines production and non-production environments
  5. Security architect – ensures that there are appropriate policies in place to prevent removal or disclosure of sensitive data. Classifies data in terms of sensitivity. Ensures that security policies are implemented correctly. Manages a team that tests that information security cannot be breached.
  • Works with the delivery team, ensuring that all aspects of the solution architecture are covered prior to going into production.

Enterprise Data Architect

  • Responsible for defining naming standards, policies & principles to be used by data projects across the enterprise in order to ensure the data projects conform to a common vision
  • Maintains an enterprise-wide conceptual data model showing the range of data grouped by business capability that exists within the enterprise

Information Architect

Data Governance Manager

  • Defines a business process for approving users’ access to sensitive data held within databases
  • Runs the data governance council, which brings together representatives of the business to agree on:-
  1. The names and definitions of key business elements (important attributes and measures used in reporting)
  2. Creation of hierarchies in order to group data appropriately
  3. Data quality rules and what to do when a data analyst has identified problems with data quality

Data warehousing philosophies & methodologies

Data warehousing refers to the need to integrate related data from your data sources, capture change history, group entities by subject area and then retain the result in a physical data store.


Two key philosophies have emerged. The original philosophy, championed by Ralph Kimball & Bill Inmon, adheres to the definition of data warehousing above and delivers a data warehouse that contains a ‘single version of the truth’, i.e. any systems which use the data in the data warehouse will produce consistent results.

A more recent philosophy, championed by Dan Linstedt, holds that there is no such thing as a ‘single version of the truth’, since business stakeholders have different perspectives, i.e. multiple versions of the truth. As such, this philosophy believes that all that can be achieved is a ‘single version of the facts’, which means that we should collect data from all of the data sources and store it in unintegrated form in an enhanced enterprise data hub. A data mart (within a data virtualisation layer) can then be produced for each business function, mirroring their perspective on what the business truth is.


In order to provide data in a form suitable for reporting, the following stages need to take place:-

  • Acquire – Data needs to be captured from the data sources
  • Clean – Data needs to be standardised and duplicates removed.
  • Integrate – Data from multiple data sources needs to be integrated
  • Add business logic – Raw data captured from source needs to have additional business logic applied. For example, each business function will have its own ways of measuring its performance, so we need to create calculated measures so that the business function’s performance can be reported on. In a similar fashion, some of the attributes familiar to the business may need to be derived from the raw attributes captured from the data source, and data hierarchies added.
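As a rough sketch, the four stages above could be strung together like this. The data sources, field names and the margin measure are all invented for illustration; real pipelines would use an ETL tool rather than plain Python:

```python
def acquire(sources):
    """Acquire: capture raw records from each data source."""
    return [row for source in sources for row in source]

def clean(rows):
    """Clean: standardise values and drop exact duplicates."""
    seen, out = set(), []
    for row in rows:
        row = {**row, "country": row["country"].strip().upper()}  # standardise
        key = tuple(sorted(row.items()))
        if key not in seen:                                       # de-duplicate
            seen.add(key)
            out.append(row)
    return out

def integrate(rows):
    """Integrate: combine records from all sources that share a key."""
    merged = {}
    for row in rows:
        merged.setdefault(row["customer_id"], {}).update(row)
    return list(merged.values())

def add_business_logic(rows):
    """Add business logic: derive a calculated measure per record."""
    for row in rows:
        row["margin"] = row["revenue"] - row["cost"]  # illustrative measure
    return rows

# Two invented source systems holding fragments of the same customer
crm = [{"customer_id": 1, "country": " uk ", "revenue": 100}]
erp = [{"customer_id": 1, "country": "UK", "cost": 60}]

ready = add_business_logic(integrate(clean(acquire([crm, erp]))))
print(ready)  # one merged record per customer, with the derived margin
```

The point is the ordering: cleansing happens before integration, and business logic is added last, which matches the Inmon/Kimball view described below.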

Bill Inmon

Bill Inmon's System Architecture

Bill Inmon introduced the original concept of a data warehouse. Prior to this, reporting systems would often run against a copy of the production database or against the production database itself. Bill identified that we need a new data store (a data warehouse) which is modeled differently so that reports can be produced effectively.

Bill’s proposal was that all of the data from all of the data sources should be integrated into an enterprise data warehouse, and then data marts (within a data virtualisation layer) should be created from the enterprise data warehouse for each business function.

Bill believes that data should either be cleansed at source, or cleansed after it’s been acquired and before it’s integrated into the data warehouse.


Advantages:-

  • Data modeling is in 3rd Normal Form in Bill Inmon’s data warehouse, so storage is efficient.
  • Since data marts are created after the enterprise data warehouse stage, they can be created in virtual form (using data virtualisation technologies like Composite, or database views) or physically stored. This is a notable advantage over Ralph Kimball’s methodology.


Disadvantages:-

  • Modeling the entire enterprise works fine if you start to produce an enterprise data warehouse when your company is still very small. However, for a large organisation, attempting to model the complete enterprise leads to data warehouse projects running for years without delivering anything.

Ralph Kimball

Ralph Kimball System Architecture

Ralph Kimball came up with the dimensional modeling (aka star schema) technique, which is very easy to grasp from a business user’s perspective. A data mart modeled as a star schema consists of a central fact table containing a business function’s calculated measures, sliced and diced by the dimensions surrounding it. Ralph Kimball proposed that an enterprise data warehouse could be created by building data marts for each business functional area, but ensuring that each dimension is conformed (i.e. made identical) so that it can be used by all data marts that make use of it. The methodology involves producing a data mart for a single business function first, then adding a 2nd data mart for another business function, gradually building up a constellation of data marts which represents the entire enterprise.

Like Bill, Ralph believes that data should either be cleansed at source, or cleansed after it’s been acquired and before it’s integrated into the data warehouse.


Advantages:-

  • Since the data warehouse consists of a set of data marts, there is no need for an additional data virtualisation layer (which Bill Inmon and Dan Linstedt acknowledge is required to meet the needs of a business function).
  • Since the methodology allows a single data mart to be created for a single business function initially, for large organisations Kimball projects deliver results to the business much more quickly than Bill Inmon’s methodology, which requires that the entire organisation be modeled first.


Disadvantages:-

  • Since the data warehouse is built incrementally and dimensions need to be conformed so that calculated measures can be joined from different star schemas, dimensions need to be re-modeled as each new business function is analysed. The remodeling effort can be expensive.
  • Since the business logic is baked into the data marts (within the enterprise data warehouse), the methodology is inflexible to changes in business logic.

Dan Linstedt

Dan Linstedt System Architecture

Dan Linstedt’s methodology is based on his concept of an enterprise data warehouse effectively being an enhanced enterprise data hub with data lineage and change history added, and data retained for longer than would normally be the case for an enterprise data hub. Integration of data from the data sources takes place when a data mart (within the data virtualisation layer) is created for each business function.

Dan believes that all data should be cleansed at source.


Advantages:-

  • Since data integration doesn’t take place until the data virtualisation layer, it is comparatively quick to create what Dan calls a data warehouse, which I’d argue is in fact an enhanced enterprise data hub.
  • Dan’s enhanced enterprise data hub stores all of the data captured from the data sources and tracks data lineage and change history, so it captures an unadulterated history of what happened at source. This differs from Bill’s and Ralph’s methodologies, which integrate data from data sources, thus altering the original form of the data, prior to persisting it in a data warehouse.


Disadvantages:-

  • Since data integration takes place in each business function’s data mart, and the dimensions within each data mart are not conformed, it is not possible to produce consistent reports across the entire organisation.
  • If data cannot be cleansed at source, Dan’s methodology has nowhere else where cleansing can take place. This is a significant disadvantage, since additional storage is required for duplicated records, and poor quality data will distort reported results.

Digital Marketing is watching you!

I’ve been at a number of London technology events this week, getting a handle on the latest trends related to Big Data, the most exciting of which is Digital Marketing. This scene has exploded over the last 10 years. Nowadays, the marketeers are capturing, amongst other things:-

1. Where you’re physically located – anybody with a smartphone can be tracked via triangulation of mobile network masts or Wi-Fi routers within cities. Your IP address can be reverse-looked-up to provide your actual address information.

2. Which device you’re using – tablet, PC, mobile phone etc.

3. Which language you communicate in

4. Which items you’ve liked on Facebook

5. Who your connections are, and what they like.

6. How long a user has stayed on a particular web page and which part of the web page they spent longest on.

7. What you’ve posted on social media like Twitter

8. Which advertisements you clicked on. What style of ads appeal to you. What keywords, what colours, what content.

9. Where you go for your fitness runs – anybody who’s got a Fitbit or similar device often broadcasts this information to the world.

10. What search engine you use.

11. Where you shop, what you bought and what payment method you use.

These kinds of data are now being accumulated on data management platforms, allowing marketeers to provide fine-grained, personalised content to you. This benefits advertisers in that better-targeted ads will result in more sales, but better-targeted ads will also be less expensive, since keywords which attract a broad audience are priced much higher than those that only a few people would respond to. For example, “Hotel London” currently costs £7 per click on Google AdWords. It’s been commented that it would be cheaper to offer to buy somebody a meal at McDonald’s than to have them click on such an ad at these kinds of rates.

With so much information to hand, it is becoming physically impossible for marketeers to analyse the data manually within a short enough timescale, so increasingly data scientists are writing algorithms (code to fulfil a particular use case) to crunch the numbers and either recommend or sometimes deliver changes to ad campaigns, in order to find the sweet spot that maximises clickthroughs. Data scientists are encouraged to compete with each other to discover insights into data on sites like

In future, even more data will be collected from you. With the Internet of Things, information about what you actually ate at home can be collected, for example, as your fridge will know what’s been put in and what’s been taken out.

In recent years, Digital Marketing has progressed from Search Engine Optimisation (SEO), through Pay Per Click (PPC), to areas such as Social Media Optimisation (SMO) and Conversion Rate Optimisation, and will continue to fragment into specialist areas providing deep insights into consumer behaviour.

The latest ideas are to avoid going for quick sales and instead spend time raising brand awareness by leveraging influencers to promote the brand. Bloggers, YouTube content providers, top tweeters etc. are invited to free events by advertisers in anticipation that a certain percentage of them might promote the brand indirectly.

Although this may all seem very ‘Big Brother’, there are many positives. Ads which actually provide something that you want will be less annoying. Everybody has the ability to become an influencer in their area of interest, so it should become easier to earn a living doing something that you actually enjoy. In future, you could be advised that you’re running low on certain food supplies and a re-ordering facility could be set up. Your health could also be monitored, your genome sequenced and appropriate medicines specific to your unique characteristics produced. These kinds of things will free up your time to do more interesting things and improve your wellbeing.

Is there much that you can do to stay ‘under the radar’? If you wanted to, yes: you could refuse to allow tracking cookies, or refuse to supply any of your details if they were to be passed on to a 3rd party. You could decide not to use a smartphone. You could avoid ‘liking’ any product on Facebook or communicating on social media. Effectively, you could live as people did quite happily in the 20th century. The problem is that the world in the 21st century will be geared up to you supplying this information. Much as you would find it hard to obtain credit without a credit history, without supplying information you would be outside all of the networks which could be used to find work, hobbies, events, contacts, friends, health services etc. For all the negatives that a lack of privacy entails, there will be many more positives which should make your life easier.

Further information

To understand how digital marketeers go about tracking users, why not take a look at the How to run an online advertising campaign article.

Cleaning your data

From the Profiling your data article, you should have a clear idea of the structure, content and meaning of the data that you want to use. However, knowing that you have data quality problems means you actually need to do something about them. Fixing data quality issues is known as data cleansing.
Note: Other terms for this include data scrubbing, data sanitisation, data munging or data wrangling.

Whichever term you use, it involves three steps:- standardising your data, matching records and merging records.

Standardising your data

This refers to the need to ensure that all of your data conforms to your business’s validation rules, patterns and formats.

For more complex problems such as address standardisation, it is more cost-effective to use the services of organisations that specialise in capturing organisational or individual address information. Dun & Bradstreet are well known for organisational address data, and Experian Data Quality can provide consumer data cleansing services with good coverage of US and UK consumers.
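As a minimal sketch of standardisation, here is a lookup against a tiny, hand-written subset of ISO 3166 country codes. The mapping below is illustrative only; a real project would use the full ISO list or a vendor service such as those mentioned above:

```python
# Map common variants to ISO 3166 alpha-2 codes (tiny illustrative subset)
ISO_COUNTRIES = {
    "UNITED KINGDOM": "GB", "UK": "GB", "GREAT BRITAIN": "GB",
    "UNITED STATES": "US", "USA": "US", "U.S.A.": "US",
}

def standardise_country(raw: str) -> str:
    """Return the ISO code for a raw country value, or flag it for manual review."""
    code = ISO_COUNTRIES.get(raw.strip().upper())
    return code if code else f"UNMATCHED({raw.strip()})"

for value in [" uk ", "United Kingdom", "USA", "England"]:
    print(value, "->", standardise_country(value))
```

Flagging unmatched values rather than guessing keeps the exercise auditable: a data analyst can review the UNMATCHED cases and extend the code set.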

Matching your data

Once your data is in a standard form, you will need to identify the primary key (unique set of identifying fields) in your records. You can use the following features from the data profiling section to establish a set of attributes which are suitable as a primary key:-

1. Distinct count (which will tell you how many distinct occurrences of those attributes occur in a set of records)

2. Duplicate count (which will tell you how many of those distinct occurrences occur more than once).

3. Unique count (which will tell you how many of those distinct occurrences occur just once).

The closer your unique count is to your distinct count, the better your identification of a primary key is. If you are selecting textual fields as candidates for primary keys, you will also need to consider whether exact matching or fuzzy matching is appropriate. An example of fuzzy matching is where names that sound similar are treated as the same, e.g. Dereck and Derek would match.
Note: Unless you’re exceptionally lucky, you’re unlikely to get 100% uniqueness as you’ll inevitably have primary keys duplicated in some records. For example, if you have the same customer entered either in the same system or in 2 different systems, they are likely to have duplicate primary keys.
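The three counts and fuzzy matching can be sketched with Python’s standard library. Here difflib’s similarity ratio stands in for the phonetic matching a real data quality tool would offer, and the 0.8 threshold is an arbitrary illustrative choice:

```python
from collections import Counter
from difflib import SequenceMatcher

names = ["Derek", "Dereck", "Alice", "Bob", "Bob"]

counts = Counter(names)
distinct = len(counts)                                  # distinct count: 4
duplicate = sum(1 for c in counts.values() if c > 1)    # duplicate count: 1 ("Bob")
unique = sum(1 for c in counts.values() if c == 1)      # unique count: 3
print(distinct, duplicate, unique)

def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two values as the same key if they are similar enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(fuzzy_match("Derek", "Dereck"))  # True  -- likely the same person
print(fuzzy_match("Derek", "Alice"))   # False
```

With unique = 3 out of distinct = 4, "name" alone is a poor primary key here; in practice you would combine several attributes until unique approaches distinct.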

Merging your data

Finally, now that you know what your primary keys are, you need to produce a composite record which pulls in the best quality data from each of the records which share the same primary key. You can use information such as the Null count and Maximum length statistics from the data profiling section to determine which attribute in which record is likely to have the best quality of data. For example, if you have a middle name field, the data source that has more entries for this field is likely to provide better quality data; and if that field contains the full name rather than just an initial, this may also be a good indication of quality.
Note: When you merge records that share a duplicate primary key, you should keep the original records and create a pointer to your new golden source of data. This is important in case there are any queries about how reliable your data cleansing exercise was.
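As a closing sketch, here is a naive survivorship rule along the lines described above: for each field, keep the longest non-null value across the duplicate records, and keep pointers from the originals to the golden record. The field names and the "longest value wins" rule are illustrative assumptions, not a definitive merge policy:

```python
def merge_records(records):
    """Build a golden record: for each field, keep the longest non-null value."""
    golden = {}
    for record in records:
        for name, value in record.items():
            current = golden.get(name) or ""
            if value and len(str(value)) > len(str(current)):
                golden[name] = value
    return golden

# Two records sharing the same primary key (customer_id = 42)
duplicates = [
    {"customer_id": 42, "middle_name": "J", "city": "London"},
    {"customer_id": 42, "middle_name": "James", "city": None},
]

golden = merge_records(duplicates)
print(golden)  # {'customer_id': 42, 'middle_name': 'James', 'city': 'London'}

# Keep the originals and point them at the golden record for auditability
audit = [{"original": rec, "golden_key": golden["customer_id"]} for rec in duplicates]
```

Keeping the `audit` list alongside the golden record follows the note above: if the cleansing exercise is ever questioned, the original records are still there to check against.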