London Big Data Events – 29th June to 6th July 2015

Jun 28th 6:30pm
The career roadmap of a SQL professional and Betfair's journey to Big Data Cloud
O2 Workshop
229 Tottenham Court Road, London W1T 7QG

Jun 28th 6:30pm
Fraud Detection using Neo4j
Neo Technology
5-11 Lavington Street, London SE1 0NZ

Jul 1st 6:30pm
8th Spark Meetup: Share and analyse genomic data at scale with Spark […]
Campus, powered by Google
4-5 Bonhill Street, London


Information Management Roles & Responsibilities

As Information Management has become more complex over the years, associated roles & responsibilities have been created to fill the gaps. The job titles may vary from organisation to organisation, but the generic roles should stay the same. The list below is confined to the development of information, and excludes change management, management and test roles. The article's purpose is to help people working with information management specialists to understand the purpose of each person's role.

Data Scientist

  • Creates algorithms, using empirical methods, to detect patterns within data (e.g. clustering and correlation) and attributes a meaning to those patterns (see the sketch below)
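As an illustration of the kind of pattern detection involved, here is a minimal sketch, assuming scikit-learn and pandas are available; the customer columns and the interpretation of the segments are purely illustrative:

```python
# A minimal sketch of pattern detection via clustering, assuming scikit-learn
# and pandas are installed; the column names are purely illustrative.
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical customer behaviour data
customers = pd.DataFrame({
    "annual_spend":    [120, 150, 3000, 2800, 90, 3100],
    "visits_per_year": [2, 3, 40, 35, 1, 42],
})

# Detect two behavioural segments (clusters) in the data
model = KMeans(n_clusters=2, n_init=10, random_state=0)
customers["segment"] = model.fit_predict(customers)

# Attribute a meaning to each pattern, e.g. "occasional" vs "frequent" shoppers
print(customers.groupby("segment").mean())
```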

Data Engineer

  • Ensures that data is of sufficient quality that a data scientist can derive accurate results from the data

Data Analyst

  • Examines the data held in data sources via data profiling (gathering information about structure, content & meaning)

Database Administrator

  • Creates a database
  • Creates the database objects (tables, views, indexes, synonyms etc.) – see the sketch after this list
  • Tunes any performance issues by changing database parameters, gathering statistics and tuning the SQL
  • Creates database users and groups and grants access
  • Deploys code to a particular environment
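Much of the above is vendor-specific, but a minimal sketch of the routine object creation and statistics gathering, using Python's built-in sqlite3 purely for illustration, might look like this (the table and index names are hypothetical, and user/grant management is omitted because the syntax varies by vendor):

```python
# A minimal sketch of routine DBA tasks (create objects, gather statistics),
# using Python's built-in sqlite3 purely for illustration; table and index
# names are hypothetical, and user/grant management is vendor-specific.
import sqlite3

conn = sqlite3.connect("reporting.db")   # creates the database file if absent
cur = conn.cursor()

# Create a table and an index on the column most queries filter on
cur.execute("""
    CREATE TABLE IF NOT EXISTS customer (
        customer_id INTEGER PRIMARY KEY,
        last_name   TEXT,
        postcode    TEXT
    )
""")
cur.execute("CREATE INDEX IF NOT EXISTS ix_customer_postcode ON customer (postcode)")

# Gather optimiser statistics (the 'gathering statistics' task above)
cur.execute("ANALYZE")

conn.commit()
conn.close()
```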

Database Developer

  • Creates code (user-defined functions and stored procedures) which runs within a particular database

Data Modeler

Data Architect

  • Defines naming standards and data domains for a particular database project
  • Directs a team that creates the ETL mapping scripts which move data from a source to a target
  • Directs the team of data analysts who profile the data
  • Decides the philosophy and methodology to be adopted for the data project
  • Designs the logical and physical data models

Extract, Transform & Load (ETL) Lead

  • Manages a team of ETL developers who code the ETL mapping scripts
  • Creates control flows to ensure that ETL scripts run in a set order, either in parallel or sequentially (a minimal sketch follows this list)
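As an illustration, a minimal control-flow sketch using only Python's standard library might look like the following; the step names are hypothetical:

```python
# A minimal sketch of an ETL control flow: independent 'extract' steps run in
# parallel, then 'transform' and 'load' run sequentially. Names are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def extract_orders():
    print("extracting orders")

def extract_customers():
    print("extracting customers")

def transform():
    print("transforming")

def load():
    print("loading")

# Parallel phase: the two extracts do not depend on each other
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(extract_orders), pool.submit(extract_customers)]
    for future in futures:
        future.result()   # wait for both extracts to finish

# Sequential phase: transform must complete before load starts
transform()
load()
```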

Business Intelligence Architect

  • Decides which business intelligence (reporting and analytics) tools to use on a particular information management project
  • Manages the team that implements the user and group security policies to restrict access to particular reports
  • Manages the team of report developers
  • Creates a schema within a business intelligence tool which converts data sourced from a database into information whose structure, content and meaning are understandable to the business, and which can then be used in a number of reports/applications/data extracts
  • Links attributes and measures to the business glossary

Solution Architect

  • Brings together the following architecture disciplines for a particular solution:
  1. Business architect – defines business capabilities and the target operating model
  2. Enterprise data/information architect – defines enterprise-wide data standards and policies, maintains reference data models, and classifies data in terms relevant to the business
  3. Application/systems architect – defines the basic building blocks (system components) and how they interface with each other, so that an application or system can be created
  4. Technology/infrastructure architect – selects the hardware and software to be used to meet a particular function across the enterprise, and defines the production and non-production environments
  5. Security architect – ensures that there are appropriate policies in place to prevent removal or disclosure of sensitive data, classifies data in terms of sensitivity, ensures that security policies are implemented correctly, and manages a team that tests that information security cannot be breached
  • Works with the delivery team, ensuring that all aspects of the solution architecture are covered prior to going into production

Enterprise Data Architect

  • Responsible for defining naming standards, policies & principles to be used by data projects across the enterprise in order to ensure the data projects conform to a common vision
  • Maintains an enterprise-wide conceptual data model showing the range of data that exists within the enterprise, grouped by business capability

Information Architect

Data Governance Manager

  • Defines a business process for approving users' access to sensitive data held within databases
  • Runs the data governance council, which brings together representatives of the business to agree on:-
  1. The names and definitions of key business elements (important attributes and measures used in reporting)
  2. Creation of hierarchies in order to group data appropriately
  3. Data quality rules and what to do when a data analyst has identified problems with data quality

Data warehousing philosophies & methodologies

Data warehousing refers to the need to integrate related data from your data sources, capture change history, group entities by subject area and then retain it all in a physical data store.

Philosophies

2 key philosophies have emerged. The original philosophy, championed by Ralph Kimball and Bill Inmon, adheres to the definition of data warehousing above and delivers a data warehouse that contains a 'Single version of the truth', i.e. any systems which use the data in the data warehouse will produce consistent results.

A more recent philosophy, championed by Dan Linstedt, is that there is no such thing as a 'Single version of the truth', since business stakeholders have different perspectives, i.e. multiple versions of the truth. As such, this philosophy holds that all that can be achieved is a 'Single version of the Facts', which means that we should collect data from all of the data sources and store it in an unintegrated form in an enhanced enterprise data hub. A data mart (within a data virtualisation layer) can then be produced for each business function which mirrors their perspective on what the business truth is.

Methodologies

In order to provide data in a form suitable for reporting, the following stages need to take place:-

  • Acquire – Data needs to be captured from the data sources
  • Clean – Data needs to be standardised and duplicates removed.
  • Integrate – Data from multiple data sources needs to be integrated
  • Add Business Logic – Raw data captured from source needs to have additional business logic applied. For example, each business function will have its own ways of measuring its performance, so we need to create calculated measures so that the business function's performance can be reported on. In a similar fashion, some of the attributes familiar to the business may need to be derived from the raw attributes captured by the data source, and data hierarchies added (a minimal sketch follows this list).
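As an illustration of the 'add business logic' stage, here is a minimal pandas sketch, with purely hypothetical column names and banding rules, that adds a calculated measure and a derived attribute:

```python
# A minimal sketch of adding business logic to raw source data: a calculated
# measure (gross margin %) and a derived attribute (order size band).
# Column names and banding rules are purely illustrative.
import pandas as pd

raw = pd.DataFrame({
    "revenue": [100.0, 250.0, 40.0],
    "cost":    [60.0, 200.0, 10.0],
})

# Calculated measure used to report on the function's performance
raw["gross_margin_pct"] = (raw["revenue"] - raw["cost"]) / raw["revenue"] * 100

# Derived attribute / simple hierarchy level familiar to the business
raw["order_size_band"] = pd.cut(raw["revenue"],
                                bins=[0, 50, 200, float("inf")],
                                labels=["Small", "Medium", "Large"])
print(raw)
```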

Bill Inmon

[Figure: Bill Inmon's System Architecture]

Bill Inmon introduced the original concept of a data warehouse. Prior to this, reporting systems would often run against a copy of the production database or against the production database itself. Bill identified that we need a new data store (a data warehouse) which is modeled differently so that reports can be produced effectively.

Bill's proposal was that all of the data from all of the data sources should be integrated into an enterprise data warehouse, and then data marts (within a data virtualisation layer) should be created from the enterprise data warehouse for each business function.

Bill believes that data should either be cleansed at source, or cleansed after it has been acquired and before it is integrated into the data warehouse.

Advantages:

  • Data modeling is in 3rd Normal Form in Bill Inmon's data warehouse, so storage is efficient
  • Since data marts are created after the enterprise data warehouse stage, they can be created either in virtual form (using data virtualisation technologies like Composite, or database views) or physically stored. This is a notable advantage over Ralph Kimball's methodology

Disadvantages:

  • Modeling the entire enterprise works fine if you start to produce an enterprise data warehouse while your company is still very small. However, for a large organisation, attempting to model the complete enterprise leads to data warehouse projects running for years without delivering anything.

Ralph Kimball

[Figure: Ralph Kimball's System Architecture]

Ralph Kimball came up with the dimensional modeling (aka star schema) technique, which is very easy to grasp from a business user's perspective. A data mart modeled as a star schema consists of a central fact table containing a business function's calculated measures, sliced and diced by the dimensions surrounding it. Ralph Kimball proposed that an enterprise data warehouse could be created by building data marts for each business functional area, but ensuring that each dimension was conformed (i.e. made identical) so that it could be used by all data marts that make use of it. The methodology involves producing a data mart for a single business function first, then adding a 2nd data mart for another business function, gradually building up a constellation of data marts which represents the entire enterprise (a minimal star schema sketch follows).
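To make the star schema idea concrete, here is a minimal pandas sketch of a fact table sliced and diced by a date dimension; the table and column names are hypothetical, and in a real warehouse the dimension would be conformed (shared) across every data mart that needs it:

```python
# A minimal sketch of a star schema: a sales fact table joined to a date
# dimension and sliced by one of its attributes. Names are hypothetical.
import pandas as pd

dim_date = pd.DataFrame({
    "date_key": [20150601, 20150602],
    "month":    ["2015-06", "2015-06"],
    "weekday":  ["Monday", "Tuesday"],
})

fact_sales = pd.DataFrame({
    "date_key":     [20150601, 20150601, 20150602],
    "sales_amount": [100.0, 150.0, 80.0],
})

# Slice and dice the measure by a dimension attribute
report = (fact_sales.merge(dim_date, on="date_key")
                    .groupby("weekday")["sales_amount"].sum())
print(report)
```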

Like Bill, Ralph believes that data should either be cleansed at source, or cleansed after it has been acquired and before it is integrated into the data warehouse.

Advantages

  • Since the data warehouse consists of a set of data marts, there is no need for an additional data virtualisation layer (which Bill Inmon and Dan Linstedt acknowledge is required to meet the needs of a business function).
  • Since the methodology allows a single data mart to be created for a single business function initially, for large organisations Kimball projects deliver results to the business much more quickly than Bill Inmon's methodology, which requires that the entire organisation be modeled first.

Disadvantages

  • Since a data warehouse is being built in an incremental fashion and dimensions need to be conformed in order that calculated measures can be joined from different star schemas, dimensions need to be re-modeled as each new business function is analysed. The remodeling effort can be expensive.
  • Since the business logic is baked into the data marts (within the enterprise data warehouse), the methodology is inflexible to changes in business logic.

Dan Linstedt

[Figure: Dan Linstedt's System Architecture]

Dan Linstedt’s methodology is based on his concept of an enterprise data warehouse effectively being an enhanced enterprise data hub with data lineage and change history added, and data retained for longer than would normally be the case for an enterprise data hub. Integration of data from the data sources takes place when a data mart (within the data virtualisation layer) is created for each business function.

Dan believes that all data should be cleansed at source.

Advantages

  • Since data integration doesn’t take place until the data virtualisation layer, it is comparatively quick to create what Dan calls a data warehouse, which I’d argue is in fact an enhanced enterprise data hub.
  • Dan's enhanced enterprise data hub stores all of the data captured from the data sources and tracks data lineage and change history, so it captures an unadulterated history of what happened at source. This differs from Bill and Ralph's methodologies, which integrate data from data sources, thus altering the original form of the data prior to persisting it in a data warehouse.

Disadvantages

  • Since data integration takes place in each data mart for each business function, and the dimensions within each data mart are not conformed, it is not possible to produce consistent reports across the entire organisation.
  • If data cannot be cleansed at source, Dan's methodology doesn't provide anywhere else for cleansing to take place. This is a significant disadvantage, since additional storage is required for duplicated records, and poor quality data will distort reported results.

Digital Marketing is watching you!

I’ve been at a number of London technology events this week, getting a handle on the latest trends related to Big Data, the most exciting of which is Digital Marketing. This scene has exploded over the last 10 years. Nowadays, the marketeers are capturing, amongst other things:-

1. Where you're physically located – anybody with a smartphone can be tracked via triangulation of mobile network masts or Wi-Fi routers within cities. Your IP address can be reverse looked up to provide your actual address information.

2. Which device you’re using – tablet, PC, mobile phone etc.

3. Which language you communicate in

4. Which items you've liked on Facebook

5. Who your connections are, and what they like.

6. How long a user has stayed on a particular web page and which part of the web page they spent longest on.

7. What you've posted on social media like Twitter

8. Which advertisements you clicked on. What style of ads appeal to you. What keywords, what colours, what content.

9. Where you go for your fitness runs – anybody who's got a Fitbit or similar device often broadcasts this information to the world.

10. What search engine you use.

11. Where you shop, what you bought and what payment method you use.

These kinds of data are now being accumulated on data management platforms, allowing the marketeers to provide fine-grained personalised content to you. This benefits the advertisers in that better targeted ads will result in more sales, but better targeted ads will also be less expensive, since keywords which attract a broad audience are priced much higher than those that only a few people would respond to. For example, “Hotel London” currently costs £7 per click on Google Adwords. It's been commented that it would be cheaper to offer to buy somebody a meal at McDonalds than it is to have them click on such an ad at these kinds of rates.

With so much information to hand, it is becoming physically impossible for marketeers to analyse the data manually within a short enough timescale, so increasingly data scientists are writing algorithms (code to fulfil a particular use case) to crunch the numbers and either recommend or sometimes deliver changes to ad campaigns, in order to find the sweet spot that maximises clickthroughs. Data scientists are encouraged to compete with each other to discover insights into data on sites like kaggle.com.
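As an illustration of the sort of algorithm involved – one illustrative technique, not any particular platform's actual method – here is a minimal epsilon-greedy 'multi-armed bandit' sketch that gradually shifts traffic towards the ad variant with the best observed clickthrough rate; the variants and their clickthrough rates are simulated:

```python
# A minimal epsilon-greedy bandit sketch for choosing between ad variants to
# maximise clickthroughs. The variants and their 'true' clickthrough rates are
# simulated; this is one illustrative technique, not any vendor's actual method.
import random

true_ctr = {"ad_a": 0.02, "ad_b": 0.05}   # hypothetical clickthrough rates
clicks = {ad: 0 for ad in true_ctr}
shows = {ad: 0 for ad in true_ctr}
epsilon = 0.1                             # fraction of traffic used to explore

for _ in range(10_000):
    if random.random() < epsilon:         # explore: try a random variant
        ad = random.choice(list(true_ctr))
    else:                                 # exploit: show the best variant so far
        ad = max(true_ctr, key=lambda a: clicks[a] / shows[a] if shows[a] else 0.0)
    shows[ad] += 1
    clicks[ad] += random.random() < true_ctr[ad]   # simulate whether the user clicked

for ad in true_ctr:
    print(ad, "observed CTR:", round(clicks[ad] / max(shows[ad], 1), 4))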

In future, even more data will be collected from you. With the Internet of Things, information about what you actually ate at home can be collected, for example, as your fridge will know what has been put in and what has been taken out.

In recent years, Digital Marketing has progressed from Search Engine Optimisation (SEO), through Pay Per Click (PPC), to now looking at things such as Social Media Optimisation (SMO) and Conversion Rate Optimisation, and it will continue to fragment into specialist areas providing deep insights into consumer behaviour.

The latest ideas are to avoid going for quick sales but to spend time raising brand awareness by leveraging influencers to promote the brand. Bloggers, Youtube content providers, top tweeters etc. are invited to free events by advertisers in anticipation that a certain percentage of them might promote the brand indirectly.

Although this may all seem very 'Big Brother', there are many positives. Ads which actually provide something that you want will be less annoying. Everybody has the ability to become an influencer in their area of interest, so this should make it easier to earn a living doing something that you actually enjoy. In future, you could be advised that you're running low on certain food supplies and a re-ordering facility could be set up. Your health could also be monitored, your genome sequenced and appropriate medicines specific to your unique characteristics produced. These kinds of things will free up your time to do more interesting things, and improve your wellbeing.

Is there much that you can do to stay 'under the radar'? If you wanted to, yes: you could refuse to allow tracking cookies, or decline to supply your details if they were to be passed on to a 3rd party. You could decide not to use a smartphone. You could avoid 'liking' any product on Facebook or communicating on social media. Effectively, you could live as people did quite happily in the 20th century. The problem is that the world in the 21st century will be geared up to you supplying this information. Much as you would find it hard to obtain credit without a credit history, without supplying information you would be outside all of the networks which could be used to find work, hobbies, events, contacts, friends, health services etc. For all the negatives that a lack of privacy entails, there will be many more positives which should make your life easier.

Further information

To understand how digital marketeers go about tracking users, why not take a look at the How to run an online advertising campaign article.

Cleaning your data

From the Profiling your data article, you should have a clear idea as to the structure, content and meaning of the data that you want to use. However, knowing that you have data quality problems is only useful if you actually do something about them. Fixing data quality issues is known as data cleansing.
Note: Other terms for this include data scrubbing, data sanitisation, data munging or data wrangling.

Whichever term you want to use, it involves 3 steps:- standardising your data, matching records, merging records.

Standardising your data

This refers to the need to ensure that all of your data conforms to your business’s validation rules, patterns and formats.
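As a minimal sketch of what standardisation looks like in practice, the following normalises country names and dates to a single agreed format; the mapping table and the expected input formats are purely illustrative:

```python
# A minimal sketch of standardising data to agreed formats: country names are
# mapped to one canonical value and dates are converted to ISO YYYY-MM-DD.
# The mapping table and the input formats are purely illustrative.
from datetime import datetime

COUNTRY_MAP = {"uk": "United Kingdom", "united kingdom": "United Kingdom",
               "england": "United Kingdom", "usa": "United States"}

def standardise_country(value: str) -> str:
    return COUNTRY_MAP.get(value.strip().lower(), value.strip().title())

def standardise_date(value: str) -> str:
    for fmt in ("%d/%m/%Y", "%Y-%m-%d", "%d-%m-%Y"):   # formats we expect to see
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value                                        # leave unparseable values for review

print(standardise_country(" UK "))        # United Kingdom
print(standardise_date("13/06/2015"))     # 2015-06-13
```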

For the more complex problems such as address standardisation, it is more cost-effective to utilise the services of organisations who specialise in capturing organisation or individual address information. Dun & Bradstreet are well known for organisational address data, and Experian Data Quality can provide consumer data cleansing services with good coverage of US and UK consumers.

Matching your data

Once your data is in a standard form, you will need to identify the primary key (unique set of identifying fields) in your records. You can use the following features from the data profiling section to establish a set of attributes which are suitable as a primary key:-

1. Distinct count (which will tell you how many distinct occurrences of those attributes occur in a set of records)

2. Duplicate count (which will tell you how many of those distinct occurrences occur more than once)

3. Unique count (which will tell you how many of those distinct occurrences occur just once)

The closer your unique count is to your distinct count, the better your identification of a primary key is. If you are selecting textual fields as candidates for primary keys, you will also need to consider whether exact or fuzzy matching is appropriate. An example of fuzzy matching is where names that sound similar are treated as the same, e.g. Dereck and Derek would match (a minimal sketch follows the note below).
Note: Unless you're exceptionally lucky, you're unlikely to get 100% uniqueness, as you'll inevitably have primary keys duplicated in some records. For example, if you have the same customer entered twice, either in the same system or in 2 different systems, they are likely to have duplicate primary keys.
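Here is a minimal sketch, with hypothetical column names, of both checks: counting distinct, duplicate and unique key values with pandas, and a simple fuzzy comparison using Python's difflib:

```python
# A minimal sketch of testing a candidate primary key (distinct, duplicate and
# unique counts) plus a simple fuzzy match on names. Column names are hypothetical.
import difflib
import pandas as pd

records = pd.DataFrame({
    "first_name": ["Derek", "Dereck", "Jane", "Jane"],
    "last_name":  ["Johnson", "Johnson", "Smith", "Smith"],
})

key_counts = records.groupby(["first_name", "last_name"]).size()
distinct_count = len(key_counts)                 # distinct key values
duplicate_count = (key_counts > 1).sum()         # keys occurring more than once
unique_count = (key_counts == 1).sum()           # keys occurring exactly once
print(distinct_count, duplicate_count, unique_count)

# Fuzzy matching: treat very similar names as a probable match
similarity = difflib.SequenceMatcher(None, "Dereck", "Derek").ratio()
print("probable match" if similarity > 0.8 else "no match", round(similarity, 2))
```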

Merging your data

Finally, now that you know what your primary keys are, you need to produce a composite record which pulls in the best quality data from each of the records that share the same primary key. You can use information such as the Null count and Maximum length statistics from the data profiling section to determine which attribute in which record is likely to have the best quality data. For example, if you have a middle name field, the data source that has more entries for this field is likely to provide better quality data, and if that field contains the full name rather than just an initial, this may also be a good indication of quality (see the sketch below).
Note: When you merge records that share a duplicate primary key, you should keep the original records and create a pointer to your new golden source of data. This is important in case there are any queries about how reliable your data cleansing exercise was.
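A minimal sketch of the merge, with hypothetical column names and a deliberately simple survivorship rule (keep the longest non-null value per column, and keep pointers back to the original rows), might look like this:

```python
# A minimal sketch of merging duplicate records into a golden record: for each
# key, keep the longest non-null value of each column and record which source
# rows contributed. Column names and the survivorship rule are illustrative.
import pandas as pd

duplicates = pd.DataFrame({
    "customer_key": ["C1", "C1", "C2"],
    "middle_name":  [None, "James", "A"],
    "email":        ["dj@example.com", None, "js@example.com"],
})

def best_value(series: pd.Series):
    non_null = series.dropna().astype(str)
    return None if non_null.empty else max(non_null, key=len)   # prefer the fullest entry

golden = duplicates.groupby("customer_key").agg(best_value)

# Keep a pointer back to the original rows that fed each golden record
golden["source_rows"] = duplicates.reset_index().groupby("customer_key")["index"].apply(list)
print(golden)
```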

Profiling your data

In order for data to be useful, it needs to be of good quality. The first step in ensuring that you have good quality data is to profile your data, i.e. examine your source data to determine its structure, content and meaning.

How do you do this?

If you have small amounts of data, it's relatively easy to simply examine the data visually and note down any problems that you spot. You will also need to eyeball sample data in order to verify that you fully understand the meaning of certain data columns. However, in most circumstances, writing automated scripts which assist with identifying structure and content problems, preferably using a product that specialises in this area, will be much more efficient (a minimal sketch follows). See Gartner's Magic Quadrant for Data Quality Tools to choose the best data quality product for your organisation.
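As a minimal sketch of such a script, the following profiles each column of a (hypothetical) customers.csv for distinct count, null count and maximum length using pandas:

```python
# A minimal sketch of an automated data profile: for each column, report the
# distinct count, null count and maximum length. The file name is hypothetical.
import pandas as pd

source = pd.read_csv("customers.csv", dtype=str)   # profile everything as text first

profile = pd.DataFrame({
    "distinct_count": source.nunique(),
    "null_count":     source.isna().sum(),
    "max_length":     source.apply(lambda col: col.dropna().str.len().max()),
})
print(profile)
```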

Why do we have poor quality data?

The main reason is that validation rules at data entry haven't been created. This may be because your data source doesn't allow validation (e.g. free-form Excel), or because the application developer hasn't put them in, owing to pressure on delivery times or a wish to improve the user experience by letting users enter only the minimal data required to serve the core purpose.

 

Common content problems with data

Address entries

Nowadays, web sites typically ask you to enter your post or zip code, do a lookup on the addresses and let you choose which house you live at. This ensures that addresses are relatively standard. However, in the past, and when users choose to enter addresses in free form, problems occur because users won't put the same data in the same field. For example, you might have address line 1, address line 2, city and country as your address details. So if you live at flat 1a, 15 Cowper Street, London, UK, there are several ways that you can enter this information. You could put the flat and the street information in address line 1, or you could put the flat information in address line 1 and the street information in address line 2. You might abbreviate 15 Cowper Street to 15 Cowper St. You might enter UK, United Kingdom or England as the country, etc.

Phone numbers

In the UK, a phone number could be entered as 0208 421 1343, 020 8421 1343, (020) 8421 1343, +44 208 421 1343 or in a variety of other ways. Each country has its own phone formats, so validating phone formats in every country is challenging.
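As an illustration, a minimal UK-only normaliser for the formats above could strip the punctuation and rewrite the leading zero; real-world validation needs per-country rules or a specialist library:

```python
# A minimal sketch normalising UK phone numbers to a single +44 form by
# stripping punctuation and replacing the leading 0. UK-only and illustrative.
import re

def normalise_uk_phone(raw: str) -> str:
    digits = re.sub(r"[^\d+]", "", raw)      # drop spaces, brackets, dashes
    if digits.startswith("+44"):
        return "+44" + digits[3:].lstrip("0")
    if digits.startswith("0"):
        return "+44" + digits[1:]
    return digits

for number in ["0208 421 1343", "020 8421 1343", "(020) 8421 1343", "+44 208 421 1343"]:
    print(normalise_uk_phone(number))        # all print +442084211343
```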

Date formats

In the UK, we tend to write dates in DD/MM/YYYY format, e.g. 13/06/2015. The ISO date standard is YYYY-MM-DD. Different countries/regions have their own formats.

Email addresses

It's common for people to enter a wrong email address if they don't want to be contacted, so if you're trying to identify unique customers by using email addresses, you'll need to validate the email address entries (a minimal format check is sketched below).
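A minimal sketch of a pragmatic format check is shown below; the pattern is deliberately simple rather than fully RFC-compliant, and it obviously cannot spot a well-formed but fake address:

```python
# A minimal sketch of a pragmatic email format check. The pattern is deliberately
# simple (not full RFC 5322) and cannot detect well-formed but fake addresses.
import re

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email(value: str) -> bool:
    return bool(EMAIL_PATTERN.match(value.strip()))

print(looks_like_email("jane.smith@example.com"))   # True
print(looks_like_email("not-an-email"))             # False
```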

Spelling mistakes

Is Derk Johnson the same as Derek Johnson? Probably, but you'll need to verify that by looking at other information such as their address, email address, social security number, date of birth or passport number.

Missing values

If your entry form doesn't make certain fields mandatory, users will typically skip the optional fields. Making all of the fields mandatory would solve the problem, but users may become so frustrated with having to enter data that is unnecessary to them that they abandon their order. For this reason, optional fields are often left empty, and reports that rely on that data being entered will suffer.

Data profiling to validate structure


As well as the content validation measures specified above, your database administrator will also want to run profiling tests to ensure that the data store used for reporting is structured correctly.

Distinct counts


By examining the distinct counts of 1 or more fields in a record and comparing them with the total number of records, you can determine whether you've found a primary key (a unique identifier for your set of records). For example, you might think that first name and last name were a good unique identifier for your customers, but when you do a distinct count over 1m customers, you find that this combination only provides 950,000 distinct customers, demonstrating that this choice of primary key is invalid.

Maximum Lengths


By finding the maximum length of each field, the DBA can ensure that columns are sized appropriately, reducing any wasted space which saves storage and processing costs.

Number Formats


Checking to see if a field marked as decimal or float actually contains only integers will allow the DBA to choose the correct number data type. This may improve index performance, reducing query times (a minimal check is sketched below).
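A minimal sketch of that check, with a hypothetical column, is shown below:

```python
# A minimal sketch checking whether a float column actually only holds whole
# numbers, suggesting an integer data type would do. Column name is hypothetical.
import pandas as pd

quantities = pd.Series([1.0, 5.0, 12.0], name="quantity")

only_integers = (quantities.dropna() == quantities.dropna().round()).all()
print("use an integer type" if only_integers else "keep a decimal/float type")
```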