Profiling your data

In order for data to be useful, it needs to be of good quality. The first step in ensuring that you have good quality data is to profile your date i.e. examine your source data to determine it’s structure, content & meaning.

How do you do this?

If you have small amounts of data, it’s relatively easy to simply examine the data visually and note down any problems that you spot. You will also need to eyeball sample data in order to verify that you fully understand the meaning of certain data columns. However, in most circumstances writing automated scripts which assist with identifying structure and content problems, preferably using a product that specialises in this area, will be much more efficient. See Gartner’s Magic Quadrant for Data Quality tools, to choose the best data quality product for your organisation.

Why do we have poor quality data?

The main reason is because validation rules at data entry haven’t been created. This may be because your data source doesn’t allow validation e.g. free-form Excel, or because the application developer hasn’t put them in because of pressures on delivery times or wishing to improve the user experience by allowing them to enter the minimal data required to serve the core purpose.


Common content problems with data

Address entries

Nowadays, web sites typically ask you to enter your post or zip code and then do a lookup on the addresses and you choose which house you live at. This ensures that addresses are relatively standard. However, in the past, and if users choose to enter addresses in free form, problems occur in that users won’t put the same data in the same field. For example, you might have address line 1, address line 2, city, country as your address details. So if you live in flat 1a, 15 Cowper Street, London, UK. There are several ways that you can enter this information. You could put the flat and the street information in address line 1, or you could put the flat information in address line 1 and the street information in address line 2. You might abbreviate 15 Cowper Street to 15 Cowper St. You might enter UK, United Kingdom or England as the country etc.

Phone numbers

In the UK, a phone number could be entered as 0208 421 1343, 020 8421 1343, (020) 8421 1343, +44 208 421 1343 or a variety of other ways. Each country has their own phone formats, so validating phone formats in each country is challenging.

Date formats

In the UK, we tend to write dates in DD/MM/YYYY format e.g. 13/06/2015. The ISO Date standard is YYYY-MM-DD. Different countries/regions have their own format.

Email addresses

It’s common for people to enter the wrong email address, if they don’t want to be contacted, so if you’re trying to identify unique customers by using email addresses, you’ll need to validate the email address entries.

Spelling mistakes

Is Derk Johnson, the same as Derek Johnson. Probably, but you’lll need to verify that by looking at other information such as their address, email address, social security number, date of birth or passport number.

Missing values

If your entry form doesn’t make certain fields mandatory, users will typically skip the optional fields. If you made all of the fields mandatory, it would solve the problem, but the user may become so frustrated with having to enter data that is unnecessary to them that they may abandon their order. For this reason, reports that rely on the data being entered will suffer.

Data profiling to validate structure

As well as content validation measures specified above, your database administrator will also want to run profile tests to ensure that they structure the data store used for reporting correctly.

Distinct counts

By examining the distinct counts of 1 or more fields in a record and comparing it with the total number of records, you can determine whether you’ve found a primary key (unique identifier for your set of records). For example, you might think that first name and last name for your customer was a good unique identifier but when you do a distinct count of 1m customers, you find that this combination only provides 950,000 distinct customers, demonstrating that this choice of primary key is invalid.

Maximum Lengths

By finding the maximum length of each field, the DBA can ensure that columns are sized appropriately, reducing any wasted space which saves storage and processing costs.

Number Formats

Checking to see if a field marked as decimal or float actually contains integers will allow the DBA to choose the correct number data type. This may improve index performance, reducing query times.


London Technology week – 15th to 19th June 2015

It’s that time of year again. Lots of events next week.

Register at London Technology Week

Big Data related events include:-

Big Data Analytics

  • When: 16 June 2015
  • Where: Hotel Russell, WC1B 5BE
  • Times: 8:00 am – 5:00 pm
  • Event Format: Conference
  • Cost To Attend: £0.00

Big Data Masterclass

  • Where: ExCeL London, E16 1XL
  • When: 16 June – 18 June 2015
  • Times: 10:00 am – 5:00 pm
  • Event Format: Workshop
  • Cost To Attend: £0.00

The Value of Big Data – What Does It Mean For Government?

  • When: 16 June 2015
  • Where: 10 St Bride Street, EC4A 4AD
  • Times: 10:00 am – 12:00 pm
  • Event Format: Workshop
  • Cost To Attend: £0.00

Data Science Insights – Building Brains: Learning from data

  • When: 17 June 2015
  • Where: The Royal Institution of Great Britain, 21 Albemarle Street, London W1S 4BS
  • Times: 4:00 pm – 5:30 pm

IBME MedTech Week 2015: Harnessing the Power of Nanotechnology, Mobile Phones and Big Data for Global Health

  • When: 17 June 2015
  • Where: Roberts Building, Sir Ambrose Fleming lecture theatre, University College London, Torrington Place, WC1E 7JE
  • Times: 6:00 pm – 7:00 pm
  • Event Format: Other
  • Cost To Attend: £0.00

From IoT to Predictive Analytics

  • When: 18 June 2015
  • Where: Cheapside House, 138 Cheapside, EC2V 6BW
  • Times: 6:15 pm – 9:00 pm
  • Event Format: Networking event
  • Cost To Attend: £0.00

Securing your Big Data

Securing your data from both internal and external breaches is increasingly important both in terms of the cost of loss of the initial data and subsequent reputational damage.

Securing your data can be grouped in to 2 broad categories:-

Preventative measures

The following areas should be covered to provide a robust security solution:-

User authentication

Ideally, you need a site-wide authentication system where user and group information can be maintained centrally. Each application in your organisation should be able to use a “single sign on” (SSO) so that users don’t have to re-enter their details for each application. For enhanced user authentication security, you can use techniques such as:-

1. Captcha (which prevents external attacks by computers designed to generate random usernames and passwords to try to gain access)

2. 2-stage authentication. This is the type of authentication that you’ll get for your bank account, where you have to not only enter your username & password, but also enter, for example, a generated key code which is updated each minute.

Authorisation (Data & Reports)

Once a user is authenticated, data & reports need to be secured so that only certain users have access to that information. Typically, users are assigned to groups or roles, and the authorisation privileges are then assigned to the group/role. Privileges can allow users to create, read, update and/or delete data, dependent on what they’ve been assigned to.

Dynamic Data Masking

Dynamic data masking refers to the scrambling, masking or removal of data at a row or column level from a query result set.

Row level security

As well as user’s being authorised to see data at a table or report level, privileges can be assigned to certain rows of data. For example, if you have a database that contains customer data from all over the world, but for legal reasons only greek users can see greek customer data, then a row-level filter can be applied to prevent non-greek users seeing any customer data whose country of residence is Greece.

Column level security

In a similar fashion, column level security can be imposed so that sensitive data is not presented to users who are not entitled to see it. For example, your internal investigations team may want to see sensitive customer payment card details, and the sensitive data may be located in columns within a table that also holds non-sensitive data that would be useful to the rest of the organisation. In this situation, you need to be able to redact/scramble sensitive data when non-investigations staff run a query against the customer payment card table.

Approval process

For user authentication & authorisation to work properly, you will need a business process in place that ensures that approval isn’t granted without necessary stakeholder involvement.

Network Encryption

Another situation where data breaches are prevalent are in the transmission of sensitive data including username and password across unencrypted networks. By using secure network protocols which utilise encryption, this can be prevented.

Database, export & backup encryption

To avoid situations where a username or password may have been obtained or data has been exported or backed up to tape, extra security can be imposed by encrypting data at rest within your database, any exports and any backups.

Take care to ensure that your solution can cope with indexing and foreign key constraints and that the decryption process does not significantly affect performance.

Ensure that your master encryption key is stored securely, otherwise can’t gain access to data or backups!

Permanent Data Masking

This will be required when you move production data to a non-production environment. Production environments generally have tight access controls with only a small number of vetted individuals being given access, whereas non-production environments typically have a very open access with 3rd party suppliers coming on and off the project frequently.


User authentication and authorisation to data should be periodically reviewed, so as to prevent users still having access who have left or changed roles within the organisation.


Although the preventative steps should prevent unauthorised access to your data, you should also put in to place processes that will allow you to detect user activity so that you can identify culprits and what they have been able to do with your data.

User authentication logs and reporting

Repeated password failures should be logged and reported to the security team

User activity (audit) logs and reporting

All read and write activity which occurs against the data should be logged.

What are NoSQL databases?

A term which is often spoken about alongside “Big Data” is NoSQL (“Not Only SQL”). Most NoSQL databases share the distributed, horizontally scalable, high availability features of the Hadoop/Spark stack, but allow random access, thus allowing low latency querying of individual records.

NoSQL databases started to gain popularity when dynamic web content needed to be stored and retrieved from a data store, and addressed problems with relational database management systems (rdbms) had at the time, in being able to flexibly adapt data models, as quickly as a web development team could develop code.

Relational Database Tables

To query all of the information about James Dey, in the example below, you’d type:-

select * from employees e, addresses a where employee_first_name = 'James' and employee_surname='Dey' and a.employee_id = e.employee_id;

Although a number of databases are collectively referred to as “NoSQL, there are 4 distinct NoSQL families which have different features and meet different use cases:-

Document stores

MongoDB is the most popular document store. Document stores allow flexible storage of documents, where each record can have the following features:-

  • A record contains a document
  • It is variable length
  • Types of values of a particular column can vary from record to record
  • A column can contain an array
  • A document can have a nested structure

Document Store Collections

To query all of the information about James Dey, you’d type (if using MongoDB):-

db.employees.find({employee_first_name:James, employee_surname:Dey});

Document stores have become very popular amongst web application developers as documents can be generated containing nested hierarchical data using formats such as JSON and XML, which more closely matches the way in which web application developers code than relational databases do.

Wide column stores

Cassandra and HBase are currently the 2 most popular wide column stores.

A column store refers to the fact that data is stored in columns which reference row keys, as opposed to being stored in rows containing columns which is the way that relational databases usually store data.

A row-based approach is advantageous where you’re selecting individual records for update. For example, if you want to update details for employee “James Dey”, it makes sense to retrieve the entire record for that employee.

Column stores are advantageous, where you want to search a large set of records based on a particular column value. For example, if you want to count all employees who live in London, a column store has an advantage since you simply have to retrieve the record where column=’London’ and count the referenced row keys, whereas in an unindexed row-based rdbms, you’d have to retrieve all employee records, check to see if the city was London and then increment the count.

A wide column store has the following characteristics:-

  • Data is stored in columns within column families within a row.
  • A column family is a set of columns which make up all or part of that row.
  • Within each column family, there are 1 or more columns.
  • Data within a column family is physically stored together
  • Each cell consists of a key-value pair, where the key is a combination of row key, column family and column (qualifier).

HBase table

To query all of the information about James Dey, using the HBase Java Api, your code would look like this:-

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "Employees");

SingleColumnValueFilter f1 = new SingleColumnValueFilter(Bytes.toBytes("Employee"), Bytes.toBytes("Employee First Name"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes("James"))

SingleColumnValueFilter f2 = new SingleColumnValueFilter(Bytes.toBytes("Employee"), Bytes.toBytes("Employee Surname"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes("Dey"));

FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL);

Scan scan = new Scan();
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {

Wide column stores have become popular amongst the Apache Hadoop community, since HDFS only allows sequential access to data, which is only suitable for high latency batch processing. HBase & Cassandra can sit on top of HDFS and provide indexing and random access, allowing individual records to be retrieved with low latency.

Key/Value stores

Redis is currently the most popular key/value store. A key/value store is the simplest form of store, consisting simply of a key and a value for every attribute stored.

Key Value Store

Key/value stores are popular mainly due to their simplicity and hence relative speed in persisting data. They’re great for persisting web form data, for example, so as to allow multiple application servers to pick up the data.

Graph data stores

Currently, the most popular graph data store is Neo4J

In a relational database, relationships are actually not handled particularly well, with bridging tables required to handle many to many relationships. Entities within graph data stores hold pointers to records in other entities to which they are related. This means that the costly lookup which compares a foreign key column in 1 entity with a primary key column in a related entity is no longer required, as the entity (node) knows how all records contained within it are related to all other records in the data model (graph).

Due to the speed at which relationships between entities can be queried, graph data stores are popular in social media sites such as twitter, facebook and linkedin, where establishing a multitude of complex relationships between users is very important.

Graph Store Big Data Presentations

Positives & Negatives of Big Data

What do we do with all this big data?

Big data is better data

The surprising seeds of a big data revolution in Healthcare

How data will transform business

The curly fries conundrum

Why privacy matters

Make Data More Human

The dark side of big data

How do you run an online advertising campaign?

A big user of Big Data are the advertising industry, so it’s worth understanding how advertisers go about running online advertising campaigns and analysing the results.


Placing the Advertisements

Decide how you whether you wish to run your ad campaign:-
1. In-house. Advantage is that you will have full control of the content and placement of the advertisement.

2. Via an advertising agency. Advantage is that the advertising agency specialises in the placement of ads and knows the best way to target your desired audience.

3. Use an affiliate network– generally used where the advertiser is happy to pay for advertisements placed by publishers. Somewhat of a scatter gun approach.

Creating an Ad campaign via an affiliate network
Creating an Ad campaign via an affiliate network

Capturing Ad Statistics

Capturing impressions, page views & unique visitors via an affiliate network
Capturing impressions, page views & unique visitors via an affiliate network
Capturing click-throughs via an affiliate network
Capturing click-throughs

Reporting captured Ad statistics

You can then choose to either:-
1. Develop in-house reports
2. Rely on reports by the affiliate networks (if you’ve chosen that route)
3. Rely on a 3rd party analytics tool e.g. Google analytics

Generally, if you’re a large organisation running ad campaigns both in-house and via affiliate networks, you’ll want to receive the statistics from the affiliate networks and report yourself. Google analytics is suitable for small companies, and affiliate networks is ok for companies who just use a single network.

What are you trying to achieve and how do you achieve it?

Some example questions that an advertiser asks to determine the success of an ad campaign:-

How many users have clicked on the ad – “click throughs”

These statistics can be captured by adding parameters to the ad hyperlink. The parameters can contain information about the campaign, the type of ad, and the source webpage and/or any bespoke parameters you may wish.

For example, this is an ad hyperlink provided by tradedoubler on behalf of an advertiser:-<a href=”; title=”My Anchor Text” target=”_blank”>My Anchor Text</a>

The parameter information – p(256096)a(1704967)g(22300378) allows tradedoubler to lookup details about the ad, and increment the correct click-through count.

The parameters can be read by the web server when it receives the target webpage request and the information stored in a database.

A name for website visitors who clicked on an ad hyperlink to get to the site is a referral.

How many times has a particular advertisement been presented to the user – “impressions

The Click-through rate-click throughs/impressions can be compared with click throughs/page views to provide information as to how changing the number of ads on the web page generates more attention or not.

Impressions can be captured by the web server which stores the Ad image requested by the web site where the Ad is placed, by incrementing the impression count for the campaign whenever the image is requested. The Ad image contains a parameter for the campaign id. For example, An Ad image controlled by the tradedoubler affiliate network looks like this:-

<img src=”; />

The parameter information after the ? allows tradedoubler to work out which campaign the ad is associated with.

How many times has a page been presented to a user on which an advertisement is located – “page views

Click throughs/page views will provide an indication as to how successful your ad was at receiving attention.

Page views can be worked out by aggregating impression counts for all objects on the same webpage.

Note: Impressions can be more than page views as multiple ads could appear on a single page.

How many users came directly to your website as opposed to referrals?

Direct users are usually more valuable than referrals as they’ve found your site generally from a search engine or word of mouth, rather than via an ad, and are more likely to register/buy a product. A user whose come directly, won’t have any associated referral parameters in the url that they used to arrive at the site, so the server can spot this and record information about the direct visitor in a database.

How many unique visitors have there been during a defined period of time?

This is useful to eliminate skewed results e.g. where somebody could multiply click on an ad, in order to generate ad revenue from your company.
The main method of identifying a unique visitor is to store a tracking cookie, the 1st time that they land on your target webpage with an expiry period of your choice (typically 24 hours). Each time a request for a target webpage is made, the target webpage will run a script to retrieve a valid tracking cookie, if 1 exists or to set one if it doesn’t. If there is no valid tracking cookie, the script can inform the server which can log a unique visitor event. The drawback of tracking cookies is that users can disable them. Another method of tracking unique visitors is to log the http header and parameter information passed to the server when the visitor first gets to the site.

How many users who clicked through to your website actually went on to register or bought a product – “Converted”.

This can be further split down by those who immediately converted or those that converted at a later time.

Converteds can be captured by checking whether a visitor who subsequently registered or bought a product was originally a referral or not. This can be done by tying the information in the tracking cookie against the information captured about a unique visitor when they first visited the site. An immediate can be deduced by comparing the registration time against the time that the unique visitor was first detected on the website.

Where’s the best place for the Ad to be placed and which type of AD generates the most clicks?

The advertiser will want to know if a banner ad worked better than a skyscaper ad, and whether a pop-over ad worked better than a static ad, for example.

This information can be captured by embedding parameters in the ad hyperlink url. The server will extract the parameter information and store the results in a database.

Which advertising channel generated the most traffic?

For example, did more visitors arrive due to an email campaign, a mobile app or a web ad?

This information can either be deduced from information sent as part of the HTTP Header with all web page requests submitted to the server, or as a parameter added to the ad hyperlink url.

Which is the best geographic location for the ad

Ads can do better in certain countries, so this helps target the audience.

Can work this out by writing code which does a lookup on the location of the ip address. Prone to error, however, as a user can often be located in a different country to the referring web server.

Determine profile of users who converted

This can be established by asking questions of the user when he first registers on the site. The information is then used to try to improve conversion rates by better targeting the audience who are likely to be interested in your product.

Affiliate Networks

For most organisations, it would be very labour intensive to manage a network of publishers who are advertising your products (many of whom are home workers) on your behalf. For this reason, there are specialist organisations referred to as affiliate networks who do this on your behalf e.g. TradeDoubler, DoubleClick. The publisher signs up with the affiliate networks and are offered various rates of commission, typically based on pay per click (PPC), to place an advertisement. The advertisement could be placed on the publisher’s own website or they might purchase a Google Adword if the commission is cost-effective. The ad contains a hyperlink which redirects the user via the affiliate network to the target landing page. The affiliate network can then record that a click through has occurred. The ad also contains a link to an image stored on the affiliate network server, so when this is served up, the affiliate network increments the impression counter for the ad, and for the publisher.