Cleaning your data

From the Profiling your data article, you should have a clear idea of the structure, content and meaning of the data that you want to use. However, knowing that you have data quality problems is only useful if you actually do something about them. The process of fixing data quality issues is known as data cleansing.
Note: Other terms for this include data scrubbing, data sanitisation, data munging or data wrangling.

Whichever term you prefer, data cleansing involves three steps: standardising your data, matching records and merging records.

Standardising your data

This refers to ensuring that all of your data conforms to your business's validation rules, patterns and formats.
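As a minimal sketch of what rule-based standardisation looks like in practice (the target formats below are illustrative business rules, not universal standards):

```python
import re

def standardise_phone(raw: str) -> str:
    """Strip punctuation and spacing so every phone number shares
    one format: digits only. An illustrative rule, not a standard."""
    return re.sub(r"\D", "", raw)

def standardise_name(raw: str) -> str:
    """Collapse runs of whitespace and title-case the name."""
    return " ".join(raw.split()).title()

print(standardise_phone("+44 (0)20 7946-0958"))  # 4402079460958
print(standardise_name("  smith,  JOHN "))       # Smith, John
```

Simple rules like these handle the bulk of your fields; as noted below, the genuinely hard cases such as postal addresses are usually better outsourced.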

For more complex problems such as address standardisation, it is often more cost effective to use the services of organisations that specialise in capturing organisation or individual address information. Dun & Bradstreet is well known for organisational address data, and Experian Data Quality provides consumer data cleansing services with good coverage of US and UK consumers.

Matching your data

Once your data is in a standard form, you will need to identify the primary key (a unique set of identifying fields) in your records. You can use the following statistics from the data profiling section to establish a set of attributes suitable as a primary key:

1. Distinct count (which will tell you how many distinct occurrences of those attributes occur in a set of records).

2. Duplicate count (which will tell you how many of those distinct occurrences occur more than once).

3. Unique count (which will tell you how many of those distinct occurrences occur just once).

The closer your unique count is to your distinct count, the better your choice of primary key. If you are selecting textual fields as primary key candidates, you will also need to decide whether exact matching or fuzzy matching is appropriate. An example of fuzzy matching is treating names that sound similar as the same, e.g. Dereck and Derek would match.
Note: Unless you’re exceptionally lucky, you’re unlikely to get 100% uniqueness, as you’ll inevitably have primary keys duplicated in some records. For example, if the same customer is entered twice, either in one system or in two different systems, both entries are likely to share a primary key.
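One common way to implement sound-alike fuzzy matching is a phonetic code such as American Soundex, sketched here so that Dereck and Derek produce the same code:

```python
def soundex(name: str) -> str:
    """American Soundex: a letter plus three digits, so that
    similar-sounding names share the same code."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    encoded = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "hw":   # h and w do not separate duplicate codes
            prev = code
    return (encoded + "000")[:4]

print(soundex("Derek"), soundex("Dereck"))  # D620 D620 - a fuzzy match
```

Phonetic codes are cheap to compute and index, but they over-match on short names, so in practice they are usually combined with a second, stricter comparison.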

Merging your data

Finally, now that you know what your primary keys are, you need to produce a composite record which pulls in the best quality data from each of the records that share the same primary key. You can use statistics such as the null count and maximum length from the data profiling section to determine which attribute in which record is likely to have the best quality data. For example, if you have a middle name field, the data source with more entries for that field is likely to provide better quality data; if the field contains the full name rather than just an initial, that may also be a good indication of quality.
Note: When you merge records that share a duplicate primary key, you should keep the original records and create a pointer to your new golden source of data. This is important in case there are any queries about how reliable your data cleansing exercise was.
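The merging step can be sketched as a simple survivorship rule. The rule chosen here, prefer the longest non-null value, mirrors the middle-name example above but is only one of many possible choices; note that the composite keeps a pointer back to the original records, as the note recommends:

```python
def merge_records(duplicates: list[dict]) -> dict:
    """Build a composite 'golden' record from records sharing a key.
    Survivorship rule (illustrative): per field, keep the longest
    non-null value."""
    golden = {}
    fields = {f for rec in duplicates for f in rec}
    for field in fields:
        candidates = [rec[field] for rec in duplicates if rec.get(field)]
        golden[field] = max(candidates, key=len) if candidates else None
    # Keep the originals attached to the golden record so the
    # cleansing exercise can be audited later.
    golden["source_records"] = duplicates
    return golden

merged = merge_records([
    {"first": "John", "middle": "A", "last": "Smith"},
    {"first": "John", "middle": "Andrew", "last": None},
])
print(merged["middle"])  # Andrew - the fuller value survives
```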

