Profiling your data

In order for data to be useful, it needs to be of good quality. The first step in ensuring that you have good quality data is to profile your date i.e. examine your source data to determine it’s structure, content & meaning.

How do you do this?

If you have small amounts of data, it’s relatively easy to simply examine the data visually and note down any problems that you spot. You will also need to eyeball sample data in order to verify that you fully understand the meaning of certain data columns. However, in most circumstances writing automated scripts which assist with identifying structure and content problems, preferably using a product that specialises in this area, will be much more efficient. See Gartner’s Magic Quadrant for Data Quality tools, to choose the best data quality product for your organisation.

Why do we have poor quality data?

The main reason is because validation rules at data entry haven’t been created. This may be because your data source doesn’t allow validation e.g. free-form Excel, or because the application developer hasn’t put them in because of pressures on delivery times or wishing to improve the user experience by allowing them to enter the minimal data required to serve the core purpose.

 

Common content problems with data

Address entries

Nowadays, web sites typically ask you to enter your post or zip code and then do a lookup on the addresses and you choose which house you live at. This ensures that addresses are relatively standard. However, in the past, and if users choose to enter addresses in free form, problems occur in that users won’t put the same data in the same field. For example, you might have address line 1, address line 2, city, country as your address details. So if you live in flat 1a, 15 Cowper Street, London, UK. There are several ways that you can enter this information. You could put the flat and the street information in address line 1, or you could put the flat information in address line 1 and the street information in address line 2. You might abbreviate 15 Cowper Street to 15 Cowper St. You might enter UK, United Kingdom or England as the country etc.

Phone numbers

In the UK, a phone number could be entered as 0208 421 1343, 020 8421 1343, (020) 8421 1343, +44 208 421 1343 or a variety of other ways. Each country has their own phone formats, so validating phone formats in each country is challenging.

Date formats

In the UK, we tend to write dates in DD/MM/YYYY format e.g. 13/06/2015. The ISO Date standard is YYYY-MM-DD. Different countries/regions have their own format.

Email addresses

It’s common for people to enter the wrong email address, if they don’t want to be contacted, so if you’re trying to identify unique customers by using email addresses, you’ll need to validate the email address entries.

Spelling mistakes

Is Derk Johnson, the same as Derek Johnson. Probably, but you’lll need to verify that by looking at other information such as their address, email address, social security number, date of birth or passport number.

Missing values

If your entry form doesn’t make certain fields mandatory, users will typically skip the optional fields. If you made all of the fields mandatory, it would solve the problem, but the user may become so frustrated with having to enter data that is unnecessary to them that they may abandon their order. For this reason, reports that rely on the data being entered will suffer.

Data profiling to validate structure


As well as content validation measures specified above, your database administrator will also want to run profile tests to ensure that they structure the data store used for reporting correctly.

Distinct counts


By examining the distinct counts of 1 or more fields in a record and comparing it with the total number of records, you can determine whether you’ve found a primary key (unique identifier for your set of records). For example, you might think that first name and last name for your customer was a good unique identifier but when you do a distinct count of 1m customers, you find that this combination only provides 950,000 distinct customers, demonstrating that this choice of primary key is invalid.

Maximum Lengths


By finding the maximum length of each field, the DBA can ensure that columns are sized appropriately, reducing any wasted space which saves storage and processing costs.

Number Formats


Checking to see if a field marked as decimal or float actually contains integers will allow the DBA to choose the correct number data type. This may improve index performance, reducing query times.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s