Working Towards Data Quality
Introduction
In this article, I suggest ways to move your company’s data toward a higher state of quality. Data reaches its highest quality when it meets the needs of the business. Those needs are expressed in rules regarding legitimate values and relationships. This in turn requires that:
- legitimate values and relationships are defined, and continually expanded and redefined;
- you have the means to verify that the rules are being met; and
- you correct the incorrect data accordingly.
This is not as hard as it may appear. First, the highest quality is not perfect data, but only operationally sufficient data. Second, most bad data is wrong independent of context, which means you can find it by discovering impossible or unlikely values. And fortunately, there are powerful tools to discover data that is possible, but wrong in context.
Overview
Data would be of the highest quality if it were an exact, complete, protected, and consistent representation of facts. In practical terms, an achievable goal is to assure that corporate data is of acceptably high quality; that is, “fit for their intended uses in operations, decision making, and planning” (J. M. Juran). This definition relieves you of having to do the impossible. Data does not need to be perfect, only good enough to get the work done. Again, this is achievable.
High quality data is developed through data governance: the cooperation of management, auditors, programmers, and data architects who develop and maintain a system of data standards. The practice known as master data management (MDM) develops and enforces those standards, and includes continuously monitoring and correcting the data against them.
To assess the return on investment in data quality improvement efforts, you must weigh the monetized costs of achieving and maintaining high quality data against the penalty of living with not-so-high quality data. Costs come in the form of planning, implementation, discovery, and remediation.
In short, garbage in produces garbage out. Unfortunately, the benefits of high quality data are usually not appreciated until something goes wrong because the data has led to an erroneous result.
Data Governance
High quality data must be defined for your enterprise. The definition takes the form of rules about the possible values and required relationships for the data. Data governance teams or designated information stewards may make these rules initially, but ultimately data architects, DBAs, and other IT users will have to manage within that framework and adjust the rules as the data grows. Slowly changing dimensions should also be recorded in data lineage repositories, and business rule changes in asset control repositories.
Sometimes the best that can be done is to control how the data is allowed to look. As a meaningful double negative, the data in the database may not be correct, but it does not look incorrect. For example, this article was written in February. This may or may not be true, but clearly it was not written in the month of Yraubef. This is critical knowledge in maintaining data quality, because at least when you know that the data is definitely wrong, you can correct it.
A simple plan begins with knowing the rules and centrally recording the possible values or relationships of each datum in an MDM repository (see Master Data Management below).
Some databases have a facility for recording these constraints and checking values on data entry. Not all databases work the same way, however, and flat files do not have this feature at all. We propose taking the metadata offline and enriching it. The offline records should be database-independent.
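As an illustration, here is a minimal sketch, using Python’s sqlite3 module and a hypothetical customer table, of how a database-level CHECK constraint rejects an impossible value at entry time. Flat files have no equivalent, which is why the external, database-independent checks described below are needed.

```python
import sqlite3

# In-memory database with a hypothetical customer table whose age column
# carries a declarative CHECK constraint (the rule: 0 <= age <= 150).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        cust_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        age     INTEGER CHECK (age BETWEEN 0 AND 150)
    )
""")

conn.execute("INSERT INTO customer (name, age) VALUES (?, ?)", ("Alice", 42))

try:
    # An impossible value is rejected at entry time by the database itself.
    conn.execute("INSERT INTO customer (name, age) VALUES (?, ?)", ("Bob", 207))
except sqlite3.IntegrityError as err:
    print("rejected on entry:", err)
```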
For each database table and file, consider at least creating a spreadsheet with columns for the following (a code sketch of one such record appears after the list):
- name of datum
- meaning
- source (table, file name)
- other selected metadata (e.g. security, frequency)
- any pertinent business rules that have to be followed in order for the data to be correct
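For example, here is a minimal, database-independent sketch of what one row of that spreadsheet might look like as a Python record; the field names and sample entries are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DatumRecord:
    """One row of the offline, database-independent metadata spreadsheet."""
    name: str             # name of datum
    meaning: str          # business definition
    source: str           # table or file name
    metadata: dict = field(default_factory=dict)   # e.g. security, frequency
    rules: list = field(default_factory=list)      # pertinent business rules

# Hypothetical entries; the same structure works for any database or flat file.
repository = [
    DatumRecord(
        name="age",
        meaning="customer age in whole years",
        source="customer",
        metadata={"security": "internal", "frequency": "on update"},
        rules=["0 <= age <= 150"],
    ),
    DatumRecord(
        name="party",
        meaning="registered political party",
        source="voters.csv",
        rules=["party in {REP, DEM, IND}"],
    ),
]
```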
Without even getting to the rules, you should be aware that across tables, when the same name has more than one meaning or when the same meaning has more than one name, there is a problem which eventually needs to be corrected. Ideally, this would also be application-independent, but that comes later.
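Here is a small sketch of how such naming conflicts could be detected from the (name, meaning, source) columns of the metadata spreadsheet; the entries shown are hypothetical.

```python
from collections import defaultdict

# Hypothetical (name, meaning, source) triples pulled from the metadata spreadsheet.
entries = [
    ("cust_no", "customer account number", "orders"),
    ("cust_no", "customer phone number",  "support_calls"),   # same name, different meaning
    ("account_id", "customer account number", "invoices"),    # same meaning, different name
]

by_name, by_meaning = defaultdict(set), defaultdict(set)
for name, meaning, source in entries:
    by_name[name].add(meaning)
    by_meaning[meaning].add(name)

homonyms = {n: m for n, m in by_name.items() if len(m) > 1}      # one name, many meanings
synonyms = {m: n for m, n in by_meaning.items() if len(n) > 1}   # one meaning, many names
print("same name, different meanings:", homonyms)
print("same meaning, different names:", synonyms)
```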
Rules
There are three basic rule types pertaining to data values:
- numeric ranges; e.g. 0 <= Age <= 150
- set membership; e.g. Party in/not in [REP, DEM, IND]
- data dependency; e.g. Distance = Rate x Time
These simple rules can become quite complex:
- rules can exist in combinations involving logical expressions
- membership in a set could refer to external sets
- data dependency could refer to data in different places.
Even with these complications, the ability to check such conditions would catch many instances of data errors. Just finding alphabetic characters where numeric characters should appear, and vice versa, would be useful.
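To make the three rule types concrete, here is a minimal sketch of a checker for a single record; the record layout and the specific rules are the hypothetical examples from the list above.

```python
# A minimal sketch of the three rule types applied to one record.
# The record layout (age, party, distance, rate, time) is hypothetical.

VALID_PARTIES = {"REP", "DEM", "IND"}   # set membership rule

def check_record(rec: dict) -> list[str]:
    """Return a list of rule violations for one record; empty means it passed."""
    errors = []

    # 1. Numeric range: 0 <= Age <= 150
    if not (0 <= rec["age"] <= 150):
        errors.append(f"age {rec['age']} outside 0..150")

    # 2. Set membership: Party in [REP, DEM, IND]
    if rec["party"] not in VALID_PARTIES:
        errors.append(f"party {rec['party']!r} not in {sorted(VALID_PARTIES)}")

    # 3. Data dependency: Distance = Rate x Time (within a small tolerance)
    if abs(rec["distance"] - rec["rate"] * rec["time"]) > 1e-6:
        errors.append("distance does not equal rate * time")

    return errors

print(check_record({"age": 207, "party": "XYZ", "distance": 10.0, "rate": 2.0, "time": 5.0}))
# -> ['age 207 outside 0..150', "party 'XYZ' not in ['DEM', 'IND', 'REP']"]
```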
Master Data Management
Those responsible for Master Data Management (MDM) are also charged with maintaining quality data, both in the master data itself and in the transaction data measured against it. They must develop, verify, and correct data so it conforms to business rules. We propose that MDM teams use data-verifying programs to check the data against those rules. There may not be many rules at the outset, so these programs will start small; they can be extended and executed as MDM initiatives grow. Such rules should be in a form useful to application developers before programming begins.
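One possible shape for such a data-verifying program, sketched under the assumption that the rules are exported from the central metadata spreadsheet as simple field/rule-type/parameter rows; as more rule types are recorded, the dispatcher grows with them.

```python
import csv

# Hypothetical central rule file, exported from the metadata spreadsheet,
# with one rule per line in the form:  field,rule_type,parameter
#   age,range,0:150
#   party,set,REP|DEM|IND

def load_rules(path):
    """Read the centrally recorded rules (field, rule_type, parameter)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, fieldnames=["field", "rule_type", "parameter"]))

def violations(record, rules):
    """Check one data record (a dict of field -> value) against every rule."""
    found = []
    for rule in rules:
        value = record.get(rule["field"])
        if rule["rule_type"] == "range":
            low, high = map(float, rule["parameter"].split(":"))
            try:
                ok = low <= float(value) <= high
            except (TypeError, ValueError):
                ok = False
            if not ok:
                found.append(f"{rule['field']}={value!r} outside {low}..{high}")
        elif rule["rule_type"] == "set":
            if value not in rule["parameter"].split("|"):
                found.append(f"{rule['field']}={value!r} not an allowed value")
        # new rule types (e.g. data dependencies) are added here as the MDM effort grows
    return found
```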
Inconsistencies
After the data pass the value tests, the next level of verification checks for consistency of the data across tables. In theory, a relational database is application-independent and normalized, so a given data value appears in only one table. But when the same value is stored in multiple tables, we need to deal with the problem of having more than one acceptable, but nonetheless different, value.
MDM users can use outer joins to find fields in two tables that should hold the same value but do not. Outer joins are used mostly to find changes between before and after versions of a table (change data capture), but there is no requirement that the records be the same. The two tables are matched on a common key and ordered on the value to be checked.
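For example, here is a sketch of that technique using a full outer join (via pandas) on two hypothetical tables that both store a customer’s ZIP code; rows present in only one table, or present in both but with different values, are flagged for review.

```python
import pandas as pd

# Two hypothetical tables that each store a customer's zip code.
billing = pd.DataFrame({"cust_id": [1, 2, 3], "zip": ["02134", "10001", "60601"]})
shipping = pd.DataFrame({"cust_id": [1, 2, 4], "zip": ["02134", "10002", "94105"]})

# Full outer join on the key; the indicator shows records missing from either table.
merged = billing.merge(shipping, on="cust_id", how="outer",
                       suffixes=("_billing", "_shipping"), indicator=True)

# Inconsistencies: present in both tables but with different values,
# or present in only one of the two tables.
mismatched = merged[(merged["_merge"] != "both") |
                    (merged["zip_billing"] != merged["zip_shipping"])]
print(mismatched)
```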
Automation
To automate master and transactional data quality improvement (and hopefully not unduly burdening production processes), we see two options:
Online — integrate data quality checks within SQL procedures, or inside or between data integration tasks such as extraction and loading. In the case of CoSort (below), data cleansing and standardization can be built directly into data transformation and reporting job scripts, without requiring a separate I/O pass.
Offline — External programs can check new data for errors and inconsistencies at the close of business or when traffic is low. Alternatively, that data can be off-loaded onto smaller machines, or evaluated on idle servers or networked PCs. Where execution speed is not critical, what does not get checked every night might wait for the weekend. The output of these programs will be ready for review and correction the next morning.
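Here is a minimal sketch of such an offline batch check, with hypothetical file names and a single illustrative rule; it splits the day’s extract into good and bad files so the bad file is ready for review the next morning.

```python
import csv
from datetime import date

# Hypothetical nightly batch: read the day's extract, split rows into
# "good" and "bad" files, and leave the bad file for review in the morning.
def nightly_check(extract_path, is_valid):
    stamp = date.today().isoformat()
    good_path, bad_path = f"good_{stamp}.csv", f"bad_{stamp}.csv"

    with open(extract_path, newline="") as src, \
         open(good_path, "w", newline="") as good, \
         open(bad_path, "w", newline="") as bad:
        reader = csv.DictReader(src)
        good_writer = csv.DictWriter(good, fieldnames=reader.fieldnames)
        bad_writer = csv.DictWriter(bad, fieldnames=reader.fieldnames)
        good_writer.writeheader()
        bad_writer.writeheader()
        for row in reader:
            (good_writer if is_valid(row) else bad_writer).writerow(row)
    return good_path, bad_path

# Example: rows pass only if the (hypothetical) age column holds a plausible number.
nightly_check("daily_extract.csv",
              lambda row: row.get("age", "").isdigit() and 0 <= int(row["age"]) <= 150)
```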
Resources
Not surprisingly, IRI provides tools for data validation and, in many cases, remediation. Our users know the IRI Workbench (Eclipse GUI) and/or the 4GL scripts of the CoSort SortCL language it supports, which enable them to automatically:
- acquire and integrate data from multiple files and tables
- apply include/omit logic to verify values and ranges
- search sets to confirm membership
- employ multi-table inner and outer joins to separate matches from non-matches
- direct output to multiple targets with good, bad, and dubious data
- perform conditional find and replace operations
- replace known ‘inferior’ values with master data values using many-to-one lookup sets
- employ field cleansing routines and third-party standardization libraries (e.g. from Trillium or Melissa Data)
- produce summary reports and query-ready runtime logs for analysis and compliance audits
Conclusion
For higher data quality, we have proposed:
- assigning responsibility for developing and maintaining data quality rules
- centrally recording and controlling metadata
- creating extensible programs that apply and verify the rules
- running problem discovery programs during non-busy times
- continuing rule development and verification as ongoing procedures