Working Towards Data Quality
Introduction
In this article, I suggest ways to move your company’s data toward a higher state of quality. Data reaches its highest quality when it meets the needs of the business. Those needs are expressed in rules regarding legitimate values and relationships. This in turn requires that:
- legitimate values and relationships are defined, and continually expanded and redefined;
- you have the means to verify that the rules are being met; and
- you correct the incorrect data accordingly.
This is not as hard as it may appear. First, the highest quality is not perfect data, but only operationally sufficient data. Second, most bad data is wrong independent of context, which means you can find it by discovering impossible or unlikely values. And fortunately, there are powerful tools to discover data that is possible, but wrong in context.
Overview
Data would be of the highest quality if it were an exact, complete, protected, and consistent representation of facts. In practical terms, an achievable goal is to assure that corporate data is of acceptably high quality; that is, “fit for their intended uses in operations, decision making, and planning” (J. M. Juran). This definition relieves you of having to do the impossible. Data does not need to be perfect, only good enough to get the work done. Again, this is achievable.
High quality data is developed through data governance: the cooperation of management, auditors, programmers, and data architects who develop and maintain a system of data standards. The practice known as master data management (MDM) develops and enforces those standards, and includes continuously monitoring and correcting the data against them.
To assess the return on investment in data quality improvement efforts, you must weigh the monetized costs of achieving and maintaining high quality data against the penalty of living with not-so-high quality data. Costs come in the form of planning, implementation, discovery, and remediation.
In short, garbage in produces garbage out. Unfortunately, the benefits of high quality data are usually not appreciated until something goes wrong because the data has led to an erroneous result.
Data Governance
High quality data must be defined for your enterprise. The definition takes the form of rules about the possible values and required relationships for the data. Data governance teams or designated information stewards may make these rules initially, but ultimately data architects, DBAs, and other IT users will have to manage within that framework and adjust the rules as the data grows. Slowly changing dimensions should also be recorded in data lineage repositories, and business rule changes in asset control repositories.
Sometimes the best that can be done is to control how the data is allowed to look. As a meaningful double negative, the data in the database may not be correct, but it does not look incorrect. For example, this article was written in February. This may or may not be true, but clearly it was not written in the month of Yraubef. This is critical knowledge in maintaining data quality, because at least when you know that the data is definitely wrong, you can correct it.
A simple plan begins with knowing the rules and centrally recording the possible values or relationships of each datum in an MDM repository (see Master Data Management below).
Some databases have a facility for recording these constraints and checking values on data entry. Not all databases work the same way, however, and flat files do not have this feature at all. We propose taking the metadata offline and enriching it. The offline records should be database-independent.
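As an illustration, here is a minimal sketch, using Python’s sqlite3 module and a hypothetical customer table, of how a database-level CHECK constraint rejects an impossible value at entry time. Flat files have no equivalent, which is why the external, database-independent checks described below are needed.

```python
import sqlite3

# In-memory database with a hypothetical customer table whose age column
# carries a declarative CHECK constraint (the rule: 0 <= age <= 150).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        cust_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        age     INTEGER CHECK (age BETWEEN 0 AND 150)
    )
""")

conn.execute("INSERT INTO customer (name, age) VALUES (?, ?)", ("Alice", 42))

try:
    # An impossible value is rejected at entry time by the database itself.
    conn.execute("INSERT INTO customer (name, age) VALUES (?, ?)", ("Bob", 207))
except sqlite3.IntegrityError as err:
    print("rejected on entry:", err)
```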
For each database table and file, consider at least creating a spreadsheet with columns for the following (a code sketch of one such record appears after the list):
- name of datum
- meaning
- source (table, file name)
- other selected metadata (e.g. security, frequency)
- any pertinent business rules that have to be followed in order for the data to be correct
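For example, here is a minimal, database-independent sketch of what one row of that spreadsheet might look like as a Python record; the field names and sample entries are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DatumRecord:
    """One row of the offline, database-independent metadata spreadsheet."""
    name: str             # name of datum
    meaning: str          # business definition
    source: str           # table or file name
    metadata: dict = field(default_factory=dict)   # e.g. security, frequency
    rules: list = field(default_factory=list)      # pertinent business rules

# Hypothetical entries; the same structure works for any database or flat file.
repository = [
    DatumRecord(
        name="age",
        meaning="customer age in whole years",
        source="customer",
        metadata={"security": "internal", "frequency": "on update"},
        rules=["0 <= age <= 150"],
    ),
    DatumRecord(
        name="party",
        meaning="registered political party",
        source="voters.csv",
        rules=["party in {REP, DEM, IND}"],
    ),
]
```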
Without even getting to the rules, you should be aware that across tables, when the same name has more than one meaning or when the same meaning has more than one name, there is a problem which eventually needs to be corrected. Ideally, this would also be application-independent, but that comes later.
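Here is a small sketch of how such naming conflicts could be detected from the (name, meaning, source) columns of the metadata spreadsheet; the entries shown are hypothetical.

```python
from collections import defaultdict

# Hypothetical (name, meaning, source) triples pulled from the metadata spreadsheet.
entries = [
    ("cust_no", "customer account number", "orders"),
    ("cust_no", "customer phone number",  "support_calls"),   # same name, different meaning
    ("account_id", "customer account number", "invoices"),    # same meaning, different name
]

by_name, by_meaning = defaultdict(set), defaultdict(set)
for name, meaning, source in entries:
    by_name[name].add(meaning)
    by_meaning[meaning].add(name)

homonyms = {n: m for n, m in by_name.items() if len(m) > 1}      # one name, many meanings
synonyms = {m: n for m, n in by_meaning.items() if len(n) > 1}   # one meaning, many names
print("same name, different meanings:", homonyms)
print("same meaning, different names:", synonyms)
```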
Rules
There are three basic rule types pertaining to data values:
- numeric ranges; e.g. 0 <= Age <= 150
- set membership; e.g. Party in/not in [REP, DEM, IND]
- data dependency; e.g. Distance = Rate x Time
These simple rules can become quite complex:
- rules can exist in combinations involving logical expressions
- membership in a set could refer to external sets
- data dependency could refer to data in different places.
Even with these complications, the ability to check such conditions would catch many instances of data errors. Just finding alphabetic characters where numeric characters should appear, and vice versa, would be useful.
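To make the three rule types concrete, here is a minimal sketch of a checker for a single record; the record layout and the specific rules are the hypothetical examples from the list above.

```python
# A minimal sketch of the three rule types applied to one record.
# The record layout (age, party, distance, rate, time) is hypothetical.

VALID_PARTIES = {"REP", "DEM", "IND"}   # set membership rule

def check_record(rec: dict) -> list[str]:
    """Return a list of rule violations for one record; empty means it passed."""
    errors = []

    # 1. Numeric range: 0 <= Age <= 150
    if not (0 <= rec["age"] <= 150):
        errors.append(f"age {rec['age']} outside 0..150")

    # 2. Set membership: Party in [REP, DEM, IND]
    if rec["party"] not in VALID_PARTIES:
        errors.append(f"party {rec['party']!r} not in {sorted(VALID_PARTIES)}")

    # 3. Data dependency: Distance = Rate x Time (within a small tolerance)
    if abs(rec["distance"] - rec["rate"] * rec["time"]) > 1e-6:
        errors.append("distance does not equal rate * time")

    return errors

print(check_record({"age": 207, "party": "XYZ", "distance": 10.0, "rate": 2.0, "time": 5.0}))
# -> ['age 207 outside 0..150', "party 'XYZ' not in ['DEM', 'IND', 'REP']"]
```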
Master Data Management
Those responsible for Master Data Management (MDM) are also charged with maintaining quality data, both in the master data itself and in the transaction data measured against it. They must develop, verify, and correct data so it conforms to business rules. We propose that MDM teams use data-verifying programs to check the data against those rules. There may not be many rules at the outset, so these programs will start small; they can be extended and executed as MDM initiatives grow. Such rules should be in a form useful to application developers before programming begins.
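One possible shape for such a data-verifying program, sketched under the assumption that the rules are exported from the central metadata spreadsheet as simple field/rule-type/parameter rows; as more rule types are recorded, the dispatcher grows with them.

```python
import csv

# Hypothetical central rule file, exported from the metadata spreadsheet,
# with one rule per line in the form:  field,rule_type,parameter
#   age,range,0:150
#   party,set,REP|DEM|IND

def load_rules(path):
    """Read the centrally recorded rules (field, rule_type, parameter)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, fieldnames=["field", "rule_type", "parameter"]))

def violations(record, rules):
    """Check one data record (a dict of field -> value) against every rule."""
    found = []
    for rule in rules:
        value = record.get(rule["field"])
        if rule["rule_type"] == "range":
            low, high = map(float, rule["parameter"].split(":"))
            try:
                ok = low <= float(value) <= high
            except (TypeError, ValueError):
                ok = False
            if not ok:
                found.append(f"{rule['field']}={value!r} outside {low}..{high}")
        elif rule["rule_type"] == "set":
            if value not in rule["parameter"].split("|"):
                found.append(f"{rule['field']}={value!r} not an allowed value")
        # new rule types (e.g. data dependencies) are added here as the MDM effort grows
    return found
```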
Inconsistencies
After the data pass the value tests, the next level of verification checks for consistency of the data across tables. In theory, a relational database is application-independent and normalized, so a given data value appears in only one table. But when the same value is stored in multiple tables, we need to deal with the problem of having more than one acceptable, but nonetheless different, value.
MDM users can use outer joins to find fields in two tables that should hold the same value but do not. Outer joins are used mostly to find changes between before and after versions of a table (change data capture), but there is no requirement that the records be the same. The two tables are matched on a common key and ordered on the value to be checked.
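For example, here is a sketch of that technique using a full outer join (via pandas) on two hypothetical tables that both store a customer’s ZIP code; rows present in only one table, or present in both but with different values, are flagged for review.

```python
import pandas as pd

# Two hypothetical tables that each store a customer's zip code.
billing = pd.DataFrame({"cust_id": [1, 2, 3], "zip": ["02134", "10001", "60601"]})
shipping = pd.DataFrame({"cust_id": [1, 2, 4], "zip": ["02134", "10002", "94105"]})

# Full outer join on the key; the indicator shows records missing from either table.
merged = billing.merge(shipping, on="cust_id", how="outer",
                       suffixes=("_billing", "_shipping"), indicator=True)

# Inconsistencies: present in both tables but with different values,
# or present in only one of the two tables.
mismatched = merged[(merged["_merge"] != "both") |
                    (merged["zip_billing"] != merged["zip_shipping"])]
print(mismatched)
```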
Automation
To automate master and transactional data quality improvement (and hopefully not unduly burdening production processes), we see two options:
Online — integrate data quality checks within SQL procedures, or inside or between data integration tasks such as extraction and loading. In the case of CoSort (below), data cleansing and standardization can be built directly into data transformation and reporting job scripts, without requiring a separate I/O pass.
Offline — External programs can check new data for errors and inconsistencies at the close of business or when traffic is low. Alternatively, that data can be off-loaded onto smaller machines, or evaluated on idle servers or networked PCs. Where execution speed is not critical, what does not get checked every night might wait for the weekend. The output of these programs will be ready for review and correction the next morning.
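Here is a minimal sketch of such an offline batch check, with hypothetical file names and a single illustrative rule; it splits the day’s extract into good and bad files so the bad file is ready for review the next morning.

```python
import csv
from datetime import date

# Hypothetical nightly batch: read the day's extract, split rows into
# "good" and "bad" files, and leave the bad file for review in the morning.
def nightly_check(extract_path, is_valid):
    stamp = date.today().isoformat()
    good_path, bad_path = f"good_{stamp}.csv", f"bad_{stamp}.csv"

    with open(extract_path, newline="") as src, \
         open(good_path, "w", newline="") as good, \
         open(bad_path, "w", newline="") as bad:
        reader = csv.DictReader(src)
        good_writer = csv.DictWriter(good, fieldnames=reader.fieldnames)
        bad_writer = csv.DictWriter(bad, fieldnames=reader.fieldnames)
        good_writer.writeheader()
        bad_writer.writeheader()
        for row in reader:
            (good_writer if is_valid(row) else bad_writer).writerow(row)
    return good_path, bad_path

# Example: rows pass only if the (hypothetical) age column holds a plausible number.
nightly_check("daily_extract.csv",
              lambda row: row.get("age", "").isdigit() and 0 <= int(row["age"]) <= 150)
```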
Resources
Not surprisingly, IRI provides tools for data validation and, in many cases, remediation. Our users know the IRI Workbench (Eclipse GUI) and/or the 4GL scripts of the CoSort SortCL language it supports, which enable them to automatically:
- acquire and integrate data from multiple files and tables
- apply include/omit logic to verify values and ranges
- search sets to confirm membership
- employ multi-table inner and outer joins to separate matches from non-matches
- direct output to multiple targets with good, bad, and dubious data
- perform conditional find and replace operations
- replace known ‘inferior’ values with master data values using many-to-one lookup sets
- employ field cleansing routines and third-party standardization libraries (e.g. from Trillium or Melissa Data)
- produce summary reports and query-ready runtime logs for analysis and compliance audits
Conclusion
For higher data quality, we have proposed:
- assigning responsibility for developing and maintaining data quality rules
- centrally recording and controlling metadata
- creating extensible programs that apply and verify the rules
- running problem discovery programs during non-busy times
- continuing rule development and verification as ongoing procedures