Data Quality Rules in IRI Workbench
IRI Workbench now contains a section of data cleansing, enrichment, and validation rules called Data Quality Rules for use in IRI CoSort data transformation and IRI Voracity ETL, reporting and data wrangling jobs. The selected rules automatically serialize many of the data quality functions, long supported in their common SortCL program, into task scripts or batch workflows ready for execution, modification, sharing, or expansion.
While some of these rules may be found in other categories, there is now a centralized page for them. The data quality rules are further divided into Field Rules and Section Rules, depending on how they are applied in SortCL scripts.
Field Rules
Alignment
This rule is used to align data. It is helpful for unifying the format of data, and in generating values for lookup (set) files where their column data need to be right- or left-aligned.
When used in the virtual record layout (inrec) or target (output) section of a job script, this rule aligns the field string without its leading or trailing fill characters. The remaining length of the target field is populated with spaces or a pad character.
For example, on this input source:
McKinley William 1897-1901 Roosevelt Theodore 1901-1909 Taft William H. 1909-1913 Wilson Woodrow 1913-1921 Harding Warren G. 1921-1923
the president field can be shifted to the right using right-align.
McKinley William 1897-1901 Roosevelt Theodore 1901-1909 Taft William H. 1909-1913 Wilson Woodrow 1913-1921 Harding Warren G. 1921-1923
Compound Data Values
Compound data values allow you to specify a standard format for data that you can use for enrichment, lookup transformations, and synthesis (test data generation). This rule does two things: It creates a set file containing the compound data and assigns that set file in the rule.
In the example below, a US-formatted telephone number is created in the wizard using parenthesis and a dash and populated with either custom-ranged or randomly-generated numbers:
String Quality Functions
This rule opens a page with a list of functions that can be applied to strings. It contains items such as changing case, substring, and trim.
For example, sub_string can change this input:
McKinley, William Roosevelt, Theodore Taft, William H. Wilson, Woodrow Harding, Warren G.
to this output:
McKin Roose Taft, Wilso Hardi
Section Rules
Contains Value Test
This rule applies a test to a field to see if the value specified is included in the field. That result is used to filter rows specified in the section or in a conditional if-then-else statement that creates a new field with data specific to whether the test passed or failed. See the Lookup Value Test below for an example with an if-then-else-statement.
For example, this condition and filter created by the details in the screenshot below,
/CONDITION=(CONTAINS_TEST, TEST=(isholding(LAST_NAME, "Smith"))) /INCLUDE WHERE CONTAINS_TEST
sends only qualifying rows (where the last name is Smith) to the target.
Data Type Test
This rule can be used to validate and include or exclude data on the basis of its actual type. Multiple types of data type or format tests are available from this function’s drop-down menu. The result is used in the section as a filter or in an if-then-else statement.
For example, this condition and filter created by the details in the screenshot,
/CONDITION=(DATA_TYPE_TEST, TEST=(isempty(LAST_NAME))) /OMIT WHERE DATA_TYPE_TEST
omits the record if the value of the last name field is empty.
Lookup Value Test
This rule checks to see if the value in the specified field is included in the specified set file or table. The result is used in the section as a filter or in an if-then-else statement.
For example, the details in this screenshot create a field created from the set file to test against, a condition to compare the value from the set file to make sure it matches that data (and doesn’t return null), and another field that uses that condition to return the desired value.
/FIELD=(LOOKUP_LAST_NAME, TYPE=ASCII, POSITION=9, SEPARATOR="\t", SET="C:/IRI/cosort100/sets/names/names_last.set" [LAST_NAME] DEFAULT="FALSE") /CONDITION=(LOOKUP_LAST_NAME_TEST, TEST=(NOT ispattern(LOOKUP_LAST_NAME, "FALSE"))) /FIELD=(IF_FIELD, TYPE=ASCII, POSITION=10, SEPARATOR="\t", IF LOOKUP_LAST_NAME_TEST THEN LAST_NAME ELSE "Missing")
The result is if the LAST_NAME is contained in the set file, it will return the last name, else it will return “Missing”.
Pattern Test
This rule compares the value in the specified field to a pattern to determine if it matches. The result is used in the section as a filter or in an if-then-else statement.
For example, this condition and filter created by the details in the screenshot,
/CONDITION=(PATTERN_TEST, TEST=(ispattern(EMAIL, "\\b[\\w._%+-]+@[\\w.-]+\\.[A-Za-z]{2,4}\\b"))) /INCLUDE WHERE PATTERN_TEST
only includes records that have a properly formatted email address.
Value Range Test
Finally, this rule determines if the value in the specified field is in the range specified. The result is used in the section as a filter or in an if-then-else statement.
For example, these two conditions and two filters created by the details in the screenshot,
/CONDITION=(LOW_END, TEST=(SALARY LT 1500)) /CONDITION=(HIGH_END, TEST=(SALARY GT 5000)) /INCLUDE WHERE LOW_END /INCLUDE WHERE HIGH_END
test that the salary value is within the range of 1500 to 5000.
For more information on each rule, all the dialogs include help and tips upon hover to guide the user. If you need help implementing Voracity data mapping, data quality, data masking, or test data generation rules in IRI Workbench, contact or your IRI representative or email voracity@iri.com.