All About IRI Set Files: A Primer
Introduction: In the context of IRI software, set files are text files containing realistic or replacement values for data management applications in the IRI Voracity platform ecosystem. They are columnar lists of tab-separated values (TSV) that can be sourced externally or internally.
Here is an example of an online data source that can be converted to a TSV IRI set file format:
These dataset files are specified as input or output field attributes in SortCL-compatible job scripts that perform:
- Lookup transforms in CoSort-powered Voracity ETL jobs
- Bucketing discrete values into anonymous, general categories defined by named ranges
- Data validation in Voracity data quality/cleansing jobs
- Associating items for data quality, master data management, etc.
- Pseudonymizing (replacing) names or other values in IRI *shield data masking jobs
- Restoring pseudonyms to their original values or preserving RDB value pairs
- Synthesizing randomly selected realistic or valid pair test data in RowGen jobs
Set File Contents
Set files are text files composed of one or more tab-separated columns. The columns must contain ASCII readable characters, UTF8 characters, or EBCDIC characters.
Each row ends with a linefeed. Comments can be placed at the top of a set file and must be preceded by a # symbol.
Below are a couple of different set files. The one on the left, a two-column set, is typically used for consistent, reversible pseudonyms. The single-column set is more common for random selection.
Ways to Create Set Files
IRI provides a number of set files with the software. Set file data can also be created:
- With a text editor where all the set data is typed into the set file
- Using IRI Workbench by selecting New Set File … from the RowGen menu, then choose one of the wizards listed on the selection page. See this article for details.
- From the fields of data files where SortCL or another utility is used to extract the data
- From columns in a database using a database procedure, using the Set File from Column wizard in IRI Workbench, or by using a SortCL script.
By convention, set file names have a .set extension.
Types of Set Files
Here are descriptions of the types of set files that can be used:
1. Numeric Set File. These can have any combination of numbers and numeric ranges.
Each row is randomly selected, and if the row contains a range, a value is randomly selected from the range.
Ranges have lower and upper limits separated by a comma and enclosed with brackets. A value next to a square bracket is included in the range while a rounded bracket indicates that the value is not included.
Below is an example of a numeric set file:
2. Date set file. This can have any combination of dates and date ranges that conform to existing CoSort date, time, and timestamp data type formats. Date ranges are handled in the same way that numeric ranges are handled. The ISO_DATE set file below has a date selected from the February range 50% of the time, while selecting from the other two ranges 25% of the time.
Note that the value 2022-03-15 cannot be selected because there is a parenthesis rather than a square bracket next to that value.
3. Multi-column set file. These have two uses; i.e., either for random selections or lookups.
a) In the case of random selection:
- Any column in a multi-column set file can be used for a random pull
- The rows in a multi-column set file can be composed of related items. A row can be randomly selected and multiple items in the row can be used as field values in a record.
- For a random lookup, the first column is ordered but not unique because the second column will have multiple items relating to each of the items in the first column. This can be done with any number of columns, but the last column must have unique entries with respect to the previous column. Then a random selection is made from the last column.
For example, suppose you need to randomly select a city within a specific state and county for each input record. Let’s say that the input records contain these fields or columns:
and a CITY needs to be selected at random for each COUNTY and STATE combination within the input records. The output records will then have a CITY added to the records.
Suppose a particular input record contains these values:
The section of the set file state_county_city.set containing records for Brevard county in FL could have the following records:
Then, if the record in the set file is randomly selected from the above, the output record becomes:
b) In the case of lookups:
- For a 2-column lookup, the first column must be sorted and unique. With three or more columns, the first and second columns are sorted, and each column is unique relative to the prior column value. Values are searched for in columns from left to right, and a value in the last column value is used in the field.
For example, suppose you have input records with the following fields:
and you need to add the county for each CITY and STATE combination.
Suppose a particular input record contains these values:
The section of the set file state_city_county.set that contains records for cities in FL could have the following records that are ordered by STATE and then CITY:
By looking up the STATE FL and then the CITY Cocoa Beach, we find the COUNTY Brevard because each CITY and STATE combination is unique. The output record will be:
Specifying Set Files in Job Scripts
Sets are declared in /FIELD statements of SortCL-compatible job scripts following this syntax:
Here are more details for the arguments:
1. FieldName — Name of the field within the record being read/processed/written
2. FieldAttributes — Other field attributes used for formatting a field that will also have set
attributes
3. Set_Source — The set source must be enclosed in quotes and is one of the following:
- A quoted set file path and name
- An ODBC table name and connection string. When using an ODBC table for lookup values, there is another option required:
where lvalue is the name of the column in the table equivalent to the first or left-hand column of a set file, and rvalue corresponds to the right-hand column in a two-column set file.
Here is a possible field statement in the /OUTFILE section of a job script:
- Just a list of comma-separated values (instead of an external lookup file or table) enclosed in curly braces in the /FIELD statement; e.g., { value1, value2, … }. This is useful when there are only a few values, as in this case:
4. Search List — This is a list of comma-separated values used for a lookup and can also be used in pairs testing. There can be up to one less lookup value than the number of columns in the set file. The items in this list can be a field name or a literal separated by commas, while the list is enclosed in square brackets [ item1, item2, . . . ].
For example, here is the three-column set file state_county_city.set where the first column is state, the second is county, and the third is city:
The following job script, city_lookup.scl, selects random cities within each state and county combination:
The input file rlook.dat contains:
When the job script above is run, the output file rlook.out may contain:
The output field value for city was randomly selected for each output record using the state and county field values in the record for the random lookup.
5. DEFAULT=”string” — When a lookup search does not find a match in the set file, you can define a default string to place into the data file, like “No match found.” If no default is specified, and no match is found, execution stops and an error message is displayed.
An example field statement could be:
6. Order Options — These are used when the set file is being used as a lookup. A set file is assumed to be sorted unless otherwise indicated by the Order Option NOT_SORTED. The order options for a set file are:
- PRE_SORTED — Assumes the set file is already ordered correctly for doing a lookup with no duplicates for the appropriate column. This is the default setting.
- NOT_SORTED — The set file will be sorted internally prior to processing the job script. The sorted set file will not be saved.
7. Select Options — These are the options used for random draws:
- ANY — Values in the set file are randomly drawn from any row of the set file. Values can be drawn any number of times. ANY is the default.
- ONCE — Values are drawn from the rows of the set file sequentially starting at the top row of the set file. When all the rows have been drawn, then no more values from the set file will be inserted into the field in the data file, and that field will be empty for the remaining records.
- ALL — Values are drawn from the rows of the set file sequentially. All the rows will be used for values if the data file has a sufficient number of records to do this. If there are more data records than rows in the set file, the values drawn from the set file will be repeated.
- SUFFIX — As with ALL, values are drawn from set file rows sequentially, but if there are more data records than rows in the set file, an underscore and numeric value in sequence is added to values in the data file as the set file values repeat.
For example, field values will be similar to this when names are put into a field:
There are four first names in the set file. When the values in the set file started repeating in the data file, the suffix number was appended to the field value.
- ROW — This is used with multi-column set files where the items in each column of a row in the set file are related.
When a row in the set file is randomly selected, values from that row are placed in the same record that is being created with the job script. All fields in a record that have a value derived from the same row or line in a set file will invoke ROW with an index number based on the column number from where the value is drawn in the set file.
An example of a 3-column set file that uses ROW, is to have one column be a state, another be a county in that state, and the third be a city within the county.
Given a set file with three related columns consisting of state, county, and city respectively, here is an example of field definitions where ROW determines which column in the set file is used for the field value:
For any ROW in the set file 3column.set, city is located in the 3rd column, county is located in the 2nd column, and state is located in the 1st column. Those field definitions could yield records containing the values:
- PERMUTE — This is used in RowGen jobs where the input format uses the /PROCESS=RANDOM statement. For each field that has PERMUTE as the selection parameter for a set file, there will be an attempt to select all possible combinations for those set files, up to the limit of any /INCOLLECT value.To use all combinations but no more than that, set /INCOLLECT equal to PERMUTE. If /INCOLLECT is set to a number greater than all the combinations, the combination pattern will repeat.
For example, suppose you are generating records to get values for the fields for first name and for last name from the set files fname.set and lname.set plus generating ID numbers. Here are the set files:
To generate the values with all combinations of first and last name once, use the following script where /INCOLLECT is set to PERMUTE.
The output gives the following six records which have all the possible combinations for first and last name:
But suppose I want 12 records. The /INFILE phase of the script would specify /INCOLLECT=12 while keeping SELECT=PERMUTE for the fields fname and lname. The output will then be:
The first six records contain all the combinations for first and last name. The next six records repeat that combination. The combination of first and last name will continue to repeat for any number of records required by the /INCOLLECT value.
8. Search Options — These options describe how to use set files with parameters to obtain unequal values, as with Slowly Changing Dimensions.
- EQ — The default, where you are looking for an exact match in the set file.
- GT — Takes the first value in the set file greater than the search parameter.
- GE — If an exact match is not found, then the first value encountered that is greater than the search parameter is selected.
- LT — The search parameter is looking for the nearest value in the set file that is less than the search parameter.
- LE — If an exact match is not found, then the nearest value encountered that is less than the search parameter is selected.
I have an example using a search option on the set file where the value provided for the output record is less than or equal to the field value in the input record. The PastPrice.set file is a two-column set where the first column has dates in ascending order and the second column is the price on that date:
Given the input data file FindPrice.dat:
and this job script, FindPrice.scl:
the values in FindPrice.out are:
The date 20140105 is before the date of the first record in the set file; therefore, the DEFAULT string is put in the price field. The date 20140821 falls between the dates 20140802 and 20140917; therefore the price will be the 24.75 value associated with the 20140802 date.
If you have any questions about the use of set files or their options, please contact support@iri.com.