Drawing Values from Set Files

by Paul Friedland

This article looks at sets from an informational processing perspective; what they are; how they are constructed; and, distinct ways in which data can be drawn from sets within IRI software products using the SortCL data definition and processing program; i.e., in CoSort and NextForm to replace values, FieldShield for pseudonomization, RowGen for random value selection, and the IRI Voracity platform (which includes them all) for lookup transforms in ETL and data cleansing jobs.

Defining Terms

A data set is a collection of objects that are grouped together because they are instances of the same kind. In an informational environment, a set could be the shared repository of everything of that class or, at least, everything that is meaningful to the enterprise. The objects are not necessarily ordered or unique. A set file is also valuable because it is reusable and application-independent.

In IRI software applications, sets are represented in ASCII files with ASCII character values. Each entry in the set, or row, or object, is a line-feed terminated record. IRI ships with several set files, has wizards to help you generate them from DB columns or other files, or you can simply supply your own.

An object can also be a composite; for example, name-address-phone number (presumably attributes of one person). Composite values are separated by a “tab” character, which can be easily seen on paper or a monitor. Every line would then have the same number of composite items; and if an item were null, the tab would be present.

An object can be a numeric range, either open or closed. It that object is drawn, the system will derive a value in that range and return it to the caller. Data values, such as binary data that naturally contain line feeds or tabs, cannot use the lookup methods shown here.

The objects can be derived from a database, produced in a production, edited by hand, or drawn from other sets. Once built, we will draw objects for further processing. We now address how the same sets can be referenced and used for different purposes.

In Practice

The “SortCL” processing language of the IRI Voracity platform and its component products (CoSort, NextForm, FieldShield, RowGen) has syntax for extracting a value from a set and assigning it to a field.

Set values can be used in: Voracity ETL jobs for lookup transforms; in CoSort scripts for data mapping, enrichment/validation, and/or reporting; in NextForm file or database table migrations or updates; in FieldShield pseudonymization jobs; and, in RowGen test data synthesis jobs when random selections from real data are preferred over artificial value generation.

A basic reference would look like this:

/FIELD = (City, SET <select> = State_City.set [indices])

There are additional parameters regarding the position of City in the record which are well known to the above users. Consider just the optional [indices] portion:

With no indices, the draw is based on the relative position of the line in the set (e.g., first, last, or random).
With an index or indices, the selection is based on the content of the data in the set (e.g., the entry whose name is Smith).

Drawing by Position

Without regard to content, you can draw a value from the set by selecting a row or line number which corresponds to a record number of the set file.

RowGen (test data generation) applications make random draws. Note the benefit of having sets with duplicate entries, as:

Male
Female
Male

Over time, random draws will be 2 to 1 Male. No matter the number of different objects, any desired weighting can be achieved to any level of granularity.

Other types of draws make successive draws to capture every object in the set with/without repeating any entry. The concern now is what to return when there are more accesses than rows.

The <select> option is one of the following:

ANY (default)	uses a random number [e.g., 1,20,000] as a line number
ALL	begin at line 1; subsequently increment by 1, after last, repeat
ONCE	same as ALL but after the last line, return NULL
SUFFIX	same as ALL but after the last, restart the list affixing a counter number at the end of the table entry
FIRST	select the first line of the set
LAST	select the last line of the set
ROW[n]	select the n’th composite from an ANY draw
PERMUTE	every combination of composite items from a draw

Drawing by Content

These are the fields statements where set indices are given. Indices are one or more comma separated string literals or field names. The number of indices is very important.

A reference into a table will have less indices than there are columns in the table. A trivial example of this is a one-column table which will return a random value when given zero search arguments. In this way, a positional draw described above is just a special case of a content dependent draw.

More typically, a table will have two or more columns. Say we had a short two-column table of state name and capitol city. If the search argument is the state name; the system would then return its capitol city. Two-column tables are used widely for pseudonymization and restoration, daily pricing, and anywhere that variable one-to-one mapping is required.

To draw a row from a specified row, it is necessary to impose organization. The set is considered as a database table, spreadsheet, or other flat-file organization. Viewed vertically, the composite values are the columns. To efficiently find a row with a specific value, the table needs be ordered; i.e., sorted from left to right in the order in which the indices will be presented. This allows a search to use logarithmic (log2n) algorithm rather than linear.

In this discussion, we will be drawing cities from a set file that contains 20,000 state-county-city triples representing every city in the United States. Every triplet will be unique.

The state_county_city.set file contains this subset of tab-delimited values from Pennsylvania:

Penn	 Adams		Abbottstown
Penn	 Adams		Benderville
Penn	 Adams		Biglerville
...
Penn	 Adams		Yorksprings
...
Penn	 Allegheny	Allison Park
Penn	 Allegheny	Aspenwalll
...
Penn	 Allegheny	Wilmerding
...
Penn	 Wyoming	Windham
Penn	 York		Dallastown
Penn	 York		Delta
...
Penn	 York		Yorklyn

For a table where n = 20,000, a specific row can be found, on average, in about 14 tries rather than 10,000. Just as with a database search, to find a specific row, the user needs to provide a key by way of value parameter which is then matched to a value in the table.

The search returns that part of the table entry which follows the index. Refer to the Penn entry in the
table. The statement:

/FIELD = (City, SET = State_City.set [“Penn”, “York”])

Since there is some 50 cities in York County so a possible result would be “Yorklyn” While:

/FIELD = (County_City, SET = State_City.set [“Penn”])

would return something like. “Allegheny Aspenwalll”. You could pare off the “Aspenwall” portion if desired. And if you want to be specific, you would need to use the <select> option, as in:

/FIELD = (County_City, SET FIRST = State_City.set [“Penn”])

which would return “Adams Abbottstown” which is the first entry in the Penn’s.

Or on two indices,

/FIELD = (County_City, SET FIRST = State_City.set [“Penn”, York”])

which is “Dallastown”

Conclusion

We have shown that sets — whether used for random access or to provide table searching — can be used in a wide variety of applications. Independent of access, sets provide a standard reference and give structure to enterprise data.

It should be clear that the number of indices can be extended to any depth. If your application does a lot of searching, you might consider the use of set files because of their simplicity and speed.

There are more options available for searching sets which await my next article. If you are interested in another manifestation, see this page on Slowly Changing Dimensions.

Bucketing Data Values Using Set Files

How to Generate DB Test Data