Bucketing Data Values Using Set Files

by Don Purnhagen

In data warehouses, it is common to map discrete data values to a set of ranges. This makes it easier to write queries that involve a range of discrete values. Consider the following two SQL queries that both generate information about sales during a fiscal quarter:

SELECT MIN(QTY_SOLD), MAX(QTY_SOLD), SUM(QTY_SOLD) FROM SALE WHERE
        SALE_DATE >= TO_DATE('20110101', 'yyyymmdd') AND
        SALE_DATE < TO_DATE('20110401', 'yyyymmdd');

SELECT MIN(QTY_SOLD), MAX(QTY_SOLD), SUM(QTY_SOLD) FROM SALE WHERE
        SALE_QTR == '2011Q1';

You can see that the second query has a simpler WHERE clause. When working with complex join queries, involving many dimension tables, the complexity of the first approach increases with each discrete value range that must be filtered.

When data is moved from a production system to a data warehouse employing data ranges, the discrete data values must be replaced with identifiers that will map them into the proper range, or bucket. When using the IRI Voracity ETL platform, or a component IRI product such as CoSort, NextForm, FieldShield, or RowGen, to move data into a data warehouse, there are a couple of techniques that can be used to assign data values to a particular bucket.

For this example^[1], we will use a table of items, which includes the item price. When we load the dimension table for the items in the data warehouse, we want to assign the items to one of three price ranges. This is a simplistic, and somewhat contrived example, but the concept will extend to more realistic use cases.

Below, ITEM.ITEM_PRICE is a column which contains a number in the range of 0.01 to 99.99. We want to group the items into buckets by price range. The price range dimension table looks like this:

Table: ITEM_PRICE_RANGE
PRICE_RANGE	PRICE_START	PRICE_END
LOW	00.01	19.99
MID	20.00	69.99
HIGH	70.00	99.99

The simplest way to assign the items into the three buckets, as described above, would use an IF-THEN-ELSE statement in the output section of your (SortCL) job (script). The output field for PRICE_RANGE could contain:

/FIELD=(PRICE_RANGE, IF PRICE <= 19.99 THEN "LOW" \
        ELSE IF PRICE <= 69.99 THEN "MID" \
        ELSE "HIGH"

When the range of discrete values is greater, and there are many more buckets however, this approach gets unwieldy. A different approach using set files can simplify your job scripts.

A set file for IRI products is a plain text file with data values on each line. A simple set file contains just one data value per line, and is often used to select random, realistic looking values when generating test data with IRI RowGen. There can also be set files with multiple, related data values on each line, separated by a tab character (ASCII 9).

For the purpose of assigning bucket identifiers in this case, we will use a set file where the end value of the range is the first value, and the name of the range is the second value. Consider this file, that we saved with the filename price-range.set:

19.99        LOW
69.99        MID
99.99        HIGH

The use of this technique requires a much shorter, and simpler field statement in the IRI job script. Furthermore, a change in the script is not necessary to change the bucket start or end points.

To change the range for a bucket, just edit the set file values. You can even add additional buckets without any further modification of the job script field statement.

The field statement to use a set file as described above looks like this:

/FIELD=(BUCKET, TYPE=ASCII, POSITION=2, SEPARATOR="\t", \
        SET="price-range.set"[PRICE] DEFAULT="*" SEARCH=GE)

If we wanted to increase the number of buckets to five in this example, we could edit the price-range.set file to look like this:

09.99        LOW
24.99        MIDLOW
49.99        MID
69.99        MIDHIGH
99.99        HIGH

We could run the IRI job script again with the new set file, and the bucket values would change as the new set file is used.

You can even use an IRI job script to build the set file from the range table. Here is a sample script that will extract the range end and index from our sample table:

# Generated with the IRI Workbench - New Reformat Job
# Extract the ID and END value for a lookup set to select buckets.
/INFILE="SCOTT.ITEM_PRICE_RANGE;DSN=Oracle;"
    /PROCESS=ODBC
    /FIELD=(PRICE_RANGE, TYPE=ASCII, POSITION=1, SEPARATOR="\t")
    /FIELD=(PRICE_END, TYPE=ASCII, POSITION=2, SEPARATOR="\t")
/REPORT
/OUTFILE=price-range.set
    /FIELD=(PRICE_END, TYPE=NUMERIC, POSITION=1, SEPARATOR="\t", \
        SIZE=5, PRECISION=2, FILL='0')
    /FIELD=(PRICE_RANGE, TYPE=ASCII, POSITION=2, SEPARATOR="\t")

To put it all together, here is an IRI job script that would read the item values from a relational table, and load them into a dimension table. It also converts the price into a bucket based on the range values we extracted from the ITEM_PRICE_RANGE table:

# Generated with the IRI Workbench - New Sort Job
# Load item dimension table, converting price into a bucket value.
/INFILE="ITEM;DSN=Oracle;"
/PROCESS=ODBC
    /FIELD=(ITEM_ID, TYPE=NUMERIC, POSITION=1, SEPARATOR="\t")
    /FIELD=(ITEM_NAME, TYPE=ASCII, POSITION=2, SEPARATOR="\t")
    /FIELD=(PRICE, TYPE=NUMERIC, POSITION=3, SEPARATOR="\t")
    /FIELD=(CATEGORY_ID, TYPE=NUMERIC, POSITION=4, SEPARATOR="\t")
/SORT
    /KEY=(PRICE)
/OUTFILE="ITEM_DIM;DSN=Oracle;"
    /PROCESS=ODBC
    /FIELD=(ITEM_ID, TYPE=ASCII, POSITION=1, SEPARATOR="\t")
    /FIELD=(ITEM_NAME, TYPE=ASCII, POSITION=2, SEPARATOR="\t")
    /FIELD=(PRICE_BUCKET, TYPE=ASCII, POSITION=3, SEPARATOR="\t", \
        SET="price-range.set"[PRICE] DEFAULT="*" SEARCH=GE)
    /FIELD=(CATEGORY_ID, TYPE=ASCII, POSITION=4, SEPARATOR="\t")

[1] This example uses IRI job scripts to move and transform the data values. For a primer on such scripts, see the SortCL program chapter in this CoSort technical overview booklet.

UUID Support in IRI Products

Drawing Values from Set Files