SCD Type 3
Dimensional data that change slowly or unpredictably is captured in Slowly Changing Dimensions (SCD) analyses. In a data warehouse environment, a dimension table has a primary key that uniquely identifies each record and other pieces of information that are known as the dimensional data.
All the update methods for the different SCD types can be accomplished using the SortCL data manipulation program in the IRI CoSort data transformation product, or the IRI Voracity ETL and data management platform. The latter includes a graphical SCD job creation wizard for each type of report. Click here to learn about data integration in Voracity generally.
Most SCD variations use a full outer join to match records from the original master data source with records in the updated source based on equating a key from each. Records with matches need to be updated in the master. Records in the update source that do not have a match need to be added to the master.
Here is an overview of how to update a dimensional file using SCD Type 3 where I am maintaining product costs. The update is accomplished by joining with respect to the field ProductCode.
Type 3 SCD
With the Type 3 model, the current data and the historical data are carried in the same record. One of the major things to decide is how much history should be kept. With our example, I will keep a cost for three start dates. Enddate is the date when the cost in the record is no longer effective. If the cost is still effective, then the EndDate1 field will be set to 99991231. This is to avoid an empty value in this field. Here are the values contained in the starting master source (master3.dat):
ProductCode | Cost1 | StartDate1 | EndDate1 | Cost2 | StartDate2 | EndDate2 | Cost3 | StartDate3 | EndDate3 |
---|---|---|---|---|---|---|---|---|---|
C123 | 125.50 | 20110228 | 99991231 | ||||||
F112 | 2365.00 | 20120101 | 99991231 | ||||||
G101 | 19.25 | 20110930 | 99991231 | 21.25 | 20110501 | 20110930 | |||
J245 | 450.50 | 20110430 | 99991231 | 425.25 | 20101001 | 20110430 | 385.25 | 20100215 | 20101001 |
S022 | 98.75 | 20110515 | 99991231 |
The update records all have the same StartDate. The update.dat source contains the following field values for the records:
ProductCode | Cost | StartDate |
---|---|---|
F112 | 2425.00 | 20120701 |
J245 | 550.50 | 20120701 |
M447 | 101.75 | 20120701 |
S022 | 101.75 | 20120701 |
Below is the updated data. For each record that was updated, each cost, startdate, and enddate is moved to the right by one set. In the old current group for the record, the enddate is changed to the startdate for the new updated current set. Any values for the old third set are discarded.
ProductCode | Cost1 | StartDate1 | EndDate1 | Cost2 | StartDate2 | EndDate2 | Cost3 | StartDate3 | EndDate3 |
---|---|---|---|---|---|---|---|---|---|
C123 | 125.50 | 20110228 | 9991231 | ||||||
F112 | 2425.00 | 20120701 | 9991231 | 2365.00 | 20120101 | 20120701 | |||
G101 | 19.25 | 20110930 | 9991231 | 21.25 | 20110501 | 20110930 | |||
J245 | 550.50 | 20120701 | 9991231 | 450.50 | 20110430 | 20120701 | 425.25 | 20101001 | 20110430 |
M447 | 101.75 | 20120701 | 9991231 | ||||||
S022 | 101.75 | 20120701 | 9991231 | 98.75 | 20110515 | 20120701 |
In the IRI Workbench GUI for Voracity, there is a wizard to assist in the creation of scripts for updating Dimensional files and tables. This wizard is located in the Voracity dropdown on the navigation bar. First you pick the SCD type. Then the window where you select the files that are used for processing the update is displayed.
In the top table on the next screen, you determine how the data being updated is mapped to the master. In the middle grouping on the screen, choose the fields that give the status values for the records that are being updated.
In the dropdowns under Current and Historical Field Sets, the first set that you define, gives the field names for current values. Value Field is the dimensional value that is being updated. Start Field is the field that determines when the Value Field became active and End Field determines when it is no longer valid. Then define as many sets as you need for the number of historical values that are being kept.
Here is the job script that will update the master file:
/INFILE=master3.dat /PROCESS=DELIMITED /ALIAS=master3 /FIELD=(PRODUCTCODE, TYPE=ASCII, POSITION=1, SEPARATOR=",") /FIELD=(COST1, TYPE=NUMERIC, POSITION=2, SEPARATOR=",") /FIELD=(STARTDATE1, TYPE=ASCII, POSITION=3, SEPARATOR=",") /FIELD=(ENDDATE1, TYPE=ASCII, POSITION=4, SEPARATOR=",") /FIELD=(COST2, TYPE=NUMERIC, POSITION=5, SEPARATOR=",") /FIELD=(STARTDATE2, TYPE=ASCII, POSITION=6, SEPARATOR=",") /FIELD=(ENDDATE2, TYPE=ASCII, POSITION=7, SEPARATOR=",") /FIELD=(COST3, TYPE=NUMERIC, POSITION=8, SEPARATOR=",") /FIELD=(STARTDATE3, TYPE=ASCII, POSITION=9, SEPARATOR=",") /FIELD=(ENDATE3, TYPE=ASCII, POSITION=10, SEPARATOR=",") /INFILE=update.dat /PROCESS=DELIMITED /ALIAS=update /FIELD=(PRODUCTCODE, TYPE=ASCII, POSITION=1, SEPARATOR=",", FRAME='\"') /FIELD=(COST, TYPE=ASCII, POSITION=2, SEPARATOR=",", FRAME='\"') /FIELD=(STARTDATE, TYPE=ASCII, POSITION=3, SEPARATOR=",", FRAME='\"') /JOIN FULL_OUTER master3 update WHERE MASTER3.PRODUCTCODE == UPDATE.PRODUCTCODE /OUTFILE=master3.dat # Include only records that are to be updated /PROCESS=DELIMITED /FIELD=(UPDATE.PRODUCTCODE, TYPE=ASCII, POSITION=1, SEPARATOR=",") /FIELD=(UPDATE.COST, TYPE=NUMERIC, POSITION=2, SEPARATOR=",") /FIELD=(UPDATE.STARTDATE, TYPE=ASCII, POSITION=3, SEPARATOR=",") /FIELD=(ENDDATE1_NEW="99991231", TYPE=ASCII, POSITION=4, SEPARATOR=",") /FIELD=(MASTER3.COST1, TYPE=NUMERIC, POSITION=5, SEPARATOR=",") /FIELD=(MASTER3.STARTDATE1, TYPE=ASCII, POSITION=6, SEPARATOR=",") /FIELD=(UPDATE.STARTDATE, TYPE=ASCII, POSITION=7, SEPARATOR=",") /FIELD=(MASTER3.COST3, TYPE=NUMERIC, POSITION=8, SEPARATOR=",") /FIELD=(MASTER3.STARTDATE3, TYPE=ASCII, POSITION=9, SEPARATOR=",") /FIELD=(MASTER3.ENDATE3, TYPE=ASCII, POSITION=10, SEPARATOR=",") /INCLUDE WHERE MASTER3.PRODUCTCODE == UPDATE.PRODUCTCODE /OUTFILE=master3.dat # Include only records that are not to be updated /PROCESS=DELIMITED /FIELD=(MASTER3.PRODUCTCODE, TYPE=ASCII, POSITION=1, SEPARATOR=",") /FIELD=(MASTER3.COST1, TYPE=NUMERIC, POSITION=2, SEPARATOR=",") /FIELD=(MASTER3.STARTDATE1, TYPE=ASCII, POSITION=3, SEPARATOR=",") /FIELD=(MASTER3.ENDDATE1, TYPE=ASCII, POSITION=4, SEPARATOR=",") /FIELD=(MASTER3.COST2, TYPE=NUMERIC, POSITION=5, SEPARATOR=",") /FIELD=(MASTER3.STARTDATE2, TYPE=ASCII, POSITION=6, SEPARATOR=",") /FIELD=(MASTER3.ENDDATE2, TYPE=ASCII, POSITION=7, SEPARATOR=",") /FIELD=(MASTER3.COST3, TYPE=NUMERIC, POSITION=8, SEPARATOR=",") /FIELD=(MASTER3.STARTDATE3, TYPE=ASCII, POSITION=9, SEPARATOR=",") /FIELD=(MASTER3.ENDATE3, TYPE=ASCII, POSITION=10, SEPARATOR=",") /OMIT WHERE MASTER3.PRODUCTCODE == UPDATE.PRODUCTCODE /OMIT WHERE MASTER3.PRODUCTCODE EQ "" /OUTFILE=C:/IRI/CoSort95/workbench.orig/workspace/SCD/SCD3/master3.dat # Add new records /PROCESS=DELIMITED /FIELD=(UPDATE.PRODUCTCODE, TYPE=ASCII, POSITION=1, SEPARATOR=",") /FIELD=(UPDATE.COST, TYPE=NUMERIC, POSITION=2, SEPARATOR=",") /FIELD=(UPDATE.STARTDATE, TYPE=ASCII, POSITION=3, SEPARATOR=",") /FIELD=(ENDDATE1_NEW="99991231", TYPE=ASCII, POSITION=4, SEPARATOR=",") /INCLUDE WHERE MASTER3.PRODUCTCODE EQ "" AND UPDATE.PRODUCTCODE NE ""