SCD Type 2
Dimensional data that change slowly or unpredictably are captured in Slowly Changing Dimensions (SCD) analyses. In a data warehouse environment, a dimension table has a primary key that uniquely identifies each record and other pieces of information that are known as the dimensional data.
All the update methods for the different SCD types can be accomplished using the SortCL program in IRI CoSort. In the IRI Voracity data integration (ETL) and data management platform GUI, there is a fit-for-purpose SCD job creation wizard (see below) that builds the SortCL script(s) you need.
Most SCD types use a full outer join to match records from the original data source with records in the update source based on equating a key from each. Records with matches need to be updated in the master. Records in the update source that do not have a match need to be added to the master.
Here is an overview of how to update a dimensional file using SCD Type 2 where I am maintaining product costs. The update is accomplished by joining with respect to field ProductCode.
Type 2 SCD
In this model, the current and the historical records are kept in the same file. In an active database you would likely have a surrogate key to use as the primary key for linking to the fact tables in addition to the ProductCode key which is used for the updating process. Because the current records and the historical records are contained in the same file, it is necessary to have a field that indicates if the record is the current record for the ProductCode and we need a field to indicate when the cost for the ProductCode is no longer effective. In this example we have:
- ProductCode : This is the identifier key field.
- Cost: Cost that became effective on the StartDate for the record.
- StartDate: this is the date at which the cost for the record became effective.
- EndDate: This is the date when the Cost in the record is no longer effective. If the Cost is still effective, then the EndDate will be set to 99991231. This is to avoid a null value in this field.
- Current: Y if the cost is still in effect, N if it is not.
The starting table already has 3 history records. They are the ones that have the value N for the field Current. The master source is called master2.dat and it contains the below data:
ProductCode | Cost | StartDate | EndDate | Current |
---|---|---|---|---|
C123 | 125.50 | 20110228 | 99991231 | Y |
F112 | 2365.00 | 20120101 | 99991231 | Y |
G101 | 19.25 | 20110930 | 99991231 | Y |
G101 | 21.25 | 20110501 | 20110930 | N |
J245 | 450.50 | 20110430 | 99991231 | Y |
J245 | 425.25 | 20101001 | 20110430 | N |
J245 | 385.25 | 20100215 | 20101001 | N |
S022 | 98.75 | 20110515 | 99991231 | Y |
The update records all have the same StartDate. The update.dat source contains records with the following field values:
ProductCode | Cost | StartDate |
---|---|---|
F112 | 2425.00 | 20120701 |
J245 | 550.50 | 20120701 |
M447 | 101.75 | 20120701 |
S022 | 101.75 | 20120701 |
In IRI Workbench, there is a Voracity wizard to assist in the creation of scripts for updating Dimensional files and tables. This wizard is located in the Voracity dropdown on the navigation bar. First you pick the SCD type. Then the window where you select the sources that are used for processing the update is displayed. For Type 2, the target is normally the original master file or table.
With the next screen, you determine how the update data is mapped and how other field or column values are set. The update.Cost will map to the Master2.cost and the update. StartDate will map to the master2.StartDate. The Flag Field is the field that is used to determine the record with the active Cost. That is, whether the values in the record are active or historical. In this case, the Flag Field is the field Current. The Flag Positive Value is the value in Current that determines if the Cost is the current Cost; the value is “Y” in our example and the Flag Negative Value is “N”. End Field contains the name of the field that holds the value used to determine when the Cost for the record is no longer effective and Master.EndDate holds that value. End Value is used as the value for End Field when the record contains the current values for a ProductCode.
The next screen is for defining the join performed with the master and update sources. You should note that both sources must be ordered with respect to ProductCode. If they are not, then you need to select NOT_SORTED in the dropdown for Sort Order Option under the data source that needs to be sorted.
Here is the job script:
/INFILE=C:/IRI/CoSort95/workbench.orig/workspace/SCD/SCD2/master2.dat /PROCESS=DELIMITED /ALIAS=master2 /FIELD=(PRODUCTCODE, TYPE=ASCII, POSITION=1, SEPARATOR=",", FRAME='\"') /FIELD=(COST, TYPE=NUMERIC, POSITION=2, SEPARATOR=",", FRAME='\"') /FIELD=(STARTDATE, TYPE=ASCII, POSITION=3, SEPARATOR=",", FRAME='\"') /FIELD=(ENDDATE, TYPE=ASCII, POSITION=4, SEPARATOR=",", FRAME='\"') /FIELD=(CURRENT, TYPE=ASCII, POSITION=5, SEPARATOR=",", FRAME='\"') /INFILE=C:/IRI/CoSort95/workbench.orig/workspace/SCD/update.dat /PROCESS=DELIMITED /ALIAS=update /FIELD=(PRODUCTCODE, TYPE=ASCII, POSITION=1, SEPARATOR=",", FRAME='\"') /FIELD=(COST, TYPE=ASCII, POSITION=2, SEPARATOR=",", FRAME='\"') /FIELD=(STARTDATE, TYPE=ASCII, POSITION=3, SEPARATOR=",", FRAME='\"') /JOIN FULL_OUTER master2 update WHERE MASTER2.PRODUCTCODE == UPDATE.PRODUCTCODE /OUTFILE=master2.dat # Include only records that are being updated # Use the Cost and StartDate from the Update file /PROCESS=DELIMITED /FIELD=(MASTER2.PRODUCTCODE, TYPE=ASCII, POSITION=1, SEPARATOR=",", FRAME='\"') /FIELD=(UPDATE.COST, TYPE=NUMERIC, POSITION=2, SEPARATOR=",", FRAME='\"') /FIELD=(UPDATE.STARTDATE, TYPE=ASCII, POSITION=3, SEPARATOR=",", FRAME='\"') /FIELD=(MASTER2.ENDDATE, TYPE=ASCII, POSITION=4, SEPARATOR=",", FRAME='\"') /FIELD=(MASTER2.CURRENT, TYPE=ASCII, POSITION=5, SEPARATOR=",", FRAME='\"') /INCLUDE WHERE MASTER2.PRODUCTCODE == UPDATE.PRODUCTCODE AND MASTER2.CURRENT EQ “Y" /OUTFILE=master2.dat # Change any current records that are being updated to history records # by giving the EndDate as the StartDate from the update record # and changing the field Current to N /PROCESS=DELIMITED /CONDITION=(MATCH, TEST=(MASTER2.PRODUCTCODE == UPDATE.PRODUCTCODE AND MASTER2.CURRENT EQ "Y")) /FIELD=(MASTER2.PRODUCTCODE, TYPE=ASCII, POSITION=1, SEPARATOR=",", FRAME='\"') /FIELD=(MASTER2.COST, TYPE=NUMERIC, POSITION=2, SEPARATOR=",", FRAME='\"') /FIELD=(MASTER2.STARTDATE, TYPE=ASCII, POSITION=3, SEPARATOR=",", FRAME='\"') /FIELD=(ENDDATE_NEW, TYPE=ASCII, POSITION=4, SEPARATOR=",", FRAME='\"', IF MATCH THEN UPDATE.STARTDATE ELSE MASTER2.ENDDATE) /FIELD=(CURRENT_NEW, TYPE=ASCII, POSITION=5, SEPARATOR=",", FRAME='\"', IF MATCH THEN "N" ELSE MASTER2.CURRENT) /OMIT WHERE MASTER2.PRODUCTCODE EQ "" /OUTFILE=master2.dat # Add new records /PROCESS=DELIMITED /FIELD=(PRODUCTCODE, TYPE=ASCII, POSITION=1, SEPARATOR=",", FRAME='\"') /FIELD=(UPDATE.COST, TYPE=NUMERIC, POSITION=2, SEPARATOR=",", FRAME='\"') /FIELD=(UPDATE.STARTDATE, TYPE=ASCII, POSITION=3, SEPARATOR=",", FRAME='\"') /FIELD=(ENDDATE_NEW="99991231", TYPE=ASCII, POSITION=4, SEPARATOR=",", FRAME='\"') /FIELD=(CURRENT_NEW="Y", TYPE=ASCII, POSITION=5, SEPARATOR=",", FRAME='\"') /INCLUDE WHERE UPDATE.PRODUCTCODE NE "" AND MASTER2.PRODUCTCODE EQ ""
The new master file will have these values:
ProductCode | Cost | StartDate | EndDate | Current |
---|---|---|---|---|
C123 | 125.50 | 20110228 | 99991231 | Y |
F112 | 2425.00 | 20120701 | 99991231 | Y |
F112 | 2365.00 | 20120101 | 20120701 | N |
G101 | 19.25 | 20110930 | 99991231 | Y |
G101 | 21.25 | 20110501 | 20110930 | N |
J245 | 550.50 | 20120701 | 99991231 | Y |
J245 | 450.50 | 20110430 | 20120701 | N |
J245 | 425.25 | 20101001 | 20110430 | N |
J245 | 385.25 | 20100215 | 20101001 | N |
M447 | 101.75 | 20120701 | 99991231 | Y |
S022 | 101.75 | 20120701 | 99991231 | Y |
S022 | 98.75 | 20110515 | 20120701 | N |