Creating a Voracity Flow from the Palette (Part 2…
This is the second of two articles on creating a new IRI Voracity® ETL job flow from the job design Palette in the IRI Workbench, built on Eclipse. Next, we will add more components to the existing flow, started in part 1, where we:
- created a basic flowlet
- dragged a Transform Mapping Block into the flowlet
- defined the Input Data after dragging the ODBC icon into the Input Data block
- defined the Action after dragging the Sort icon into the Action block
- defined the Output Data using the Target Field Layout editor after dragging the File icon into the Output Data block
- eliminated the columns AGE, EMAIL, DRIVING_LICENSE, and SSN when mapping the table columns to the fields for the output file PatientInfoProtected
- applied a mask to the fields CREDIT_CARD1 and CREDIT_CARD2 that masked all except the last four digits of the fields
- created a data definition format (DDF) metadata file for PatientInfoProtected
We will resume now by adding a second Transform Mapping Block to our flow.
Drag the Transform Mapping Block icon from the Palette into the Scratch flow created in the previous article and double click in that block. For Representation name, type Collections Transform Mapping Diagram and click OK to open the Transform Mapping Diagram in the editor. This diagram has an empty Input Data block, Action block, and Output Data block.
Define the Input Data (Source)
The output file PatientInfoProtected from the CoSort SortCL transform mapping block PatientInfoProtect.scl will be one of the inputs for the next transform mapping block Collections.scl. Drag the Input File icon into the Input Data block.
The Get File Format Information window opens, and we select the DDF (data definition file) metadata file for the input. From the drop-down menu, select our project name ScratchFlow. From the list in the Matching items box, select PatientInfoProtected.ddf and click OK.
In the next window, make sure the path is to our Project. The file PatientInfoProtected is an interim file and will not exist until we execute the Flow. Type PatientInfoProtected for File Name and click Open.
In this case, we know that the input file PatientInfoProtected will be in our project directory because the previous step in this flow places it there. As long as that is true, it is not necessary to include the path as part of the file name.
Click in the Section Options block for PatientInfoProtected. In the Properties view, select the Base tab, and change Name to In_PatientInfoProtected. Then click inside the flow diagram again.
There are two inputs required to perform a join. Drag an Input ODBC icon into the Input Data block to open the Database Information dialog. Select the Data source name(DSN).
The Table selection box is then populated with the available tables for that DSN. Click the check box for the NIGHTLY.PATIENTBILLING table and click Finish.
The Input Data block now has the two inputs with fields and Section Option blocks inside each white data block.
Create a Record Filter
We need to match each IDNUMBER with a PASTDUE value greater than zero in the PATIENTBILLING table to an IDNUMBER in the PatientInfoProtected file. Drag the Filter Include icon from Section Options in the Palette to the Section Options block for the PATIENTBILLING table.
The Record Filtering dialog opens. Under Filter Action click Include, and under Condition Type, click New Condition. Click the f … button next to the field for New Condition.
This opens the Expression Builder dialog. There is a box at the top with a red X to the right of where our expression or condition will be built. When the expression syntax is valid, the X becomes a green check mark.
There is a Categories box below and to the left. For each category clicked, options associated with it appear in the Items box on the right.
Click on the category Field Names, then double-click the item PASTDUE. PASTDUE is now in the Expression box. Select the category Relational Operators and double-click GT (for greater than). The GT is now in the Expression box.
Now, type in a space and a 0 in the Expression box. This completes our expression and we have a green check. Click Finish and then OK.
The record filter is now in the Section Options block for the input table PATIENTBILLING.
Define the Join
We want to match the rows in the PATIENTBILLING table to records in the PatientInfoProtected file with the same IDNUMBER; then send those matched records to the target table. This is an inner join.
Drag the Join icon from the Transform Action in the Palette into the Action block. The Join Sources window opens.
For the join to execute correctly, the inputs must be in order by the field to be matched. We know Data source 1 will be in order because the data was sorted in the previous step of the flow. Data source 2 will be in order by IDNUMBER because the order is maintained in the table.
If a source is not in order, we can select NOT_SORTED from the Sort order option for that source and CoSort will perform the sort with the join. The sources are assumed to be sorted by default.
Make sure Inner join is selected under Join type. Click IDNUMBER in both sources; then select Create Condition. The condition for matching the IDNUMBER is written into the Join conditions box. Click Finish.
There is now a white Join block in the Action block. Contained in that Join block is a gray field block with all the fields from each input and an orange Join Condition block with the condition defined in the Join Sources window. The joined fields acquire the naming convention of input_name.field_name so you know the source of each field.
Define the Output (Target)
The join we just defined will map fields to columns in the target table COLLECTION. Drag an Output ODBC icon into the Output Data block to open the Database Information dialog. Select the DSN from the drop-down.
Click the check box for the table NIGHTLY.COLLECTION and click Finish. Fields from the Action block are mapped to corresponding field names in the COLLECTION table.
Eliminate Double-Mapping
When doing the join, the field IDNUMBER was used to match the records from the two inputs and thus exists in both. By default, both fields are mapped into output from the Action block to allow for more granular mapping control of source fields.
However, since only one is normally needed, remove the blue Mapping Connection from one of the sources. Click the connection arrow that goes from the PATIENTBILLING.IDNUMBER in the Action block to IDNUMBER in the Output Data block and press delete.
Match Field Names to Column Names
Sometimes when your flow is processed, a final field name may not match the column name in the target table. The two fields MASK_CREDIT_CARD1 and MASK_CREDIT_CARD2 from the file PatientInfoProtected should map from the Action block to the columns CREDIT_CARD1 and CREDIT_CARD2 in the target table COLLECTION.
There are two conventions used for naming fields: one is the actual Field Name and the other is called Ext Field. The Field Name is used when processing the job script and the Ext Field is the column name in the table where a field value should load.
We, therefore, need to change the field names CREDIT_CARD1 and CREDIT_CARD2 in the target fields block to match field names in the Action block. Double-click on each field in turn. This will open the Target Field window where we can change the Field Name. Change each Field Name to MASK_CREDIT_CARD1 and MASK_CREDIT_CARD2 respectively.
Ext Field is found in the Layout tab of this window and contains the column name of the table where the value for the field will load. No change is needed for Ext Field since it matches the column name in the table COLLECTION. Click OK.
Now, create the new Mapping Connection from the Action block to the Output Data block for the fields MASK_CREDIT_CARD1 and MASK_CREDIT_CARD2. For each field, click the Mapping Connection icon in the Palette.
Cick on the field in the Action block, and then on the field in the Output Data block. The Collections Transform Mapping Diagram is now complete.
Give the Job a Name
Click in the area outside the blocks for the overall Properties view and select the Base tab. Change both the Name field and the IRI Job field to collections.scl; then click in the diagram. The first field names the Transform Mapping Block represented in the above diagram, and the second field names the SortCL Job script that will be created from that diagram.
Go back to the Scratch flow diagram in the editor. Connect the outfile from the PatientInfoProtect.scl block to the PatientInfoProtected input file in the collections.scl block. Below is the complete Scratch flow diagram.
Serialize the Job for Execution
To create the job scripts PatientInfoProtect.scl and collections.scl, and the batch file Scratch.bat, right-click in the editor > IRI Diagram Actions > Export Flow Component. Select Windows from the Platform drop-down and verify that Scratch.bat is in the File name field.
Scratch.bat is the name of the file that will execute the flow. Click Finish. The CoSort SortCL job scripts and the batch file are created and placed in the project.
If you have other commands that need to be executed in the batch script, see Creating Voracity Flows Using Existing IRI Scripts (Part 3). This covers the use of Decision and CommandLine blocks. Note that there are other blocks available, including blocks for SQL and FTP commands, IRI FACT (fast extract), and bulk DB load blocks for several loaders.
Run the Job
You can choose to run the individual tasks or the whole batch script in IRI Workbench through the:
- Run menu (green arrow icon) on the navigation bar
- Run As menu (right-click on any .scl, or .bat / .sh script in the project explorer)
- built-in task scheduler
You can also run the tasks outside IRI Workbench on the command line or with any third-party job scheduler, like Universal Automation Center.
Here is the batch script Scratch.bat:
@echo off sortcl /SPECIFICATION=PatientnfoProtect.scl sortcl /SPECIFICATION=collections.scl
And here is the script in the SortCL language for the transformation collections.scl:
/INFILE=C:/IRI/CoSort95/workbench/workspaceFlow2/ScratchFlow/PatientInfoProtected /ALIAS=PatientInfoProtected /PROCESS=RECORD /FIELD=(IDNUMBER, TYPE=ASCII, POSITION=1, SEPARATOR="\t", EXT_FIELD="IDNUMBER") /FIELD=(NAME, TYPE=ASCII, POSITION=2, SEPARATOR="\t", EXT_FIELD="NAME") /FIELD=(PHONE, TYPE=ASCII, POSITION=3, SEPARATOR="\t", EXT_FIELD="PHONE") /FIELD=(MASK_CREDIT_CARD1, TYPE=ASCII, POSITION=4, SEPARATOR="\t", EXT_FIELD="CREDIT_CARD1") /FIELD=(MASK_CREDIT_CARD2, TYPE=ASCII, POSITION=5, SEPARATOR="\t", EXT_FIELD="CREDIT_CARD2") /FIELD=(CITY, TYPE=ASCII, POSITION=6, SEPARATOR="\t", EXT_FIELD="CITY") /FIELD=(STATE, TYPE=ASCII, POSITION=7, SEPARATOR="\t", EXT_FIELD="STATE") /INFILE="NIGHTLY.PATIENTBILLING;DSN=Oracle_qa2;UID=nightly;PWD=N321ghtly;" /ALIAS=NIGHTLY_PATIENTBILLING /PROCESS=ODBC /FIELD=(IDNUMBER, TYPE=ASCII, POSITION=1, SEPARATOR="\t", EXT_FIELD="IDNUMBER") /FIELD=(PASTDUE, TYPE=NUMERIC, POSITION=2, SEPARATOR="\t", EXT_FIELD="PASTDUE") /FIELD=(CURRENTDUE, TYPE=NUMERIC, POSITION=3, SEPARATOR="\t", EXT_FIELD="CURRENTDUE") /FIELD=(BALANCE, TYPE=NUMERIC, POSITION=4, SEPARATOR="\t", EXT_FIELD="BALANCE") /INCLUDE WHERE PASTDUE GT 0 /JOIN INNER PatientInfoProtected NIGHTLY_PATIENTBILLING WHERE PATIENTINFOPROTECTED.IDNUMBER == NIGHTLY_PATIENTBILLING.IDNUMBER /OUTFILE="NIGHTLY.COLLECTION;DSN=Oracle_qa2;UID=nightly;PWD=N321ghtly;" /PROCESS=ODBC /FIELD=(IDNUMBER, TYPE=ASCII, POSITION=1, SEPARATOR="\t", EXT_FIELD="IDNUMBER") /FIELD=(NAME, TYPE=ASCII, POSITION=2, SEPARATOR="\t", EXT_FIELD="NAME") /FIELD=(PHONE, TYPE=ASCII, POSITION=3, SEPARATOR="\t", EXT_FIELD="PHONE") /FIELD=(MASK_CREDIT_CARD1, TYPE=ASCII, POSITION=4, SEPARATOR="\t", EXT_FIELD="CREDIT_CARD1") /FIELD=(MASK_CREDIT_CARD2, TYPE=ASCII, POSITION=5, SEPARATOR="\t", EXT_FIELD="CREDIT_CARD2") /FIELD=(CITY, TYPE=ASCII, POSITION=6, SEPARATOR="\t", EXT_FIELD="CITY") /FIELD=(STATE, TYPE=ASCII, POSITION=7, SEPARATOR="\t", EXT_FIELD="STATE") /FIELD=(PASTDUE, TYPE=NUMERIC, POSITION=8, SEPARATOR="\t", EXT_FIELD="PASTDUE") /FIELD=(CURRENTDUE, TYPE=NUMERIC, POSITION=9, SEPARATOR="\t", EXT_FIELD="CURRENTDUE") /FIELD=(BALANCE, TYPE=NUMERIC, POSITION=10, SEPARATOR="\t", EXT_FIELD="BALANCE")
When the batch file is run, the two job scripts are executed in sequence. Output data from the PatientInfoProtect.scl script is one of the sources for the join script collections.scl. The output of the second script is loaded into a table called COLLECTION.
This table will not have four of the original columns in the PATIENT_RECORD table. Additionally, the two credit card columns have all but the last four digits masked. Also, only accounts with PASTDUE amounts greater than zero go into the COLLECTION table.
If you have any questions or need help building your ETL flows in Voracity, contact voracity@iri.com.