Automating IRI Jobs Using File Monitoring: A POC
Manually initiating SortCL-compatible jobs in IRI Voracity ETL, CoSort reporting, FieldShield masking, or NextForm migration scenarios is neither realistic nor productive in environments where source data is added or changed dynamically. By contrast, real-time job automation eliminates the need for manual invocation and ensures the right jobs run in a timely manner.
In this post, I walk through a proof of concept (POC) that automatically executes an existing SortCL-compatible job script based on the name of a file when a new file is created or data is added or changed in an existing file.
Note:
- This article is technical in nature and requires a basic understanding of events and scripting languages.
- IRI has also developed an analogous, log-based solution called Ripcurrent that triggers SortCL jobs upon real-time changes to data in relational database tables; see the series of articles starting here.
File Monitoring Overview
File monitoring is generally used to determine whether a file has been created, changed, renamed or deleted within a directory. The two main approaches for detecting changes within a directory are polling and event-driven notification.
Polling determines directory file changes by periodically getting a list of file information in a directory and comparing it to a list that was cached earlier.
The event-driven approach is where the operating system or subsystem provides event notification to an application when a change in a directory’s content occurs. The event-driven approach is usually more efficient as it doesn’t require constant execution of code to determine something has changed.
This post focuses on the event-driven approach using PowerShell 5.1 and the Microsoft .NET FileSystemWatcher class on the Windows platform. FileSystemWatcher is not the only file monitoring option; other examples are Java's WatchService and Linux's inotify.
Many of the concepts discussed in this post apply regardless of development language or platform. While I haven't tested it, PowerShell is also available for Linux and macOS as part of the open source .NET Core. As you would expect, not all commands available on Windows are available on Linux.
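To make the event-driven approach concrete, here is a minimal sketch of registering for file events with FileSystemWatcher in PowerShell. The watched path is illustrative, and this is a bare-bones example rather than the POC's actual code:

```powershell
# Minimal sketch of the event-driven approach: watch a folder (path is
# illustrative) and report created/changed files as notifications arrive.
$watcher = New-Object System.IO.FileSystemWatcher
$watcher.Path   = 'C:\Watched'
$watcher.Filter = '*.*'
$watcher.EnableRaisingEvents = $true

Register-ObjectEvent -InputObject $watcher -EventName Created -Action {
    Write-Host "Created: $($Event.SourceEventArgs.FullPath)"
} | Out-Null
Register-ObjectEvent -InputObject $watcher -EventName Changed -Action {
    Write-Host "Changed: $($Event.SourceEventArgs.FullPath)"
} | Out-Null

# Block here; the -Action script blocks run as events arrive
while ($true) { Wait-Event -Timeout 1 | Out-Null }
```

Unlike polling, no code runs between notifications; the operating system pushes events to the registered actions.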
POC Conceptual Design
Beyond the automation of SortCL script execution, the conceptual design includes some basic ways to accomplish functional modularity and scalability. A production implementation may result in expanding the number of servers, watchers, and/or folders to meet performance and security requirements.
The conceptual design is depicted below. A high level narrative of the design follows. More details are provided in the implementation section below.
The Data Source, Staging Server, and Job Server represent different logical computers/servers.
- A user or some automated process copies a file to or updates an existing file in the Staging folder on the Staging Server.
- A Watcher running on the Staging Server registers to be notified when any file has been created or changed in the Staging Folder. The Watcher creates a job trigger file in the Trigger folder on the Job Server.
- A Watcher running on the Job Server registers to be notified when a job trigger file has been created or changed in the Trigger folder. The Watcher reads the job trigger file and initiates a SortCL job script to process the file located in the Staging Folder (e.g. sortcl.exe /spec=maskEmployees.scl).
POC Physical Design
The physical design of the POC implements the components depicted in the conceptual design above on a single computer running Windows 10. Separate folders were created to represent the servers and the folders as subfolders.
All components used for the POC are depicted in the diagram below:
In the next section, we will look at how the pieces work together and the associated PowerShell script logic.
Before we do that, let’s look at how the physical design maps to the conceptual design above.
- The Data Source from the conceptual design is represented by the DS folder. Data used to test the POC is contained in the Data folder.
- The Staging Server represented by the SS folder contains the following:
- FileWatchStaging.ps1 file is the PowerShell script that represents the Watcher of the conceptual design.
- fileWatchStaging.config file, in the config subfolder, externalizes the name of the folder to be watched and several timing factors.
- fileActions.csv file, in the config subfolder, contains the location and name of the SortCL script to be executed for each supported file. For the POC, those are the employees.csv and patient_record files.
- Staging folder is the folder the Watcher registers to be notified of file events.
- The Job Server represented by the JS folder contains the following:
- FileWatchTrigger.ps1 file is the PowerShell script that represents the Watcher of the conceptual design.
- fileWatchTrigger.config file, in the config subfolder, externalizes the name of the folder to be watched and several timing factors.
- Trigger folder is the folder that will be watched, and the files written there contain the information needed to initiate a SortCL job script.
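To make those configuration files concrete, here is an illustrative shape for their contents. The actual format ships in the download; only the attribute names timerWaitMil, processWaitSec, and triggerPath come from the POC description, and the rest (field names, paths, the maskPatients.scl script name) are assumptions:

```
# fileWatchStaging.config (illustrative)
watchPath=C:\SS\Staging
triggerPath=C:\JS\Trigger
timerWaitMil=500
processWaitSec=5
```

```
# fileActions.csv (illustrative)
fileName,sclScript
employees.csv,C:\JS\maskEmployees.scl
patient_record,C:\JS\maskPatients.scl
```

Externalizing these values lets you point a watcher at a different folder or tune the timing without editing the scripts.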
As you can see, the design has two file watchers with one being responsible for creating a job trigger for file changes in the Staging folder, and the other watcher for initiating the execution of a SortCL job script for trigger files written to the Trigger folder.
A common requirement of both Watchers is to look for created and changed file events and then take some action. Watching for changed events matters not only because a file may be changed in place, but also because copying a file to a folder typically results in one created event and one or more changed events. In my testing, the number of events depended on the size of the file.
Handling the multiple file events requires the added complexity of introducing a time delay into the design. This ensures that file processing is completed prior to writing the trigger file for subsequent processing.
There are other decisions I made in the implementation to ensure file integrity that may or may not meet your specific needs. These are covered in the next section.
Implementation
As mentioned earlier, the POC was created to demonstrate an approach for automating the execution of a SortCL job script based on a file being copied or updated.
I can’t stress enough that this was a proof of concept and needs enhancements to be ready for a production environment. More on that at the end of the post.
Now let’s walk through the process and code from end-to-end. If you are not a PowerShell developer, the code comments should aid your understanding.
All of the code and configuration files used in the POC can be downloaded here.
Please note that, in order to format the script code for viewing in this post, some of the code is broken across multiple lines. It is therefore recommended to use the files from the download rather than any code included in this post.
- To start the process, the Staging and Trigger PowerShell Watcher scripts need to be started. Since the standard PowerShell execution model processes synchronously, these scripts must be started in separate PowerShell command windows. To accomplish asynchronous processing within the scripts, they can be enhanced to use jobs that run in the background (i.e., Start-Job).
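For reference, a single-console variant using background jobs might look like the sketch below (the paths are illustrative, not the POC's actual layout):

```powershell
# Sketch: run both watcher scripts as background jobs from one console
# (the paths below are illustrative)
Start-Job -Name StagingWatcher -FilePath 'C:\SS\FileWatchStaging.ps1'
Start-Job -Name TriggerWatcher -FilePath 'C:\JS\FileWatchTrigger.ps1'

Get-Job        # verify both jobs show State = Running
# Stop-Job -Name StagingWatcher, TriggerWatcher   # when finished
```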
- When the Watcher scripts are first started, they perform a number of initialization steps in preparation for processing file events. This includes:
- Reading and caching configuration information.
- Reading and caching file action information (Staging Watcher only).
- Creating an instance of Windows File System Watcher class and setting properties.
- Registering for created and changed events.
- Creating an instance of a Timer and setting properties; upon expiration, the timer determines whether a file is ready to process. As noted above, this handles the occurrence of multiple events when a file is copied or written to the directory.
The initialization code is immediately below and is common between the watchers, with three differences: the Trigger Watcher PowerShell script doesn't contain the code to process fileActions.csv, has a different configuration file name, and sets its file filter to *.trigger instead of all files (*.*). The content of the configuration and file actions files can be seen in Figures 7, 8 and 11.
After initialization the script loops waiting for events. When the script is stopped, the finally block cleans things up.
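The initialization, wait loop, and cleanup described above can be sketched as follows. This is a simplified outline, not the POC's actual code; only timerWaitMil and processWaitSec are names from the configuration files, and everything else is assumed:

```powershell
# 1-2. Values the real script reads from its .config and fileActions.csv files
$watchPath      = 'C:\SS\Staging'
$timerWaitMil   = 500
$processWaitSec = 5
$global:fileEvents = @{}     # cache: file path -> time of last event

# 3-4. Create the watcher and register for created and changed events
$watcher = New-Object System.IO.FileSystemWatcher $watchPath
$watcher.EnableRaisingEvents = $true
$onEvent = { $global:fileEvents[$Event.SourceEventArgs.FullPath] = Get-Date }
Register-ObjectEvent $watcher Created -SourceIdentifier WatchCreated -Action $onEvent | Out-Null
Register-ObjectEvent $watcher Changed -SourceIdentifier WatchChanged -Action $onEvent | Out-Null

# 5. Timer whose expiration checks whether any cached file is ready to process
$timer = New-Object System.Timers.Timer $timerWaitMil
$timer.AutoReset = $true
Register-ObjectEvent $timer Elapsed -SourceIdentifier TimerTick -Action {
    # files quiet for longer than processWaitSec are processed here
} | Out-Null
$timer.Start()

try {
    while ($true) { Wait-Event -Timeout 1 | Out-Null }   # loop awaiting events
}
finally {
    # Cleanup when the script is stopped
    Get-EventSubscriber | Unregister-Event
    $timer.Dispose(); $watcher.Dispose()
}
```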
- Once the Watcher Scripts are running, we can copy the employees.csv file from the Data Source Data folder to the Staging folder.
- When the copy process starts, the FileWatch Staging script receives one created event notification and four changed event notifications by the time the copy finishes. The number of changed events will vary depending on the size of the file.
To handle multiple events, the processing is as follows:
- For the created event, information about the file is added to the file event cache. This code gets executed because it was specified as the -action parameter during the registration process for the event handlers in 2 above.
- For each changed event, the event time is updated in the file event cache. This code gets executed because it was specified as the -action parameter during the registration process for the event handlers in 2 above.
- At startup, the FileWatch Staging script sets a timer that expires based on the millisecond setting in its configuration file (timerWaitMil). When the timer expires, the cached file information is checked for any files that have not been updated for more than the number of seconds specified in the configuration file (processWaitSec).
- If the current time minus the last file event exceeds the process wait seconds, the Staging Watcher looks up the file name in the cached actions (Figure 8), creates a job trigger file (Figure 9) in the directory specified by the triggerPath attribute and deletes the file event from the cache.
The job trigger file is named using the file name with a date and job number appended to the end (FileName.yymmdd_jn.trigger where n is the job number).
Each time a file is processed by the script, the job number is incremented. For example, if the patient_record file is the next file copied to the Staging folder, the trigger will be named patient_record.211122_j2.trigger.
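That naming scheme is easy to express in PowerShell; here is a sketch with illustrative variable names:

```powershell
# Build a trigger name of the form FileName.yymmdd_jn.trigger
$fileName  = 'employees.csv'
$jobNumber = 2                          # incremented per processed file
$stamp     = (Get-Date).ToString('yyMMdd')
$triggerName = '{0}.{1}_j{2}.trigger' -f $fileName, $stamp, $jobNumber
```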
File Actions Configuration
Trigger File
Here is the code that performs the logic above:
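As a simplified, self-contained sketch of that timer-expiration logic (variable names, sample values, and the pipe-delimited trigger-file layout are assumptions based on the description above, not the POC's actual code):

```powershell
# Assumed configuration and cached state
$processWaitSec = 5
$triggerPath    = 'C:\JS\Trigger'
$fileEvents  = @{ 'C:\SS\Staging\employees.csv' = (Get-Date).AddSeconds(-10) }
$fileActions = @{ 'employees.csv' = 'C:\JS\maskEmployees.scl' }
$jobNumber   = 1

$now = Get-Date
foreach ($entry in @($fileEvents.GetEnumerator())) {
    # Only process files that have been quiet longer than processWaitSec
    if (($now - $entry.Value).TotalSeconds -gt $processWaitSec) {
        $fileName = Split-Path $entry.Key -Leaf
        if ($fileActions.ContainsKey($fileName)) {
            $jobNumber++
            $triggerName = '{0}.{1}_j{2}.trigger' -f $fileName,
                           $now.ToString('yyMMdd'), $jobNumber
            # The trigger file carries what the Job Server watcher needs
            "$($entry.Key)|$($fileActions[$fileName])" |
                Set-Content -Path (Join-Path $triggerPath $triggerName)
        }
        $fileEvents.Remove($entry.Key)   # done with this cached event
    }
}
```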
- Once the job trigger file is written, the FileWatch Trigger script receives one created event notification and one changed event notification for the job trigger file created above. To handle the multiple events, the processing is as follows:
- For the created event, information about the file is added to the file event cache. The code executed is the same as in 4a.
- For the change event, the event time is updated in the file event cache. The code executed is the same as in 4b.
- At startup, the FileWatch Trigger script sets a timer that expires based on the millisecond setting in its configuration file (timerWaitMil). When the timer expires, the cached file information is checked for any files that have not been updated for more than the number of seconds specified in the configuration file (processWaitSec).
- If the current time minus the last file event exceeds the process wait seconds, the Watcher reads the trigger file and uses the information to create some Windows environment variables and execute the SortCL job script.
- SortCL executes the maskEmployees.scl (Figure 12) script using the environment variables set by the FileWatcher Trigger PowerShell script for jobFullName and jobFileName.
If the file format changes, it is important to update the SortCL script accordingly.
- When the SortCL job script completes, the masked output file has been written to the TestData directory (Figure 13), the job trigger file is deleted, and the FileWatch script writes an entry to the Windows event log (Figure 14). The event log entry provides an audit trail linking the original file to the job and processId to the SortCL job script audit file (Figure 15).
Here is the code that performs the logic above:
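As a simplified sketch of that trigger processing: the environment variable names jobFullName and jobFileName come from the post, while the pipe-delimited trigger-file layout, paths, and the event log source name are assumptions:

```powershell
# Read the trigger file (assumed layout: stagedFilePath|sclScriptPath)
$triggerFile = 'C:\JS\Trigger\employees.csv.211122_j1.trigger'
$fields = (Get-Content $triggerFile -Raw).Trim() -split '\|'

# Environment variables referenced inside the SortCL job script
$env:jobFullName = $fields[0]                    # e.g. C:\SS\Staging\employees.csv
$env:jobFileName = Split-Path $fields[0] -Leaf   # e.g. employees.csv

# Run the SortCL job script named in the trigger file and wait for it
$proc = Start-Process -FilePath 'sortcl.exe' `
        -ArgumentList "/spec=$($fields[1])" -Wait -PassThru

# Delete the trigger and leave an audit trail in the Windows event log
# (the source must have been registered once with New-EventLog)
Remove-Item $triggerFile
Write-EventLog -LogName Application -Source 'FileWatchTrigger' `
    -EventId 1000 -EntryType Information `
    -Message "Processed $($fields[0]) with $($fields[1]) (exit $($proc.ExitCode))"
```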
Multi-File Processing
More than one file can be copied at the same time to the staging folder. The figures below show the contents of the directories during processing. After processing is completed, the job trigger files are deleted.
Files Copied from Data to Staging Directory
Trigger Files Created
Files Created by Running the SortCL Script (Figure 12)
Updated File Processing
If records in a file located in the staging folder are updated, another trigger file will be created and the masking SortCL job script (Figure 12) will run again. As shown in Figure 16, the new trigger file ends with _j3 (job 3) since the _j1 and _j2 triggers were created during the multi-file processing above.
Trigger file created from file change
Considerations for Production Readiness
Again, I can’t stress enough that this post describes a proof of concept that needs enhancement prior to any production use. The following list contains some initial thoughts on enhancements:
- Enhance to meet your performance and security requirements. This could include:
- Expanding the number of servers, watchers and folders to provide more processing parallelism.
- Creating additional folders to simplify security. (e.g. Staging and Output folders for HR, Customer, Claims).
- Add error checking and exception handling, as the POC watcher scripts only covered the happy path.
- Refactor to reduce redundant code. At a minimum, move the common code to a separate file and convert it to functions. If you are familiar with object-oriented development, the Staging Watcher and Trigger Watcher can be streamlined into classes that inherit from a base Watcher class containing the common code. Refactoring will reduce code and lower the cost of maintenance.
- Remove hardcoding for the job counter reset and add it to the staging watcher configuration file.
- Create a script to delete or archive the files.
My last piece of advice is to test, test, and test again. Don’t forget volume and negative testing.
Summary
In this post, I discussed the benefits of automating SortCL job script executions for real-time file system events. I then walked through a conceptual design and implementation that runs these jobs when files are created or updated in a directory being monitored.
The approach demonstrated in this post should give you a good start on creating your own automation effort. Another post will cover real-time SortCL job triggering from database updates.