Sharing IRI Data Management Jobs via Git
Editor’s Note: This article updates IRI’s original series on managing metadata assets in Git, and covers project export and import, particularly for use in enterprise data anonymization and test data management scenarios. A future article will cover the use and integration of Git Large File Storage (LFS) for provisioning big Voracity job data targets like CoSort-wrangled subsets or RowGen test files.
Abstract
Maintaining consistency and control of data processing jobs is essential in multi-application, multi-user production and development environments that rely on data and referential integrity, golden copies of master data and test data, and the reusability of complex tasks. This article describes how IRI Voracity data management, protection, and prototyping projects – and the common data class and function rule artifacts used in multi-source data masking and synthesis jobs – can be shared with enterprise users through Git. Git is packaged and integrated in IRI Workbench, the graphical job design IDE built on Eclipse, where it can submit, store, and provide these artifacts securely to manage their access and change.
Introduction
IRI software is architected to run in distributed fashion on a client-server network, where most jobs are designed in IRI Workbench and executed by the backend IRI executable called SortCL. Unlike a web application or cloud-based SaaS, Workbench and SortCL can run anywhere on your network, private cloud, or public cloud infrastructure (which you, not we, manage). SortCL can also run on the same system (Linux, Windows, or macOS) that runs Workbench.
Beyond this flexibility, and the independent operation of IRI client and server components, comes the security of maintaining full control of the software, your data, and your infrastructure. This is an especially valuable paradigm in the data breach and privacy law compliance era, and one that a SaaS solution cannot provide. Similarly, no affordable web application can match the data discovery, integration, migration, governance, and analytic functional breadth of Voracity.
That said, cloud-based SaaS and/or web applications offer remote, centralized access to the job design and asset environment, lending an (at least perceived) level of convenience in configuring jobs and hardware resources. The question then becomes, “How can I have the best of those worlds with a software platform like Voracity, where my development and production jobs are distributed in my own domain(s) for direct control, wider capability, and regulatory compliance?”
SaaS / Web-like Deployment without the Risk / Costs
Given the considerations and question above, IRI has these suggestions:
1. Operate on-premise or in your private/public cloud infrastructure, and/or on one or more VMs, where Workbench and SortCL may even be collocated. Log in remotely/centrally to IRI Workbench workspace(s) via VNC, RDP, or web browser (virtual desktop) for web-app-like convenience but with an encrypted connection to a more private node;
Operating IRI Workbench (Voracity IDE) via RDP in Chrome Browser
2. License Voracity like a SaaS, via subscription, to leverage OpEx pricing, multi-node deployment, multi-discipline capabilities, and included technical support … but without CSI-driven source, volume, core, or I/O metering or SSO security/compliance issues;
3. Share projects and job artifacts securely through Git, whether you operate in the cloud (see above) or on-premise, following the steps below. By submitting these artifacts to an online or LAN Git repository integrated directly with your Workbench, you can assign access rights, and you (or others) can later download job assets for (re)use.
IRI Workbench – Git Use Prerequisites
Whether you are a project creator, a project user, or both, you must connect your IRI Workbench instance to a remote Git repository (one you created or to which you have access). To use Git, you also need a local Git repository in your file system with which the remote repository (e.g., GitHub) can interact.
If you are using SSH with your remote Git repository, make sure that the SSH key is accessible (to add) to the Workbench (see the steps on adding an SSH key further below). Once you have completed these tasks, connecting Workbench to Git is very simple.
Before going over the steps to connect Git to the Workbench, it is important to understand what a workspace is. Workbench uses workspaces to store information such as projects, files, and even database connections. This means you can have several workspaces that are completely different from one another but use one Workbench (which is the actual application).
By way of further clarification, your local Git repository is just a directory in your file system. A workspace is the same thing; it too is just a directory where projects are stored. If you want to create projects and have them ready to push to the remote repository then you need to use the local repository.
If you use the local repository as your workspace, your IRI Workbench projects will already be in the local repository, so it is easier to push (or pull) changes through Git. If you do not use the local repository as the workspace location, you will have to move those projects into the local repository in order to share them through Git. That option is more prone to errors caused by full file path conflicts; see the recommendations below.
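Because both a workspace and a local repository are plain directories, it is easy to check from outside Workbench whether a given workspace folder already sits inside a Git repository. The following is a minimal Python sketch of that check; the function name `find_git_root` and the example path are hypothetical and not part of IRI Workbench or Git.

```python
from pathlib import Path
from typing import Optional


def find_git_root(start: str) -> Optional[Path]:
    """Return the nearest enclosing directory that holds a .git entry, or None."""
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        # .git is a directory in an ordinary clone and a file in linked worktrees
        if (candidate / ".git").exists():
            return candidate
    return None


if __name__ == "__main__":
    # Hypothetical workspace path; replace with your own workspace folder.
    root = find_git_root("C:/git/iri-jobs/MaskingProject")
    print(root if root else "This folder is not inside a local Git repository")
```

If the function returns `None` for your workspace folder, projects created there will not be visible to Git until they are moved into the local repository.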
Recommendations
We recommend that you connect your Workbench to Git before creating or using any projects. If you create projects first and connect Workbench to Git afterward, rules or job scripts that reference a file by its full path can break when the project (and thus the file) moves to a new location.
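If projects already exist, it can help to look for full paths before moving anything. The sketch below scans a project folder for absolute Windows or POSIX paths in job files. It is only a heuristic under stated assumptions: the function name, the default extension list (`.scl`, `.fcl`, `.ddf`), and the regular expression are illustrative choices, not an IRI utility. The POSIX pattern requires at least two path segments so that slash-prefixed statement keywords are not mistaken for paths, which also means a single-segment path such as `/file.dat` is not reported.

```python
import re
from pathlib import Path

# Heuristic pattern (an assumption, not an IRI rule) for absolute paths.
ABSOLUTE_PATH = re.compile(
    r"""\b[A-Za-z]:[\\/][^\s"'(),]*"""           # Windows drive path, e.g. C:\data\in.csv
    r"""|(?<![\w/.:\\])/[\w.-]+(?:/[\w.-]+)+"""  # POSIX path with two or more segments
)


def find_absolute_paths(project_dir, extensions=(".scl", ".fcl", ".ddf")):
    """Return (relative file, line number, path) for each absolute path found."""
    root = Path(project_dir)
    hits = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            text = path.read_text(encoding="utf-8", errors="replace")
            for line_no, line in enumerate(text.splitlines(), start=1):
                for match in ABSOLUTE_PATH.finditer(line):
                    hits.append((path.relative_to(root).as_posix(), line_no, match.group(0)))
    return hits


if __name__ == "__main__":
    # Hypothetical project folder; replace with a real project path.
    for file_name, line_no, found in find_absolute_paths("MaskingProject"):
        print(f"{file_name}:{line_no}: {found}")
```

Any path reported this way is a candidate for breaking after the project is moved into the local repository, and is worth reviewing by hand.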
We also recommend that you not use the default workspace location for your local Git repository. The default location is inside the Workbench folder (shown here at right), which can cause problems if you ever need to delete Workbench but want to keep the workspace that contains all your projects.
New Workspace
If you have not created any projects and are just starting to use Workbench, the easiest option is to change the default workspace to your local Git repository. There are two options to change your workspace.
In the first option, when you first start Workbench, a dialog will appear asking you which workspace you would like to use. On the right side of the dialog, there is a browse button that will allow you to select a different folder with your workspace. Navigate to your local repository, click Select Folder, and click Launch to open Workbench using that workspace.
The second option opens inside Workbench. From the File menu, select Switch Workspace (the third option from the bottom). Any workspace you have used with Workbench will appear as an option. To open another workspace for the first time, select the option Other…
After selecting Other… the same dialog from the first option will appear, allowing you to select another workspace. Now that Workbench is using your local repository as the workspace, any projects and files you create will be available to push to your remote repository.
Adding an SSH Key
IRI Workbench uses SSH keys to connect to the repository and adhere to the permissions set by its owner (see the note at the end of this article). The owner of the remote repository can make the repository public for anyone to access, or make it private and control who has read or read-write permissions for the whole repository or specific projects or files. If you do not have an SSH key to access the repository, you will not be able to connect, let alone make changes to the remote repository.
Since data classes and their rules are saved to a file, the owner can give users read-only permissions to these files so anyone in the team can use the files for their jobs but they can’t push changes to the repository.
Of course, those with read-write permissions can use the data classes and rules, make changes, and share those changes with the team that is using the remote repository.
For Workbench to adhere to the permissions, you need to add the location of the .ssh folder and the SSH key that relates to the repository. Inside Workbench, click the Window menu at the top and select Preferences.
The Preferences dialog will open. In the search text box, type SSH. This filters the options down to SSH2, which is where you need to add the keys.
To designate the SSH2 home, browse to your .ssh folder, which is located by default in your user folder. Once you select that folder, add the private key used for the repository. Next to the Private keys label there is a button called Add Private Key… that will browse your .ssh folder for the keys that are available. Once you have selected the private key, click Apply and Close to save the changes to Workbench.
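Before opening the Preferences dialog, you may want to confirm which private keys actually exist in your .ssh folder. This minimal Python sketch lists files that have a matching `.pub` public key beside them, which is a common but not universal naming convention; the function name `list_private_keys` is hypothetical and the pairing rule is an assumption.

```python
from pathlib import Path


def list_private_keys(ssh_dir):
    """Return names of files in ssh_dir that have a sibling '<name>.pub' file."""
    ssh_path = Path(ssh_dir)
    keys = []
    for public_key in sorted(ssh_path.glob("*.pub")):
        private_key = public_key.with_suffix("")  # id_ed25519.pub -> id_ed25519
        if private_key.is_file():
            keys.append(private_key.name)
    return keys


if __name__ == "__main__":
    # Default per-user location; adjust if your keys live elsewhere.
    print(list_private_keys(Path.home() / ".ssh"))
```

A key that was generated without keeping its public half in the same folder will not be listed, so treat an empty result as a prompt to look manually rather than as proof that no key exists.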
Git Perspective in Workbench
IRI Workbench includes Git functionality so you can run Git tasks without having to use the command line or Git bash. All you need to do is open the Git perspective inside of Workbench.
At the top right corner, click the button called Open Perspective next to the IRI default perspective (see image below).
The Open Perspective dialog will appear and present all the perspectives available with Workbench. Select Git and click Open. You will then have two perspectives available next to the Open Perspective button: the default IRI perspective and the Git perspective. To change from one perspective to another, click on its icon.
Once the Git perspective is open, in the far left section of Workbench there are three options to add a repository. Since you already created a local repository, click the Add an existing local Git repository option.
A dialog called Search and select Git repositories on your local file system will appear asking you to browse for the directory of your local repository. Select your local repository and click OK. At this point, the Search results section shows which Git repository is available in this directory.
Check which Git repository you wish to add and click Add at the bottom of the dialog. This will connect Workbench to your local repository.
Now you can see any changes in the repository and run commands like pull and/or push if you have read and/or write permissions on the remote repository. The image below shows Workbench connected to a Git repository and with some changes in projects that can be pushed to the remote repository.
Previously Created Workspaces
If you already have projects and wish to have them in your repository, you will still need to follow the instructions from the previous section on Git Perspective. After following those steps, your Workbench will be connected to the repository but projects that you already have will not be associated with the repository.
To have your existing projects become part of the repository, you will need to move those projects from where they are stored in the file system and put them into your local repository.
Move Your Projects to Your Local Git Repo
To move existing / previous projects to your local Git repository, select the IRI perspective to see the Project Explorer which contains all of your projects. Right-click any project you wish to move and a menu will appear. Hover over the Team option and a small menu will appear on the right side. Select Share Project… and a dialog called Configure Git Repository will appear.
In that dialog, you will need to specify the repository to which you want to move your project. Since Workbench is already connected to the local repository, click the drop-down menu next to the Repository label, and your local repository should appear as an option.
In the middle of the dialog, you should see your project, its current location, and the target location (where your local repository is). Click Finish to move your project to your local repository.
At this point, your Workbench is connected to your Git repository and you now know how to move your project from your workspace to your local repository. The last thing you need to do is to change the workspace of your Workbench to your local repository.
If you create new projects without changing the workspace to your local repository, they will not appear in your repository, and you will have to import them manually.
Before changing your workspace, ensure that all the projects you want are imported into your local repository. Once that is done, follow the second option in the New Workspace section above, which explains how to switch from one workspace to another.
Now you should have your repository connected to Workbench, your projects imported into your repository, and your local repository set as your current workspace.
Push (Upload) Projects/Changes to the Shared Repository (Hub)
Once you have created new projects or edited existing ones, you can push those changes up to the remote Git repository (e.g., GitHub) if you have write permissions for the remote repository. Go to the Git perspective inside Workbench and select the Git Staging tab:
Inside the Git staging tab, you can see the unstaged changes, the staged changes, and to the right of them the commit message. Select the green plus sign to move all or one of the files to the staged changes section.
Add a commit message and select Commit and Push to send your changes to the remote repository, or select Commit to record the change in your local repository only.
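For readers who also script Git outside Workbench, the staging, commit, and push steps above correspond roughly to three Git commands. The Python sketch below builds those commands and runs them only when asked; the function name `commit_and_push`, the `dry_run` switch, and the default remote name `origin` are assumptions made for illustration, and nothing here is an IRI API.

```python
import subprocess


def commit_and_push(repo_dir, paths, message, remote="origin", dry_run=True):
    """Build (and optionally run) the Git commands for stage, commit, and push."""
    commands = [
        # Stage the selected files (the green plus sign in Git Staging)
        ["git", "-C", str(repo_dir), "add", "--", *paths],
        # Record the staged changes in the local repository (Commit)
        ["git", "-C", str(repo_dir), "commit", "-m", message],
        # Send the current branch to the remote repository (Push)
        ["git", "-C", str(repo_dir), "push", remote, "HEAD"],
    ]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)  # stop at the first failure
    return commands


if __name__ == "__main__":
    # Hypothetical repository path and file name, shown as a dry run.
    for command in commit_and_push("C:/git/iri-jobs", ["MaskingProject/job.scl"], "Update masking job"):
        print(" ".join(command))
```

Leaving `dry_run=True` prints the commands without touching the repository, which is a safe way to review what would be executed before pushing for real.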
Import Project(s) from a Shared Git Repository (Hub)
To import projects or their artifacts (job scripts, data class mappings, etc.) from a shared Git repository into your workspace, open the Git perspective again. In the far left corner under Git Repositories you will see the Git branch connected to your Workbench. If you do not have a repository showing, see the Git Perspective section of this blog.
Ensure that your local repository is up to date with the remote repository. To do this, right-click on the repository and select Pull from the menu that appears. This will update your local repository with any new or edited projects.
Now, to import the projects, right-click on the repository and a menu will appear. Click on the Import Projects option. A dialog called Import Projects from File System or Archive will appear; it automatically detects the projects in the repository and checks whether they are already in your workspace.
The section in the middle of the dialog will allow you to either import all the projects or to select individual projects to be imported into your workspace. Click Finish and all the projects will be imported and ready to be used by Workbench. If the Finish button is not enabled (grayed out), that means that all the projects that are located in the repository are also in your workspace (so there is nothing to import).
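To preview what the import dialog is likely to offer, you can look for Eclipse project folders in the local repository yourself. Eclipse-based tools mark a project folder with a `.project` file, so the sketch below lists folders containing one and subtracts those already in the workspace. The function names are hypothetical, and the sketch assumes each project's name matches its folder name, which is usual but not guaranteed.

```python
from pathlib import Path


def find_projects(repo_dir):
    """Return folder names under repo_dir that contain an Eclipse .project file."""
    repo = Path(repo_dir)
    names = []
    for marker in repo.rglob(".project"):
        # Skip anything stored inside Git's own metadata directory
        if ".git" not in marker.relative_to(repo).parts:
            names.append(marker.parent.name)
    return sorted(names)


def importable_projects(repo_dir, workspace_projects):
    """Return repository projects that are not yet present in the workspace."""
    existing = set(workspace_projects)
    return [name for name in find_projects(repo_dir) if name not in existing]


if __name__ == "__main__":
    # Hypothetical repository path and workspace contents.
    print(importable_projects("C:/git/iri-jobs", ["MaskingProject"]))
```

An empty result corresponds to the case described above in which the Finish button is grayed out because there is nothing left to import.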
Benefits of Connecting
There are major benefits to connecting Git with IRI Workbench, even if you never plan to share your files and projects with someone else. The first benefit is being able to revert changes. The IRI data class and rule library is now stored in one file (iriLibrary.dcrlib).
This file interacts with several wizards within Workbench, and accidentally deleting it can mean a lot of work recreating custom data classes and rules. With Git, you can simply restore a prior version of the file, or back out errant changes made to the project.
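As an illustration of that recovery, a file that was deleted or damaged after its last commit can be brought back with a single Git command. The Python sketch below only builds that command by default; the function name `restore_file_command`, the `dry_run` switch, and the example paths are hypothetical, and where iriLibrary.dcrlib lives inside your repository is something you would need to supply.

```python
import subprocess


def restore_file_command(repo_dir, relative_path, revision="HEAD", dry_run=True):
    """Build (and optionally run) the Git command that restores one file from a commit."""
    command = ["git", "-C", str(repo_dir), "checkout", revision, "--", relative_path]
    if not dry_run:
        subprocess.run(command, check=True)  # overwrites the working copy of the file
    return command


if __name__ == "__main__":
    # Hypothetical repository location and file path, shown as a dry run.
    print(" ".join(restore_file_command("C:/git/iri-jobs", "MaskingProject/iriLibrary.dcrlib")))
```

Passing an earlier commit ID as `revision` would restore the file as it was at that commit instead of at the latest one.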
Another major benefit is being able to easily share projects, database connections, job scripts, etc. Data governors can define data classes with the rules that are needed, share them with your team, and give certain users permission to read and write to the files or just read them. You’ll be able to see any changes and who committed them and, if authorized, use version control to revert a project to the version that preceded any given commit.
Finally, and just as importantly, when sharing IRI data masking projects using Git, team members anywhere in the world can be sure PII will be found and remediated consistently using the same data classes and masking rules. This consistency is central to maintaining data and referential integrity in either production or test environments across data sources and silos. Reliable multi-silo discovery is also required to comply with GDPR data erasure, portability, and rectification guarantees.
Note: The project owner, or the data governor responsible for the project rules, creates the keys and defines the repository permissions in GitHub. IRI Workbench does not manage those permissions, but must present the SSH key to identify the user so GitHub can determine what someone can (or cannot) do.