Consistent, Self-Updating and Secure Pseudonymization
What is Pseudonymization?
Pseudonymization is a data masking method that replaces one or more original source values in a table column, a file, or free-floating text with another, usually consistent, “synthetic” value. It is used most often to anonymize sensitive values like names, and complies with the GDPR when the pseudonyms cannot be reversed.
IRI data masking tools like FieldShield and DarkShield use lookup values that may or may not be reversible. For the pseudonyms to be consistent and/or reversible, however, the original values must be paired with a list of replacement values in a two-column set file called a pseudonym replacement set (or pseudo set) file.
The pseudo set file is used to consistently replace a source value with a replacement value whenever the source value is the same as a value in the lookup column of the pseudonym replacement set file. This is how consistent pseudonymization replacement occurs.
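The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how FieldShield or DarkShield implement it; the names in `PSEUDO_SET` and the `pseudonymize` helper are hypothetical.

```python
# Hypothetical two-column pseudo set: lookup value -> replacement value.
PSEUDO_SET = {
    "Smith": "Archer",
    "Jones": "Baker",
    "Nguyen": "Carver",
}

def pseudonymize(value, pseudo_set, default=""):
    """Return the replacement paired with `value`, or `default` if no lookup row matches."""
    return pseudo_set.get(value, default)

source_column = ["Smith", "Jones", "Smith", "Nguyen"]
masked_column = [pseudonymize(v, PSEUDO_SET) for v in source_column]
# Both occurrences of "Smith" receive the same pseudonym, "Archer".
```

Because the replacement depends only on the source value and the set file, the same input always yields the same pseudonym across rows, tables, and runs.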
Limitations of Normal Pseudonymization
As stated previously, consistent replacement requires a lookup list. This reliance on a lookup list limits pseudonymization: every unique name in your source table or file must also be present in the lookup list, or no match will be found for it.
When no match is found, a default or empty string is returned as the replacement value. The problem is that several distinct source values then map to the same default replacement value.
For this reason, it is suggested that you frequently update the pseudonym replacement set file if you have a column of names that frequently receives new entries.
Failure to update lookup sets when LAST_NAME column receives new entries.
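To make the limitation concrete, the following Python sketch shows what happens when a column gains names that the set file does not contain. The data, the empty-string `DEFAULT`, and the variable names are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical set file contents, built before the newest rows arrived.
pseudo_set = {"Smith": "Archer", "Jones": "Baker"}

# LAST_NAME values, including names added after the set file was built.
last_names = ["Smith", "Patel", "Jones", "Okafor", "Patel"]

DEFAULT = ""  # assumed value returned when no lookup row matches

masked = [pseudo_set.get(name, DEFAULT) for name in last_names]

# Distinct source values that all collapsed to the default replacement.
unmatched = sorted({name for name in last_names if name not in pseudo_set})

# Number of output rows that carry the default instead of a pseudonym.
default_rows = Counter(masked)[DEFAULT]
```

Here two different people, Patel and Okafor, end up with the same empty replacement, and the `unmatched` list is exactly what would have to be added to the set file by hand.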
Pseudonym Hash Replacement Rule
Aware of the shortcomings of normal pseudonym replacement, IRI FieldShield now supports a new, more flexible and secure form of pseudonym replacement: the Pseudonym Hash Replacement Rule. This rule also uses a lookup list to find exact matches, and when a match occurs it replaces the value with the paired entry from a list of replacement values.
However, unlike normal pseudo replacement, the new rule uses hashed values as the lookup values. These lookup values are used to either find exact matches or the next closest match.
The Pseudonym Hash Replacement Rule is considered flexible because it can consistently map multiple values to a single replacement value. This many-to-one relationship frees users from having to continuously update and maintain their pseudonym replacement set files whenever more unique entries are added to columns in their source tables or files.
The new rule also increases security. Normally, a pseudonym replacement set file contains a lookup column that is matched against values in tables or flat files. This means that the lookup list may contain real names.
In certain situations, it may be necessary to secure the set file to prevent PII from leaking out of the lookup list. A pseudonym replacement set file with hashed lookup values is more secure because any potential PII has already been hashed.
Also, unlike encryption, which supports decryption (reversal of encryption), hashing is typically irreversible. The original value can be determined only under certain circumstances, such as when the input is short or predictable enough that hashes of candidate strings can be computed and compared against the hashed values.
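The caveat about short or guessable inputs can be illustrated with Python's standard `hashlib` module. This is a minimal sketch of the comparison attack, assuming an unsalted SHA-256 hash; the helper names and the candidate list are hypothetical.

```python
import hashlib

def sha256_hex(text):
    """Hex digest of the UTF-8 encoding of `text`."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def guess_original(target_hash, candidates):
    """Return the first candidate whose hash equals `target_hash`, else None."""
    for candidate in candidates:
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None

# A hashed lookup value, as it might appear in a hashed set file.
leaked_hash = sha256_hex("Ann")

# An attacker who can enumerate likely names can test each one.
common_names = ["Bob", "Ann", "Eve"]
recovered = guess_original(leaked_hash, common_names)
```

The hash itself is never inverted; the value is recovered only because it appears in the list of guesses. A value outside that list stays unknown.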
How Hash Lookup and Replacements Occur
Before the original source value of a table or flat file is matched against the hashed lookup values in the lookup list of a pseudonym replacement set file, the original source value must first be preprocessed.
Using the same type of hashing function applied to the lookup values of the pseudonym replacement set file, the original source value is hashed. Then, the new hashed value is used to search for a match in the lookup list.
When the SEARCH parameter is set to LE (less than or equal) or GE (greater than or equal) in the IRI data masking program, source values that have no exact match in the lookup list can still be handled (pseudonymized) using the next closest lookup value.
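The hash-then-search process can be approximated in Python with `hashlib` and `bisect`. This is a sketch under stated assumptions, not IRI's implementation: it assumes SHA-256 hex digests compared as strings, that LE picks the greatest lookup value less than or equal to the hashed source value, that GE picks the smallest lookup value greater than or equal to it, and that a default is returned when the hash falls outside the list. All function names are hypothetical.

```python
import bisect
import hashlib

def hash_hex(text):
    """Hash a value the same way the lookup column was hashed (SHA-256 assumed)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def build_hashed_set(names, replacements):
    """Pair each hashed lookup value with its replacement, sorted by hash."""
    return sorted(zip((hash_hex(n) for n in names), replacements))

def hash_lookup(value, hashed_set, search="LE", default=None):
    """Hash `value`, then find an exact or next closest lookup entry."""
    keys = [key for key, _ in hashed_set]
    digest = hash_hex(value)
    if search == "LE":
        index = bisect.bisect_right(keys, digest) - 1  # greatest key <= digest
    elif search == "GE":
        index = bisect.bisect_left(keys, digest)       # smallest key >= digest
    else:
        raise ValueError("search must be 'LE' or 'GE'")
    if 0 <= index < len(keys):
        return hashed_set[index][1]
    return default  # assumed behavior when no entry lies in that direction

NAMES = ["Smith", "Jones", "Nguyen"]
REPLACEMENTS = ["Archer", "Baker", "Carver"]
HASHED_SET = build_hashed_set(NAMES, REPLACEMENTS)

known = hash_lookup("Smith", HASHED_SET, "LE")   # exact match on the hash
unseen = hash_lookup("Patel", HASHED_SET, "LE")  # nearest entry at or below the hash
```

A name that was never added to the set file still hashes to the same digest every time, so it lands on the same neighboring entry and receives the same pseudonym on every run.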
Example One
Example Two
How to Create a Pseudonym Hash Replacement Rule
To create a Pseudonym Hash Replacement Rule, we can use the new wizard in IRI Workbench.
Steps:
1. In IRI Workbench, click the down arrow next to the blue icon designated as the IRI Menu. After the dropdown menu expands, click New Rule.
2. From the new window, select Field Rules and click Next.
3. Select Automatic Replacement for New Original Values from the displayed tree view inside the Pseudonym Replacement directory. Then provide the project’s location and desired name for the new rule.
Once finished, click Next to open a wizard that will be used to generate the new rule.
4. In the newly opened wizard, the user can now select the options that will be used in the new rule: a hash function, the search type, and a two-column pseudonym replacement set file that has hashed values for the lookup list.
IRI currently supports three different hash functions: MD5, SHA1, and SHA2. It is important that the hash function chosen for the Pseudonym Hash Replacement Rule is the same hash function used on the lookup values in the hashed pseudonym replacement set file.
A search type must also be specified. Depending on the selection, the replacement value will vary when an exact match in the lookup column does not exist.
5. Like a normal Pseudonym Replacement Rule, the Pseudonym Hash Replacement Rule relies on a two-column set file for the lookup values list (which needs to be in a hashed format for the Pseudonym Hash Replacement Rule) and the replacement values list.
If the set file has already been created, click the Browse button to select the file. If a file has not been created yet, click Create.
In this case, the Pseudonym Hash Set File Creation Wizard will open and allow you to create the necessary set file. At completion, a path to the set file will be provided.
For a walkthrough on how to use the Pseudonym Hash Set File Creation Wizard, follow steps #3-13 in this article.
6. If the user is satisfied with their selections, they can then click Finish to generate the Pseudonym Hash Replacement Rule in a Data Class Rule Library of their selected project.
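For readers who want a feel for what a hashed two-column set file contains, the Python sketch below builds one from plain name pairs. It is an assumption-laden illustration, not a substitute for the Pseudonym Hash Set File Creation Wizard: the tab delimiter, lowercase hex digests, sort order, and the mapping of "SHA2" to SHA-256 are all guesses, and the function names are hypothetical.

```python
import hashlib

# Assumed mapping from the names used in this article to hashlib algorithms.
HASHES = {"MD5": "md5", "SHA1": "sha1", "SHA2": "sha256"}

def hashed_set_lines(pairs, hash_name="SHA2"):
    """Return 'hashed_lookup<TAB>replacement' lines, sorted by hashed lookup value."""
    algorithm = HASHES[hash_name]
    rows = []
    for original, replacement in pairs:
        digest = hashlib.new(algorithm, original.encode("utf-8")).hexdigest()
        rows.append((digest, replacement))
    rows.sort()
    return [f"{digest}\t{replacement}" for digest, replacement in rows]

def write_hashed_set(path, pairs, hash_name="SHA2"):
    """Write the hashed set lines to `path`, one row per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(hashed_set_lines(pairs, hash_name)) + "\n")

pairs = [("Smith", "Archer"), ("Jones", "Baker")]
lines = hashed_set_lines(pairs, "SHA2")
```

The original names never appear in the output, and the same `hash_name` must be used later when source values are hashed for lookup, which mirrors the requirement in step 4.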
Results
A Pseudonym Hash Replacement Rule is a special type of rule called a Linked Chain Rule: a rule that has two or more rules linked together. The individual links of a Linked Chain Rule can be located in separate sections of a job script.
In the case of the Pseudonym Hash Replacement Rule, one link in the chain applies a hashing function in the /INREC section of a script, and another applies a Pseudonym Replacement Rule in the /OUTFILE section of the script.
The Pseudonym Hash Replacement Rule creates a two-part rule that is chained together.
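Conceptually, the two links behave like two functions applied in sequence to each record. The Python sketch below mirrors that shape only; it is not SortCL syntax, it handles exact matches only, and the intermediate field name `FIRST_NAME_LINK_1`, the SHA-256 choice, and both function names are hypothetical.

```python
import hashlib

def inrec_stage(record):
    """Link 1: add a hashed copy of FIRST_NAME as an intermediate field."""
    out = dict(record)
    out["FIRST_NAME_LINK_1"] = hashlib.sha256(
        record["FIRST_NAME"].encode("utf-8")
    ).hexdigest()
    return out

def outfile_stage(record, hashed_set, default=""):
    """Link 2: replace FIRST_NAME via the hashed field, then drop that field."""
    out = dict(record)
    digest = out.pop("FIRST_NAME_LINK_1")
    out["FIRST_NAME"] = hashed_set.get(digest, default)
    return out

# Hypothetical hashed set: hashed lookup value -> replacement value.
hashed_set = {hashlib.sha256(b"Alice").hexdigest(): "Dana"}

record = {"ID": 1, "FIRST_NAME": "Alice"}
masked = outfile_stage(inrec_stage(record), hashed_set)
```

Splitting the work this way keeps the hashing step and the replacement step independent, which is why the links can sit in different sections of a job script.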
Once the rule is created, it can then be applied like any other rule via the SortCL job script editor in IRI Workbench or through a FieldShield masking job.
A FieldShield job script has been generated.
In the image above, there is a FieldShield job script that has been generated with the Pseudonym Hash Replacement Rule applied to a field called FIRST_NAME. Because the Pseudonym Hash Replacement Rule consists of multiple rules linked together, they modify fields in both the /INREC and the /OUTFILE sections.
It is easy to determine which rules are links in a Linked Chain Rule by looking for the keyword LINK_{NUMBER} in the field names. If you have any questions about this, please email fieldshield@iri.com.
In Closing
With this new process of producing pseudonym replacements, users are no longer burdened with the continual maintenance and expansion of their pseudonym replacement set files.
Furthermore, unlike in a normal pseudonym replacement file, where the lookup list could display sensitive values, a lookup list with hashed values prevents anyone from gleaning meaningful data from the lookup list.