Finding PII Using Data Matchers
As we learned from this article on Data Classification in IRI Workbench (as of DarkShield V5), the types of PII, or classes of sensitive data, that you define should be associated with one or more Search Matchers used during data discovery to find those values accurately. This article covers the use of Data Matchers, which examine the content of the data itself at search time.
Currently, Search Matchers can be divided into two sub-categories: Location Matchers and Data Matchers. Location Matchers apply strictly to structured and semi-structured data, and use the structure of data sources to locate and classify data. Data Matchers, on the other hand, directly inspect the content of the data to determine whether values match the specified search attributes of the data class.
Unlike Location Matchers, Data Matchers can be used for matching against structured, semi-structured, and unstructured data. Data Matchers are very useful when PII can be found in free-floating text. This includes, but is not limited to, free text, Word documents, PDFs, images, and PowerPoint sources, either in standalone files or embedded in database collections.
The Different Types of Data Matchers
IRI Workbench currently supports six different types of Data Matchers:
- Data Pattern Matcher
- Set File Matcher
- Fuzzy Matcher (DarkShield Only)
- NER Matchers:
  - OpenNLP Matcher (DarkShield Only)
  - PyTorch Matcher (DarkShield Only)
  - TensorFlow Matcher (DarkShield Only)
Data Pattern Matcher
The Data Pattern Matcher is one of IRI’s most commonly used Data Matchers. Using a Java Regular Expression (RegEx) pattern, this matcher looks for strings that match a particular format. For example, a RegEx pattern for emails will look for the special character @ between words, and check for a dot (.) followed by more letters (and possibly more dots) representing an email domain.
From the IRI data class rules library (.dcrlib) form editor’s Data Matchers wizard page, a Data Pattern Matcher can take three parameters. The first parameter is the RegEx pattern used to perform matching. A user can either create their own pattern or use one from the list of default patterns.
By default, IRI ships several different patterns with its Workbench, and we frequently update that list. To view or select from the list of preloaded patterns, click the Browse… button next to the Pattern field.
A dialog page will open, allowing the user to choose from a list of regex library files sorted by locality.
Once a selection is made, click OK to continue to the Common Patterns wizard page, which displays the current list of patterns in the Workbench pattern library from the previously selected library file.
On the Pattern Library wizard page, we can select, add, edit, remove, import, or export regex patterns. After finding a pattern of interest, select it and click OK to return to the initial Data Matcher page with the selected regex pattern loaded.
Another option, aside from using a stored pattern, is to create and add a new pattern. This is done by clicking the Create… button next to the Pattern field, which displays a new page called Pattern Editor. The Pattern Editor wizard page is for creating a new pattern or editing previously created patterns.
In the Regex Pattern field, we can provide a regex pattern that will be used to match on data. To verify that the pattern matches data in the expected format, provide some test data in the Test Sample field. From the example above, you can see that the regex pattern provided will match on emails.
Once satisfied with the regex pattern, click OK to return to the initial Data Matchers page with the pattern added.
Moving on, the second parameter, Validator Script, is an optional parameter where the user can upload a validator script used to validate each match. For example, a pattern may match a phone number, credit card, or SSN, but without some way to validate that it is a real number, you may match on false positives.
Currently, only JavaScript-based validator scripts are supported. See this article for more details.
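As an illustration of the validation idea, the classic Luhn checksum can weed out pattern matches that are not real card numbers. The sketch below shows that logic in Python for readability; an actual DarkShield validator script would implement the same check in JavaScript.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum,
    a common sanity check for credit card pattern matches."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:  # too short to be a card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # -> True  (well-known test number)
print(luhn_valid("4111 1111 1111 1112"))  # -> False (fails the checksum)
```

A regex alone would accept both strings above as "card-shaped"; the checksum rejects the second, which is exactly the false-positive filtering a validator script provides.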
Lastly, the third parameter is another optional parameter called Groups. Groups allow matching on named groups within a RegEx pattern match. For example, a pattern that matches “exampleemail@gmail.com” may have a group used to find the domain (which in this case would return “gmail”).
You can add or remove groups in the Groups field of the Data Pattern wizard page. Using groups, you could, for example, preserve the domain name and mask only the preceding username.
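To make the named-group idea concrete, here is a simplified, hypothetical email pattern sketched in Python. Note that the Data Pattern Matcher itself uses Java regex, where named groups are written `(?<name>...)`; Python's `re` module spells them `(?P<name>...)`, but the concept is the same.

```python
import re

# Simplified, hypothetical email pattern with named groups, for illustration only.
# Java regex (as used by the Data Pattern Matcher) writes these as (?<user>...).
EMAIL = re.compile(
    r"(?P<user>[A-Za-z0-9._%+-]+)@(?P<domain>[A-Za-z0-9-]+)(?P<tld>(?:\.[A-Za-z]{2,})+)"
)

def mask_username(text: str) -> str:
    """Mask only the username group, preserving the domain and TLD."""
    return EMAIL.sub(
        lambda m: "*" * len(m.group("user")) + "@" + m.group("domain") + m.group("tld"),
        text,
    )

match = EMAIL.search("exampleemail@gmail.com")
print(match.group("domain"))                            # -> gmail
print(mask_username("Contact exampleemail@gmail.com"))  # -> Contact ************@gmail.com
```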
Set File Matcher
A Set File Matcher, also commonly known as a dictionary lookup, uses a text file containing a list of strings to perform matches against. Each entry must be separated by a new line.
The file can have multiple columns (which must be tab-separated), but only the first column will be used in the context of data matching with the set file data matcher. This type of matcher is easy to use and is also very flexible in its purposes. See this article for more details about set files.
Set File containing addresses
From the IRI data class rules library (.dcrlib) form editor’s Data Matchers wizard page, a Set File Matcher can take four parameters. The first parameter is the path to the set file used for the look-up. By clicking Browse, the user can select a file on the local file system to provide the path to that set file.
As a side note, IRI Workbench ships with a good number of set files, and its repertoire grows frequently. IRI also maintains sets of last names and gender-specific first names popular in more than 40 countries.
Some set files shipped with IRI Workbench
The second parameter provides the option to either match only on a whole word, or to allow matching on parts of a word. Check the Match on Whole Word field for whole-word matching only, or uncheck it to allow partial matches. By default, this field is checked.
The third parameter provides the option to match case-sensitively or case-insensitively. This can be useful when words may not follow normal capitalization conventions. For example, John Smith may be present in text as either John Smith or JOHN SMITH.
By default, the Case Insensitive field is unchecked. To allow case-insensitive matching, check it.
The fourth parameter, provided in the Exclusion field, allows the option to match only words that are not in the set file. By default, this option is set to false.
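The combined effect of these options can be sketched in a few lines of Python. The `token_matches` function below is a hypothetical illustration of the matching logic, not DarkShield's implementation; `entries` stands in for the first column of a loaded set file.

```python
def token_matches(token, entries, whole_word=True, case_insensitive=False, exclusion=False):
    """Check one token against a set of entries (first column of a set file),
    honoring the whole-word, case-sensitivity, and exclusion options.
    Hypothetical sketch of the matching logic, for illustration only."""
    if case_insensitive:
        token = token.lower()
        entries = {e.lower() for e in entries}
    if whole_word:
        hit = token in entries
    else:
        hit = any(e in token for e in entries)  # match on parts of a word
    return hit != exclusion  # exclusion inverts the result

names = {"Smith", "Jones"}
print(token_matches("SMITH", names))                         # -> False (case differs)
print(token_matches("SMITH", names, case_insensitive=True))  # -> True
print(token_matches("Smithson", names, whole_word=False))    # -> True (partial match)
print(token_matches("Doe", names, exclusion=True))           # -> True (not in the set)
```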
Fuzzy Matcher
A Fuzzy Matcher is similar to a set file lookup in that it performs matching against a list of provided words. Fuzzy Matchers differ from Set File Matchers in that they do not look for exact matches, but for close approximate matches using various search algorithms; e.g., John Addams and John Adams would be a match.
From the IRI Library form editor’s Data Matchers wizard page, a Fuzzy Matcher can take up to five parameters:
- set file URL
- maximum distance
- fuzzy search method
- fuzzy search algorithm
- minimum similarity score
A set file URL can be either a local file or internet URL, such as a file in a GitHub repository.
There are two types of fuzzy matching methods supported by DarkShield: score (measures the similarity between the source value and set file value) and distance (a difference calculation). Some fuzzy matching algorithms, by the nature of the algorithm, only support one of the two methods and will use the single method that is supported even if the other method is specified.
DarkShield fuzzy matchers support many different types of fuzzy search algorithms. Different types of algorithms have distinct strengths and weaknesses, as briefly enumerated in the graphic below. For more information on each algorithm, see this project in GitHub.
Comparison of Fuzzy Matching Algorithms
If distance is the search method being used, any candidate with a distance less than or equal to the specified maximum will be considered a match. If score is the search method (and the algorithm supports similarity scoring), any candidate with a score greater than or equal to the specified minimum will be considered a match.
The following fuzzy matching algorithms support similarity scores:
- Normalized Levenshtein
- Jaro-Winkler
- Cosine Similarity
- Sorensen Dice Coefficient
- Ratcliff-Obershelp Pattern Recognition
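To make the two methods concrete, here is a minimal Python sketch of Levenshtein edit distance and its normalized similarity score. This is an illustration of the concepts, not DarkShield's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalized_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein score in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Distance method: match if distance <= the specified maximum (e.g., 2).
print(levenshtein("John Addams", "John Adams"))                      # -> 1
# Score method: match if score >= the specified minimum (e.g., 0.9).
print(round(normalized_similarity("John Addams", "John Adams"), 3))  # -> 0.909
```

With a maximum distance of 2 or a minimum score of 0.9, John Addams and John Adams match under either method in this sketch.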
NER Matchers
Named Entity Recognition via Natural Language Processing
Please refer to this article on training Named Entity Recognition models as a complement to this next section.
OpenNLP Matcher
OpenNLP Matchers support the Apache OpenNLP library. Apache’s OpenNLP library is a machine learning-based toolkit for the natural language processing (NLP) of text. The OpenNLP models DarkShield leverages are called Named Entity Recognition (NER) models.
NER models perform classification based on the context of words in sentences; i.e., they use sentence grammar (natural language) to find entities like people’s names, locations, or organizations. OpenNLP NER models are relatively lightweight and fast, but the tradeoff can be lower search accuracy.
From the IRI Library form editor’s Data Matchers wizard page, an OpenNLP Matcher can take three parameters. All three parameters are optional; if none are provided, default parameters will be passed at job execution.
The first parameter is the model URL: the URL of the NER model used for classification. If no model URL is provided, an English NER model will be used by default. To provide a URL, either type inside the text box of the Model URL field, or click Browse to select a model file (which will populate the form with the model file URL).
The second parameter is a sentence detector URL. The sentence detector is used in conjunction with NER tasks to split strings of text into individual sentences to be processed.
This is another optional parameter, and again if none is provided, an English sentence detector is used by default.
To provide a URL, either type inside the text box of the Sentence Detector field, or click Browse to select a sentence detector from the file system. This will in turn fill the form field with the file URL for the sentence detector binary file.
The third parameter is the tokenizer URL. The tokenizer splits a sentence into smaller parts that provide meaning. This is another optional parameter; if none is provided, the tokenizer bundled with the model will be used instead.
To provide a URL, either type inside the text box of the Tokenizer field, or click Browse to select a tokenizer from the file system. This will in turn fill the form field with the file URL for the tokenizer binary file.
PyTorch and TensorFlow Matchers
PyTorch and TensorFlow Matchers are machine-learning-based NLP models that perform NER classification on text. PyTorch and TensorFlow use different underlying frameworks for their models but both framework types are accessible on the Hugging Face cloud model repository.
Hugging Face is a community-driven hub that hosts repositories of open-source models.
Currently, PyTorch and TensorFlow are DarkShield-only Search Matcher types.
Compared to OpenNLP, these models are a heavier download and take a longer time to perform classification. In exchange, these models are far more accurate than OpenNLP models.
On the IRI Library form editor’s Data Matchers wizard page, both the PyTorch and TensorFlow Matcher can take up to four parameters. The parameters for the PyTorch and TensorFlow Matchers are exactly the same. As such, the same concepts will apply when creating either matcher.
The first parameter is the model URL. As with the OpenNLP Matcher, if nothing is passed as a model URL parameter, an English NER model will be used by default.
To provide a URL either type inside the text box of the Model URL field or click Browse to select from the file system the directory containing the model that will be used.
The second parameter is the tokenizer URL. The tokenizer splits a sentence into smaller parts that provide meaning. This is another optional parameter; if none is provided, the tokenizer bundled with the model will be used instead.
To provide a URL either type inside the text box of the Tokenizer field or click Browse to select from the file system the directory containing the tokenizer that will be used.
The third parameter is the list of entity labels that may be used during classification. By default, all entity labels available to a model will be used if none are passed as a parameter in the wizard page. Entity labels dictate what groupings will be used during the classification process and may vary depending on the model.
For example, there may be a model that accepts four labels:
- PER (names)
- LOC (places or addresses)
- ORG (organizations)
- MISC (everything else)
Thus if you only wanted to find names and organizations, the list of entity labels would include PER and ORG. As previously mentioned, the entity labels that are allowed vary from model to model.
To identify the entity labels accepted by a model, check out the id2label list residing inside the config.json file inside that model’s directory.
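For example, the accepted labels can be derived from id2label with a few lines of Python. The config excerpt below is hypothetical but follows the usual Hugging Face layout, where B-/I- prefixes mark the beginning and inside of an entity chunk.

```python
import json

# Hypothetical excerpt of a model's config.json; the real file lives in the
# model's directory alongside its weights.
config_text = """
{
  "id2label": {
    "0": "O",
    "1": "B-PER", "2": "I-PER",
    "3": "B-ORG", "4": "I-ORG",
    "5": "B-LOC", "6": "I-LOC",
    "7": "B-MISC", "8": "I-MISC"
  }
}
"""

config = json.loads(config_text)
# Strip the B-/I- chunk prefixes and drop the "O" (outside-any-entity) tag.
labels = sorted({tag.split("-")[-1] for tag in config["id2label"].values() if tag != "O"})
print(labels)  # -> ['LOC', 'MISC', 'ORG', 'PER']
```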
To add a label to the list of labels that will be passed as a parameter click the Add button on the Entity Labels form field. Click Remove to remove a label from the list.
The fourth parameter is the aggregation strategy that will be used by the model. This is a way to fuse (or not) tokens based on the model prediction. A model can either use a strategy of none, simple, first, average, or max. For more information on aggregation strategy, see this documentation.
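The idea behind aggregation can be illustrated with a loose Python sketch of the "simple" strategy, which fuses consecutive tokens that share an entity group into one span. This is an approximation for illustration only, not the library's exact logic; the sample tokens and groups below are hypothetical.

```python
def aggregate_simple(tagged):
    """Fuse consecutive (token, entity_group) pairs with the same group into one
    span. Rough sketch of the 'simple' aggregation strategy, illustration only."""
    spans = []
    for token, group in tagged:
        if token.startswith("##") and spans and spans[-1][1] == group:
            spans[-1] = (spans[-1][0] + token[2:], group)    # glue subword pieces
        elif spans and spans[-1][1] == group and group != "O":
            spans[-1] = (spans[-1][0] + " " + token, group)  # extend the entity span
        else:
            spans.append((token, group))
    return spans

tokens = [("John", "PER"), ("Smith", "PER"), ("works", "O"), ("at", "O"),
          ("Hug", "ORG"), ("##ging", "ORG"), ("Face", "ORG")]
print(aggregate_simple(tokens))
# -> [('John Smith', 'PER'), ('works', 'O'), ('at', 'O'), ('Hugging Face', 'ORG')]
```

In the Hugging Face transformers library, this behavior is selected with the `aggregation_strategy` argument of the token-classification pipeline.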
In Closing
Whether the data you need to find are in free-floating text, images, or documents, these Data Matchers deliver the freedom to match on data itself, along with the needed flexibility in the search-matching process. If you have any questions or need help with these concepts, please email info@iri.com.