Finding PII Using Location Matchers
As we’ve learned from this article on Data Classification in IRI Workbench (as of DarkShield V5), the types of PII, or classes of sensitive data, you define should be associated with one or more Search Matchers used during data discovery to locate those values accurately. This article covers Location Search Matchers, which find data based on metadata or structure, not content.
Currently, Search Matchers can be divided into two sub-categories: Location Matchers and Data Matchers. Location Matchers apply strictly to structured and semi-structured data and inspect the structure of that data. Data Matchers, on the other hand, directly inspect the contents of data to determine whether they match the specified search attributes of the data class.
As a general rule, Location Matchers have better performance during matching operations and have better accuracy. The caveat is that Location Matchers are not available when working with unstructured data, since Location Matchers rely on a predefined source structure to find PII.
The Different Types of Location Matchers
IRI Workbench currently supports six different types of Location Matchers:
- Location Pattern Matcher
- List File Matcher
- Range Matcher (DarkShield Only)
- Excel Cell Matcher (DarkShield Only)
- JSON Path Matcher (DarkShield Only)
- XML Path Matcher (DarkShield Only)
Location Pattern Matcher
The Location Pattern Matcher uses Java RegEx patterns to perform matches against the names of columns or other named locations in data sources with some structure.
For example, a pattern matching on a column name containing the keyword “phone” will match on the columns business_phone, personal_phone, and emergency_contact_phone. This matcher is very useful as long as good naming conventions were applied when the tables were created.
On the IRI Library form editor’s Location Matchers wizard page, a Location Pattern Matcher accepts only one parameter: the pattern to match against a column name.
To supply a pattern, fill in the form field called Pattern. Below is an image with an example of a pattern that will match any column name ending with the word “phone”.
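To make the pattern behavior concrete, here is a minimal sketch of name-based matching. It uses Python's re module as a stand-in for Java RegEx (the two agree on patterns this simple), and it assumes the pattern must match the whole column name; that assumption, the match_columns helper, and the column names are illustrative and not taken from IRI Workbench.

```python
import re

# Hypothetical column names used only for illustration.
COLUMNS = [
    "business_phone",
    "personal_phone",
    "emergency_contact_phone",
    "phone_type",
    "email",
]

def match_columns(pattern, columns):
    """Return the column names that the pattern matches in full.

    Assumption: the whole name must match, which is why the patterns
    below use a leading (and optionally trailing) '.*'.
    """
    regex = re.compile(pattern)
    return [name for name in columns if regex.fullmatch(name)]

# Names ending with "phone"
ends_with_phone = match_columns(r".*phone", COLUMNS)

# Names containing "phone" anywhere
contains_phone = match_columns(r".*phone.*", COLUMNS)

print(ends_with_phone)
print(contains_phone)
```

The second pattern also picks up phone_type, which shows why it is worth deciding whether the keyword should be anchored to the end of the name.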
List File Matcher
The List File Matcher allows for matching on multiple columns by utilizing a set file containing a list of column names. If a column name exactly matches any of the provided names, that column is flagged as a match.
From the IRI Library form editor’s Location Matchers wizard page, a List File Matcher accepts only one parameter. The parameter is the path to the set file containing the list of column names.
Example set file containing list of column names.
To select a set file, click the Browse button and select a file from the file explorer. Set files containing column names are expected to have a single column for the column names with no header.
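The exact-match behavior described above can be sketched as follows. This is only an illustration of the idea, not how IRI Workbench implements it; the helper names and sample column names are hypothetical, and an in-memory buffer stands in for a set file on disk.

```python
import io

def load_set_file(handle):
    """Read one column name per line; no header row is expected.

    Blank lines are skipped and surrounding whitespace is trimmed.
    """
    return {line.strip() for line in handle if line.strip()}

def match_listed_columns(columns, names):
    """Keep only the columns whose names appear verbatim in the set."""
    return [column for column in columns if column in names]

# io.StringIO stands in for an opened set file (hypothetical contents).
set_file = io.StringIO("first_name\nlast_name\nssn\n")
names = load_set_file(set_file)

matched = match_listed_columns(["id", "first_name", "ssn", "ssn_hash"], names)
print(matched)
```

Because the comparison is exact, ssn_hash is not matched even though it contains ssn; a Location Pattern Matcher would be the tool for partial-name matches.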
Range Matcher
The Range Matcher is a matcher that allows for a range of columns in a CSV or TSV file to be matched based on the positions of those columns in the table. This matcher can be useful if the positional order of the columns in a table holds some sort of significance. Currently, this search matcher is supported only in DarkShield.
On the IRI Library form editor’s Location Matchers wizard page, the Range Matcher accepts only one parameter. The Range parameter is for matching on columns based on the range specified.
From the form field called Range, three arguments can be provided. The first argument is the filter type. The second argument is the starting index of the range. The third argument is the ending index of the range.
To better understand how the filter type affects the behavior of the range, consider this example:
There are five columns called A, B, C, D, and E. For each column, the index is based on its position or order amongst the other columns; i.e., Column A with an index of 1, Column B with an index of 2, Column C with an index of 3, and so on.
The behavior of each filter type selection is as follows, where the starting index i is 2 and the ending index j is 4:
- [i, j] returns columns B, C, D
- [i, j) returns columns B, C
- (i, j] returns columns C, D
- (i, j) returns C
- ( , j) returns A, B, C
- ( , j] returns A, B, C, D
- [i, ) returns B, C, D, E
- (i, ) returns C, D, E
- ( , ) returns A, B, C, D, E
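The filter types above follow standard interval notation: a square bracket includes the endpoint, a parenthesis excludes it, and a missing index leaves that side unbounded. A minimal sketch of that logic, using a hypothetical select_range function with 1-based indices as in the example, might look like this:

```python
def select_range(columns, start=None, end=None,
                 include_start=True, include_end=True):
    """Select columns by 1-based position using interval semantics.

    start/end of None means that side of the range is unbounded.
    include_start/include_end correspond to '[' / ']' when True
    and '(' / ')' when False.
    """
    selected = []
    for index, name in enumerate(columns, start=1):
        if start is not None:
            if index < start or (index == start and not include_start):
                continue
        if end is not None:
            if index > end or (index == end and not include_end):
                continue
        selected.append(name)
    return selected

COLUMNS = ["A", "B", "C", "D", "E"]

closed = select_range(COLUMNS, 2, 4)                            # [i, j]
half_open = select_range(COLUMNS, 2, 4, include_end=False)      # [i, j)
open_both = select_range(COLUMNS, 2, 4, False, False)           # (i, j)
up_to_j = select_range(COLUMNS, None, 4, include_end=False)     # ( , j)
from_i = select_range(COLUMNS, 2, None)                         # [i, )
everything = select_range(COLUMNS)                              # ( , )

print(closed, half_open, open_both, up_to_j, from_i, everything)
```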
Excel Cell Matcher
The Excel Cell Matcher finds cells in Excel (.xls/.xlsx) spreadsheets by leveraging prior knowledge of the structure of the spreadsheets to be searched. As with the previously discussed Location Matchers, this is the best method for finding PII in Excel when that structure is known. Currently, this is a DarkShield-only Search Matcher type.
On the IRI Library form editor’s Location Matchers wizard page, the Excel Cell Matcher will accept six parameters:
The first parameter is Match Row. This should only be checked if matching on headers in a spreadsheet where the header starts from the left and data follows to the right, rather than the typical layout of headers at the top and data below.
The second parameter is Include Matched Cell. If checked, the header cell will be included as part of a match. If not, the header cell will be excluded. The default behavior is to exclude the header cells from a match.
The third parameter is the Cell Address Pattern. This parameter uses a pattern to match on cell addresses within a .xls or .xlsx format spreadsheet. To provide a pattern, either click Browse to choose from an existing pattern, or click Create to define a new pattern.
The fourth parameter is the Cell Value Pattern. This parameter uses a pattern to match on the entire content of the cell. This is useful for matching on header cells. All values below the header cell (or to the right of the header cell if the Match Row parameter is enabled) will be treated as part of the match. To provide a pattern, either click the Browse button to choose from an existing pattern or click the Create button to define a new pattern.
The fifth parameter is the Sheet Name Pattern. This parameter uses a pattern to match on the sheet name. As with the above parameters, either click the Browse button to choose from an existing pattern or click the Create button to define a new pattern.
The sixth parameter, Sheet Range, matches on columns based on a specified range of indices. It takes three arguments: the filter type, the starting index of the range, and the ending index of the range. See the Range Matcher section above for more information.
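To illustrate how Match Row, Include Matched Cell, and Cell Value Pattern interact, here is a rough sketch that works on a plain list-of-rows grid rather than a real spreadsheet. The match_cells function, the sample sheets, and the A1-style addressing helper are hypothetical and only model the behavior described above; they are not DarkShield's implementation.

```python
import re

def cell_address(row, col):
    """Convert zero-based indices to an A1-style address (columns A-Z only)."""
    return f"{chr(ord('A') + col)}{row + 1}"

def match_cells(grid, value_pattern, match_row=False, include_matched_cell=False):
    """Find header cells whose entire content matches the pattern, then
    collect the data cells below them (or to their right if match_row)."""
    regex = re.compile(value_pattern)
    matched = []
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if not regex.fullmatch(str(value)):
                continue
            if include_matched_cell:
                matched.append(cell_address(r, c))
            if match_row:
                # Header on the left: data continues to the right.
                matched.extend(cell_address(r, k) for k in range(c + 1, len(row)))
            else:
                # Header on top: data continues downward in the same column.
                matched.extend(
                    cell_address(k, c)
                    for k in range(r + 1, len(grid))
                    if c < len(grid[k])
                )
    return matched

# Typical layout: headers across the top, data below.
sheet = [
    ["Name", "Phone"],
    ["Ann", "555-0100"],
    ["Bob", "555-0101"],
]
column_hits = match_cells(sheet, r"Phone")
with_header = match_cells(sheet, r"Phone", include_matched_cell=True)

# Transposed layout: headers down the left, data to the right.
transposed = [
    ["Name", "Ann", "Bob"],
    ["Phone", "555-0100", "555-0101"],
]
row_hits = match_cells(transposed, r"Phone", match_row=True)

print(column_hits, with_header, row_hits)
```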
JSON Path Matcher
The JSON Path Matcher is a Location Matcher for values based on JSON file structure. If the structure of the JSON file or JSON object is known ahead of time, it is faster and more accurate to match on the JSON metadata instead of its data contents. Currently, this is a DarkShield-only Search Matcher type.
On the IRI Library form editor’s Location Matchers wizard page, a JSON Path Matcher will accept only one parameter: the JSON path. The JSON path is used to match on the structure of a JSON object or file. In the JSON Path field, specify the path in correct JSONPath syntax.
Below is an example of what matches would be returned from the sample JSON (top) when utilizing the $..name JSON path matcher. $..name will match on all values of a JSON object that have the key of name.
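For readers without the screenshot at hand, the effect of the recursive-descent path $..name can be approximated in a few lines of standard-library Python. The find_key helper and the sample document are hypothetical; a real JSONPath engine supports far more syntax than this sketch does.

```python
import json

def find_key(node, key):
    """Collect every value stored under `key` at any depth,
    mimicking what the JSONPath `$..key` selects."""
    found = []
    if isinstance(node, dict):
        for k, value in node.items():
            if k == key:
                found.append(value)
            # Keep descending: nested objects may hold the key too.
            found.extend(find_key(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_key(item, key))
    return found

# Hypothetical sample data.
document = json.loads("""
{
  "name": "Acme",
  "employees": [
    {"name": "Ann", "role": "admin"},
    {"name": "Bob", "manager": {"name": "Cy"}}
  ]
}
""")

names = find_key(document, "name")
print(names)
```

Every name value is returned regardless of nesting depth, while values under other keys such as role are left alone.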
XML Path Matcher
The XML Path Matcher is a type of Location Matcher for matching on values based on the XML structure in which they are found. If the structure of the XML document is known ahead of time, matching on that structure instead of the document's contents will provide far faster (and likely more accurate) search results than a Data Matcher alone. Currently, this search matcher type is only supported in DarkShield.
The XML path parameter is used to match on the structure of an XML document. From the IRI Library form editor’s Location Matchers wizard page, the XML Path Matcher will accept only one parameter called the XML path. In the XML Path field, specify the path using valid XPath syntax.
Below is an example of the results of using the XML path //name. This path will match on all XML elements with a tag name of name.
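The same selection can be tried with Python's standard library as a rough stand-in. Note that xml.etree.ElementTree supports only a subset of XPath and expects a relative path, so //name is written as .//name here; the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
root = ET.fromstring("""
<company>
  <name>Acme</name>
  <employee>
    <name>Ann</name>
    <email>ann@example.com</email>
  </employee>
  <employee>
    <name>Bob</name>
  </employee>
</company>
""")

# './/name' selects every descendant element tagged 'name', at any depth.
names = [element.text for element in root.findall(".//name")]
print(names)
```

Elements with other tags, such as email, are not selected, so only values in known name locations would be flagged.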
In Closing
If you are going to be searching or masking structured or semi-structured data and you know where PII may be located based on source metadata, Location Matchers are recommended. Location Matchers are more accurate, require less time to process data than Data Matchers, and are easy to configure. Of course, it is also possible to have a layered approach to search matching by using both Location and Data Matchers for higher accuracy. If you have any questions or need help with these concepts, please email info@iri.com.