Finding & Masking PII in Cloud File Stores
Editor’s Note: This article covers the discovery and de-identification of PII and other sensitive data in semi-structured or unstructured files stored in AWS S3 buckets, GCP folders, and Azure Blob Storage, via the IRI DarkShield API for files. See this article for files in SharePoint Online in Azure, and this article for configuring DarkShield file search and masking jobs in the GUI.
While DarkShield can also be used to find and mask data in NoSQL DBs, RDBs, and flat files in the cloud, IRI FieldShield is often preferred for masking structured data because of its additional, simultaneous data mapping and manipulation capabilities. To read about masking flat files in cloud stores with FieldShield, see this article. For masking PII in RDBs, see the links in this FAQ.
The Popularity of Cloud Storage
As more computing activity makes its way into the cloud, so does storage. This is only logical: cloud systems perform better when the data they process is close by, just as on-premise machines run faster when data is stored on or near them.
Cloud data storage is also popular because of the purchasing and maintenance headaches associated with on-premise storage devices. Cloud providers allow companies of all sizes to store data off-site, whether or not that data is associated with other cloud services or applications.
Cloud Storage Encryption
Cloud storage service providers typically offer two forms of encryption: in transit (dynamic) and at rest (static).
Encryption in transit protects your data as it travels between your local machine and the cloud. The encrypted data is automatically decrypted when it reaches its destination. This is typically accomplished via TLS (Transport Layer Security), the successor to SSL (Secure Sockets Layer).
Encryption at rest, which can be considered a form of static or persistent data masking, is applied while the data is sitting idle in files, folders, databases, or buckets. When data is extracted from the cloud, it is automatically decrypted before it leaves.
Security Concerns
Does the protection that cloud storage service providers offer guarantee that your sensitive data will not be exposed? Even with in-transit and at-rest encryption, there are still ways for the criminally minded to gain access to sensitive data kept in the cloud.
According to Cypress Data Defense, “Some of the most common cloud security risks include unauthorized access through improper access controls and the misuse of employee credentials.”
Looking more deeply at the risks of misconfigured cloud storage, there are a couple of scenarios that make data vulnerable. First is when security groups are misconfigured. This can result in individuals accessing the cloud servers and extracting data.
Data extracted from cloud storage is no longer encrypted. As mentioned, encryption applied by service providers at rest or in transit is undone by the time the data is in front of the individual.
The second risk of misconfigured cloud storage is a lack of access restrictions, which again means that individuals can gain access to data in your cloud storage.
Improper actions and behavior by company employees are a risk to consider in any industry. An insider with valid credentials does not have to get past perimeter security to access data in the cloud.
Thus, because in-transit and at-rest encryption can be circumvented, cloud storage users need an added layer of security to protect their data. Client-side, data-centric masking functions like encryption add that necessary layer of protection.
Unfortunately, client-side masking of semi-structured and unstructured data sources presents several development and support challenges for most companies, even ones with backgrounds in optical character recognition, machine learning, and so on.
About IRI DarkShield
IRI DarkShield is a data masking tool for finding and de-identifying sensitive data in semi-structured and unstructured files and databases. DarkShield is one of three core data masking products in the IRI Data Protector Suite which can leverage graphical data classification, searching, and masking job design models in the IRI Workbench IDE, built on Eclipse.
As of DarkShield Version 4, however, two powerful Remote Procedure Call (RPC) Application Programming Interface (API) versions are provided: the “Base” DarkShield API and the DarkShield-Files API. The DarkShield APIs extend the use of DarkShield functionality outside of Workbench and leverage a plugin on top of the IRI Web Services Platform (code named Plankton).
To find and protect sensitive data in a wide range of sources, the DarkShield APIs use specified search matchers and masking rules that follow business rules. For more information on creating search matchers and masking rules, please refer to this article.
The “Base” DarkShield API is used to search and mask unstructured text outside the context of files. The DarkShield-Files API, by contrast, searches and masks files. With the DarkShield-Files API, semi-structured and unstructured data like plain text files, CSV/TSV, Word documents, Excel, PDF, JSON, XML, Parquet, and JPEG and PNG images can be searched and masked.
Amazon S3, Google Cloud, MS Azure, and the DarkShield Files API
In the cloud market for storing files, usually referred to as BLOBs (Binary Large Objects), there are several competitors. The focus of this article is on the three best-known cloud storage service providers: Amazon S3, Google Cloud Storage, and Microsoft Azure Storage, and how to leverage the DarkShield-Files API with these public silos.
Previously, IRI DarkShield would only search and mask PII in on-premise file systems. With the most recent version however, DarkShield users can now add another important, finely-targeted layer of data protection atop the default security measures deployed by cloud storage service providers, too. This article demonstrates how the DarkShield-Files API can access, search, and mask PII inside cloud BLOBs.
Amazon S3, Google Cloud Storage, and Azure BLOB Storage all provide client libraries that let application code access stored content without going through the service’s console or command line. These libraries are also well documented.
The DarkShield-Files API demos currently uploaded to GitHub are written in Python; as such, those projects use the client libraries for Python. However, other calling languages, like Java, can be used.
These calling programs, or “glue code” to the API, are where the cloud storage (or other specialized input and/or output) procedures can be defined. See the sample code below for the IRI DarkShield API:
API Demo Setup File
The IRI darkshield-files-api demos on GitHub include a setup file. The setup file defines a search context, mask context, file search context, and file mask context, all of which the DarkShield-Files API needs. Without these contexts defined, the DarkShield-Files API will not search or mask.
A search context is created to define the PII that will be looked for in the files. There are a variety of matcher types for search matchers. In the image above we have four matchers and three types of search matchers.
First, an SsnMatcher uses regular expression patterns to search for any text that follows the format of a Social Security number (SSN). Second, a CountryMatcher uses a dictionary lookup against a set file called countries.set, which contains a list of country names.
Third, a NameMatcher uses a Named Entity Recognition (NER) model to identify names. Lastly, an EmailMatcher uses regular expression patterns to search for any text that contains an “@” and a website suffix.
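To make the matcher types more concrete, here is a minimal Python sketch of what pattern-based and dictionary-based matching amount to conceptually. It is not DarkShield code or its configuration syntax: the regular expressions, the inline country set (standing in for countries.set), and the function name find_matches are assumptions made for illustration. The NER-based NameMatcher is left out because it depends on a trained model.

```python
import re

# Hypothetical patterns; the expressions used in the actual demo may differ.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.%+-]+@[\w-]+(?:\.[\w-]+)+\b")

# Stand-in for the entries of a set file such as countries.set.
COUNTRIES = {"Canada", "France", "Japan"}


def find_matches(text):
    """Return (matcher_name, matched_text) pairs found in text."""
    matches = []
    for m in SSN_PATTERN.finditer(text):
        matches.append(("SsnMatcher", m.group()))
    for m in EMAIL_PATTERN.finditer(text):
        matches.append(("EmailMatcher", m.group()))
    # Dictionary lookup: report each whole word that appears in the set.
    for word in re.findall(r"[A-Za-z]+", text):
        if word in COUNTRIES:
            matches.append(("CountryMatcher", word))
    return matches


sample = "Ann (SSN 123-45-6789, ann@example.com) moved to Canada."
print(find_matches(sample))
```

Each hit is tagged with the name of the matcher that found it, which is what later allows a masking rule to be chosen per matcher.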
For specific file formats, the DarkShield-Files API provides users with additional filtering and matching options. In this example, path matchers are provided for json and xml files.
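As a rough illustration of what path-based filtering does, the sketch below selects values at a given location in JSON and XML documents using only the Python standard library. The helper names and sample documents are hypothetical, and this is not how DarkShield expresses or evaluates its path matchers; it only shows the idea of restricting a search to named fields.

```python
import json
import xml.etree.ElementTree as ET


def json_values_for_key(node, key):
    """Yield every value stored under `key` anywhere in parsed JSON."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from json_values_for_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from json_values_for_key(item, key)


def xml_text_for_path(xml_text, path):
    """Return the text of every element matched by an ElementTree path."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(path)]


doc = json.loads('{"customers": [{"name": "Ann Lee", "city": "Oslo"}]}')
print(list(json_values_for_key(doc, "name")))

xml_doc = "<people><person><name>Ann Lee</name></person></people>"
print(xml_text_for_path(xml_doc, ".//name"))
```

Targeting a field by path can catch values that a pattern or dictionary alone would miss, because the field's location, rather than its content, marks it as sensitive.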
For the API to know what to do with PII that has been discovered during search operations, a mask context must be defined. The first part of a mask context contains a list of rules that we want to apply.
In the code above we have three rules called HashRule, RedactSsnRule, and FpeRule. Respectively, the rules were assigned a hashing function, a function to replace characters with ‘*’, and format preserving encryption. The DarkShield API uses the same masking functions as IRI FieldShield (which masks structured data in SortCL-compatible job scripts).
These expressions are documented in the FieldShield manual and IRI Workbench, and because the functions are compatible, enterprise data integrity can be preserved post-masking regardless of source.
The rule matchers are relatively easy to understand. I will walk through the second rule matcher, FpeRuleMatcher. It specifies that any text found by the EmailMatcher (which looks for emails using a regular expression) and the NameMatcher (which looks for names using an NER model), both defined in the search context, will be masked with the FpeRule.
FPE (Format Preserving Encryption) encrypts plaintext into ciphertext of the same length and format. The FpeRule was defined in the rules for the masking context.
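The relationship between rules and rule matchers can be sketched in plain Python as a lookup from matcher name to masking function. This is an illustration only: DarkShield applies FieldShield masking functions, not the hypothetical helpers below. In particular, toy_format_preserving_rule is a simple keyed character shift used as a stand-in; it is not a real FPE algorithm and is not secure. Only the pairing of EmailMatcher and NameMatcher with the FPE rule comes from the text above; the other pairings are assumed.

```python
import hashlib
import hmac


def hash_rule(text):
    """One-way masking: SHA-256 hex digest of the matched text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def redact_rule(text):
    """Replace every character of the match with '*'."""
    return "*" * len(text)


def toy_format_preserving_rule(text, key=b"demo-key"):
    """Toy stand-in for FPE (NOT a real FPE algorithm, not secure).

    Shifts ASCII digits within 0-9 and letters within their own case,
    leaving other characters alone, so length and layout are preserved.
    """
    stream = hmac.new(key, b"toy-fpe", hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(text):
        shift = stream[i % len(stream)]
        if "0" <= ch <= "9":
            out.append(chr((ord(ch) - ord("0") + shift) % 10 + ord("0")))
        elif "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


# Rule matchers: which rule handles text found by which search matcher.
# The SsnMatcher and CountryMatcher pairings are assumptions.
RULE_FOR_MATCHER = {
    "EmailMatcher": toy_format_preserving_rule,
    "NameMatcher": toy_format_preserving_rule,
    "SsnMatcher": redact_rule,
    "CountryMatcher": hash_rule,
}


def mask(matcher_name, matched_text):
    """Apply the rule associated with the matcher that found the text."""
    return RULE_FOR_MATCHER[matcher_name](matched_text)


print(mask("SsnMatcher", "123-45-6789"))
print(mask("EmailMatcher", "ann@example.com"))
```

The masked email keeps its length and the positions of “@” and “.”, which is the property that makes format-preserving output usable by downstream systems that validate field formats.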
Main File
In the DarkShield-Files API demos uploaded to GitHub, each demo has a main file that contains the code that interacts with cloud storage and makes the DarkShield-Files API call. In the three demos that access cloud storage BLOBs, arguments are passed from your command line to the program.
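The overall shape of such a main file can be sketched as a loop that lists objects, downloads each one, sends it for masking, and uploads the result. The sketch below is a hedged outline, not the demo code: the function mask_all and its four callables are hypothetical placeholders for cloud storage SDK operations and for the HTTP request to the DarkShield-Files API, and in-memory dictionaries stand in for buckets so that it runs without credentials.

```python
def mask_all(list_keys, download, mask_file, upload):
    """Run every stored object through a masking call and store the result.

    The four callables are placeholders for storage SDK operations and for
    the request to the DarkShield-Files API; none of them is a real API.
    """
    processed = []
    for key in list_keys():
        original = download(key)
        masked = mask_file(key, original)
        upload(key, masked)
        processed.append(key)
    return processed


# In-memory stand-ins so the sketch runs without cloud credentials.
source = {"a.txt": b"SSN 123-45-6789", "b.txt": b"no PII here"}
target = {}

done = mask_all(
    list_keys=lambda: sorted(source),
    download=source.__getitem__,
    # Placeholder masking step; a real run would call the API here.
    mask_file=lambda key, data: data.replace(b"123-45-6789", b"*" * 11),
    upload=target.__setitem__,
)
print(done, target)
```

Keeping the storage operations separate from the masking call is what allows the same pattern to be reused across S3, Google Cloud Storage, and Azure BLOB Storage with only the storage callables changing.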
In the S3 demo, the user will provide one argument, the bucket name or the URL. The URL can contain a prefix, which means only files under a certain prefix will be searched and masked. The masked results will be placed in the bucket that was searched, under a folder named masked/.
One argument is passed (name of bucket to be searched).
One argument for the url is passed (allows targeting of specific folders and files to be searched).
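A minimal sketch of how a single argument could be split into a bucket name and an optional prefix, and how a destination key under masked/ could be derived, is shown below. The s3://bucket/prefix URL form, the helper names, and the choice to keep the original key path under masked/ are all assumptions for illustration; the demo may accept a different URL format or lay out its output differently.

```python
from urllib.parse import urlparse


def parse_s3_target(arg):
    """Split a bucket name or s3:// URL into (bucket, prefix)."""
    if "://" in arg:
        parsed = urlparse(arg)
        return parsed.netloc, parsed.path.lstrip("/")
    return arg, ""


def masked_key(key, folder="masked/"):
    """Key under which the masked copy of `key` would be written."""
    return folder + key


print(parse_s3_target("example-bucket-iri"))
print(parse_s3_target("s3://example-bucket-iri/reports/2021/"))
print(masked_key("reports/2021/q1.json"))
```

An empty prefix means the whole bucket is in scope, while a non-empty prefix limits the search to keys that begin with it.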
In both the Google Cloud Storage and Azure BLOB Storage demos, a user passes up to three arguments. The first is the destination where the masked BLOBs will reside. The second is the bucket/container to be searched.
The third, which is optional, is a path within the bucket/container. If no third argument is provided, all of the files in the bucket/container will be searched and masked automatically.
Two arguments (name of bucket to place masked BLOBs and name of bucket to be searched)
Three arguments (same as before but third argument allows targeting of specific folders and files)
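The two-or-three argument convention described above maps naturally onto Python's argparse module with an optional trailing positional. The following is a sketch under assumptions, not the demo's actual argument handling: the argument names destination, source, and path, and the build_parser helper, are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for: destination, source, and an optional path."""
    parser = argparse.ArgumentParser(
        description="Mask BLOBs from one bucket/container into another."
    )
    parser.add_argument("destination", help="bucket/container for masked BLOBs")
    parser.add_argument("source", help="bucket/container to search")
    parser.add_argument(
        "path",
        nargs="?",
        default="",
        help="optional folder or file path; empty means everything",
    )
    return parser


# Two arguments: the whole source bucket/container is in scope.
args = build_parser().parse_args(["masked-bucket", "example-bucket-iri"])
print(args.destination, args.source, repr(args.path))

# Three arguments: only the given path is in scope.
args = build_parser().parse_args(["masked-bucket", "example-bucket-iri", "hr/"])
print(args.destination, args.source, repr(args.path))
```

Using `nargs="?"` with an empty default lets the calling code treat "no path given" and "search everything" as the same case.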
When you name a destination bucket/container for the masked BLOBs that does not yet exist, it will be created. In Google Cloud Storage, though, if that destination bucket name is already in use by another account, you will receive an error from the Google Cloud API.
Before and After Images of BLOB
Content of XML BLOB before masking
Content of XML BLOB after masking
Before and After Images of Cloud Buckets/Containers
Amazon S3
Contents of S3 bucket called example-bucket-iri before API run
A new folder called masked is created that will contain masked versions of original files
The newly created masked files are placed in the masked folder
Google Cloud Storage BLOBs
Google Cloud Storage bucket list before API run
The files inside the example-bucket-iri that will be searched for sensitive data
A new bucket called masked-bucket was created after the API was run
Newly masked files were placed inside the masked-bucket after API runs
Azure BLOB Storage
A list of containers in Azure BLOB Storage
The contents inside the new-container in Azure BLOB Storage
A container called masked-container is created after API runs
Newly masked files are placed in the masked-container after API runs
Conclusion
Because its calling code can be customized, the DarkShield-Files API is not limited in what data sources it can access. The API adds a critical and highly granular layer of protection for data in the cloud. By searching and masking sensitive data located in BLOBs in the cloud, data security can be maintained even in the case of compromised accounts or improper access to storage in the cloud.
Citations
Cypress Data Defense. (2020, July 13). 7 cloud computing security vulnerabilities and what to do about them. Medium. Retrieved September 24, 2021, from https://towardsdatascience.com/7-cloud-computing-security-vulnerabilities-and-what-to-do-about-them-e061bbe0faee.