Find & Mask PII in BigTable, Cosmos and Dynamo…
Abstract: This article covers the use of the IRI DarkShield API for automatically locating and de-identifying PII or other sensitive data in the NoSQL databases of the three major cloud providers: Google BigTable, MS CosmosDB in Azure, and Amazon DynamoDB. Prior articles in this blog cover how DarkShield wizards in IRI Workbench find and mask data in other popular NoSQL DBs, including Cassandra, Elasticsearch and MongoDB [1]. A subsequent article covers CouchDB, Redis and Solr.
What is NoSQL?
NoSQL typically stands for “not only SQL”, although others may say it stands for “non SQL”. NoSQL was introduced to provide an alternative to relational databases, which at the time were the dominant force in the industry.
Because NoSQL databases are non-tabular, data is stored differently compared to SQL databases. There are actually various types of NoSQL databases based on their data model. These data models include documents, key-value pairs, wide-column, and graphs.
The Strength of NoSQL Databases
According to A Cloud Guru, relational databases have “inflexible schemas and notoriously difficult horizontal scaling [which means] they don’t always fit well in a highly scalable and geographically distributed infrastructure stack” [2].
In comparison, the flexibility of the NoSQL document-model makes it easier to change data. NoSQL databases are also easier to scale horizontally, and usually the cloud providers handle the operational overhead of managing infrastructure.
Decision makers generally weigh a few factors when choosing NoSQL over a relational database. According to MongoDB, the drivers are: “fast-paced Agile development, storage of structured and semi-structured data, huge volumes of data, requirements for scale-out architecture, modern application paradigms like microservices and real-time streaming” [3].
NoSQL DB Security Concerns
NoSQL DBs share many security issues with traditional relational (SQL) databases, but also carry some unique risks. According to the International Journal of Digital Society, “insufficient or ineffective input validation, errors in the application level permissions handling, weak authentication, insecure communication, illegal access to unencrypted data, etc. are some of the vulnerabilities applicable for NoSQL” [4].
Like SQL injections, NoSQL injections are also possible when input validation is not handled properly. Because NoSQL databases do not have a common query language, queries are written in the programming language (PHP, JavaScript, Python, etc) of the application connected to the database. This means NoSQL injections can result in commands being executed not only in the database, but also in the application itself.
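To make the injection risk concrete, here is a minimal sketch, assuming a MongoDB-style filter document in which a dictionary value such as {"$ne": None} is interpreted as an operator rather than as data. The helper names (require_plain_string, build_login_filter) are hypothetical and no database driver is involved; the sketch only shows how parsed JSON passed straight into a query lets an attacker swap a value for an operator, and how a type check stops it.

```python
import json


def require_plain_string(value):
    """Reject anything that is not a plain string (e.g. an operator dict)."""
    if not isinstance(value, str):
        raise ValueError("expected a plain string, got %s" % type(value).__name__)
    return value


def build_login_filter(body, validate=True):
    """Build a MongoDB-style filter document from a parsed JSON request body."""
    user = body.get("user")
    password = body.get("password")
    if validate:
        user = require_plain_string(user)
        password = require_plain_string(password)
    return {"user": user, "password": password}


# A malicious request body: the password is an operator, not a string.
malicious = json.loads('{"user": "admin", "password": {"$ne": null}}')

# Without validation, the operator lands in the filter unchanged.
unsafe_filter = build_login_filter(malicious, validate=False)

# With validation, the request is rejected before any filter is built.
try:
    build_login_filter(malicious)
    rejected = False
except ValueError:
    rejected = True
```

In the unvalidated case the filter asks for a user named "admin" whose password is "not equal to null", which matches regardless of the real password.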
There is a long list of endpoint security practices for NoSQL DBs. But even with them, would-be assailants still manage to punch holes in those defenses. Companies must thus evolve to harden the security profile of these collections with another level of protection.
This is where IRI DarkShield comes in. As a data-centric, or “startpoint security” solution, DarkShield masking provides another important layer of data protection atop the endpoint measures deployed by cloud database service providers.
About IRI DarkShield
IRI DarkShield is a data masking tool for finding and de-identifying sensitive data in semi-structured and unstructured files and databases. DarkShield is one of three core data masking products in the IRI Data Protector Suite which leverage graphical data classification, searching, and masking job design models in the IRI Workbench IDE, built on Eclipse.
As of DarkShield Version 4, however, two powerful Remote Procedure Call (RPC) Application Programming Interface (API) versions are also provided: the “Base” DarkShield API and the DarkShield-Files API. The DarkShield APIs extend the use of DarkShield functionality outside of Workbench and leverage a plugin on top of an IRI Web Services platform named Plankton.
To find and protect sensitive data in a wide range of sources, the DarkShield APIs use specified search matchers and masking rules that follow business rules. For more information on creating search matchers and masking rules, please refer to this article.
The “Base” DarkShield API is used to search and mask unstructured text outside the context of files. Alternatively, the DarkShield-Files API provides the ability to search and mask PII in files.
With the DarkShield-Files API, semi-structured and unstructured data such as plain text files, CSV/TSV, Word documents, Excel, PDF, JSON, XML, Parquet, and JPEG and PNG images can be searched and masked.
AWS DynamoDB, Azure CosmosDB, Google BigTable and the DarkShield API
The companies reigning over cloud services for NoSQL databases are Amazon AWS with DynamoDB, Microsoft with Azure CosmosDB, and Google with Cloud BigTable. The focus of this article is on these three well-known service providers and how the DarkShield-Files API can be leveraged to search and mask data inside their cloud-hosted NoSQL databases.
Those unfamiliar with connecting to and querying NoSQL databases programmatically need not worry. AWS, Azure, and Google Cloud all provide extensive documentation on how to access database content using software development kits (SDKs) in various programming languages.
The DarkShield-Files API demos currently uploaded to GitHub are written in Python; as such, those projects use the client libraries for Python. However, other calling languages, like Java, can be used.
These calling programs, or “glue code” to the API, are where the connection and query procedures are defined. See below for the links to the DarkShield-Files API demos:
DarkShield Search and Mask Contexts
The IRI darkshield-files-api demos in GitHub each include a setup file. The setup file defines the search context, mask context, file search context, and file mask context that the DarkShield-Files API needs. Without these contexts defined, the DarkShield-Files API will not search or mask.
DarkShield API Search Context
A search context uses matchers to designate the PII that will be annotated in the files being read. There are a variety of search matcher types. The DarkShield-Files API supports search matchers based on regular expressions, named entity recognition (NER) models, and lookups against predefined text values stored in SET files.
The image above displays an EmailMatcher that uses a regular expression pattern to search for any text containing an “@” and a website suffix, an SsnMatcher that uses a regular expression pattern to search for any text that follows the format of an SSN, and a NameMatcher that uses a Named Entity Recognition (NER) model to identify names.
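As a rough illustration of the three matchers described above, the sketch below writes a search context as a Python dictionary and exercises the two regular expressions locally with the standard re module. The field names ("name", "matchers", "type", "pattern"), the type values, and the patterns themselves are assumptions for illustration, not copied from the demo's setup file; the DarkShield API documentation is the authority on the actual schema.

```python
import re

# Illustrative patterns (assumptions, not the demo's exact expressions).
EMAIL_PATTERN = r"\b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b"
SSN_PATTERN = r"\b\d{3}-\d{2}-\d{4}\b"

# Hypothetical shape of a search context payload.
search_context = {
    "name": "SearchContext",
    "matchers": [
        {"name": "EmailMatcher", "type": "pattern", "pattern": EMAIL_PATTERN},
        {"name": "SsnMatcher", "type": "pattern", "pattern": SSN_PATTERN},
        # An NER matcher would also need model settings, omitted here.
        {"name": "NameMatcher", "type": "ner"},
    ],
}


def find_matches(text, context):
    """Apply only the regex-based matchers locally.

    Returns a list of (matcher name, matched text) pairs.
    """
    hits = []
    for matcher in context["matchers"]:
        if matcher["type"] != "pattern":
            continue  # NER matching needs a model, so it is skipped here
        for match in re.finditer(matcher["pattern"], text):
            hits.append((matcher["name"], match.group(0)))
    return hits


sample = "Contact jane.doe@example.com, SSN 123-45-6789."
hits = find_matches(sample, search_context)
```

Trying patterns locally like this is a quick way to check what a regular expression will and will not catch before handing the context to the API.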
File Search Context
For specific file formats, the DarkShield-Files API provides users with additional filtering and matching options. In this example, path matchers are provided for JSON and XML files.
Mask Context
Note: In older versions of the DarkShield-Files API, the configuration for rules and rulesMatcher requires “type: cosort” and “type: name” in their respective configurations.
For the API to know what to do with PII that has been discovered during search operations, a mask context must be defined. The first part of a mask context contains a list of rules that we want to apply. Each rule has an expression that dictates what masking function will be used.
These expressions are also documented in the IRI FieldShield manual and IRI Workbench, and because the functions are compatible, enterprise data integrity can be preserved post-masking regardless of source. The list of possible masking rules includes:
- Assignment Expressions
- Blur Functions
- Deletion Functions
- Encoding Functions
- Encryption Functions (AES, 3DES, FPE, GPG)
- Hashing Functions
- Pseudonym Replacement
- Redaction Functions
- String Manipulation Functions
In the code above we have three rules called HashRule, RedactSsnRule, and FpeRule. Respectively, the rules were assigned a hashing function, a function to replace characters with ‘*’, and format preserving encryption. The DarkShield API uses the same masking functions as IRI FieldShield (which masks structured data in SortCL-compatible job scripts).
After the masking rules come the rule matchers, which pair search matchers with masking rules.
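The pairing of search matchers with masking rules can be pictured with the sketch below. The rule and search matcher names come from the text above; everything else is a hypothetical illustration rather than the demo's actual configuration: the field names, the "pattern" key used to reference a search matcher by exact name, the rule matcher names, the choice of which rule goes with which matcher, the expression strings (modeled on FieldShield-style function calls and not verified here), and the two helper functions.

```python
# Hypothetical shape of a mask context; expressions are unverified placeholders.
mask_context = {
    "name": "MaskContext",
    "rules": [
        {"name": "HashRule", "expression": "hash_sha2(${EMAIL})"},
        {"name": "RedactSsnRule", "expression": "replace_chars(${SSN},'*')"},
        {"name": "FpeRule", "expression": "enc_fp_aes256_alphanum(${NAME})"},
    ],
    "ruleMatchers": [
        {"name": "EmailRuleMatcher", "rule": "HashRule", "pattern": "EmailMatcher"},
        {"name": "SsnRuleMatcher", "rule": "RedactSsnRule", "pattern": "SsnMatcher"},
        {"name": "NameRuleMatcher", "rule": "FpeRule", "pattern": "NameMatcher"},
    ],
}


def rule_for(matcher_name, context):
    """Return the rule paired with a search matcher name, or None if unpaired."""
    rules = {rule["name"]: rule for rule in context["rules"]}
    for rule_matcher in context["ruleMatchers"]:
        if rule_matcher["pattern"] == matcher_name:
            return rules.get(rule_matcher["rule"])
    return None


def undefined_rules(context):
    """List rule matchers that reference a rule missing from 'rules'."""
    defined = {rule["name"] for rule in context["rules"]}
    return [
        rule_matcher["name"]
        for rule_matcher in context["ruleMatchers"]
        if rule_matcher["rule"] not in defined
    ]
```

A check like undefined_rules is cheap to run before sending a context to the API, since a rule matcher that points at a misspelled rule name would otherwise only surface at masking time.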
Last is the file mask context. For specific file formats, the DarkShield-Files API provides users with additional configuration options. In this example, the configuration for JSON files specifies pretty printing.
File Mask Context
Authentication Credentials of NoSQL Demos
Accessing BigTable, CosmosDB, or DynamoDB programmatically requires the user’s login credentials in some form for authentication. There are various ways to store and access these credentials securely, but for the sake of simplicity the three NoSQL demos either use credential files or environment variables.
CosmosDB credentials.json | DynamoDB .aws/credentials file
Google BigTable allows you to generate a private key for your credentials and download the newly generated key in a file.
The Google BigTable demo uses the environment variable GOOGLE_APPLICATION_CREDENTIALS to designate the path to the private key file downloaded from the Google Cloud Platform console.
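Because a missing or mistyped path is a common cause of authentication failures, a calling program can check the variable before connecting. The sketch below is one way to do that with the standard library only; the function name google_credentials_path and the optional environ parameter (added so the check can be tried without touching the real environment) are hypothetical and not part of the demo.

```python
import os


def google_credentials_path(environ=None):
    """Return the key file path from GOOGLE_APPLICATION_CREDENTIALS.

    Raises RuntimeError if the variable is unset or the file does not exist.
    """
    environ = os.environ if environ is None else environ
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise RuntimeError("key file not found: %s" % path)
    return path
```

Google's client libraries read this variable on their own through Application Default Credentials, so the check only serves to fail early with a clearer message.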
Taking a Closer Look at the DarkShield API Interface to BigTable
The Main Program
To give an idea of how the main program is implemented, below is a screenshot of the Google BigTable main.py.
All of the previously linked demos use a main program that facilitates the DarkShield-Files API call. The main program will contain the glue code that performs the following actions:
- Authenticates to the data source (NoSQL DB)
- Accesses and queries the database
- Makes POST requests to the DarkShield-Files API with the content of the DB
- Writes the resulting output from the DarkShield-Files API back to the database
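The steps in this list can be sketched as a small pipeline. This is not the demo's main.py: the names mask_table, read_rows, mask_payload, and write_row are hypothetical, and the masking step is passed in as a callable so that the sketch runs without a database or a DarkShield server. In a real program, read_rows and write_row would wrap the provider's SDK (after authenticating), and mask_payload would POST the content to the DarkShield-Files API and return the masked result.

```python
import json


def mask_table(read_rows, mask_payload, write_row):
    """Read each row, send it for masking, and write the masked row back.

    read_rows    -- callable returning an iterable of (key, dict) pairs
    mask_payload -- callable taking JSON bytes and returning masked JSON bytes
    write_row    -- callable taking (key, dict) that persists the masked row
    Returns the number of rows processed.
    """
    count = 0
    for key, row in read_rows():
        payload = json.dumps(row).encode("utf-8")
        masked = json.loads(mask_payload(payload).decode("utf-8"))
        write_row(key, masked)
        count += 1
    return count


# Stand-ins so the sketch runs on its own.
source = {"row1": {"name": "Jane Doe", "ssn": "123-45-6789"}}
target = {}


def fake_mask(payload):
    """Pretend masker that redacts the 'ssn' field; a real one would call the API."""
    row = json.loads(payload.decode("utf-8"))
    if "ssn" in row:
        row["ssn"] = "*" * len(row["ssn"])
    return json.dumps(row).encode("utf-8")


processed = mask_table(lambda: source.items(), fake_mask, target.__setitem__)
```

Keeping the read, mask, and write steps as separate callables means the destination can be changed by swapping write_row alone, without touching the rest of the glue code.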
In the BigTable demo the resulting output has been written back into the database. Alternatively, the code could be altered to write the masked results to files or to a separate test database. The DarkShield-Files API is a flexible tool that is only limited by the glue code that manipulates it.
Executing the Program
To execute, run python main.py "project_id" "instance_id" from your terminal, where project_id is your Cloud Platform project ID and instance_id is the ID of the Cloud Bigtable instance you wish to connect to.
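A main program can read these two positional arguments with the standard argparse module. The sketch below assumes that approach; the demo's main.py may parse its arguments differently, and the function name parse_args and the description text are illustrative.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments expected on the command line."""
    parser = argparse.ArgumentParser(
        description="Search and mask PII in a Cloud Bigtable instance."
    )
    parser.add_argument("project_id", help="Google Cloud Platform project ID")
    parser.add_argument("instance_id", help="Cloud Bigtable instance ID")
    return parser.parse_args(argv)


# Passing a list simulates: python main.py my-project my-instance
args = parse_args(["my-project", "my-instance"])
```

Calling parse_args() with no argument reads from the real command line, and argparse prints a usage message and exits if either ID is missing.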
Below is an example of what the execution may look like:
Results of Searching and Masking of PII via the DarkShield API
Google BigTable
Below is a demonstration of the results of search and masking operations performed on Google Cloud BigTable using the BigTable demo on GitHub:
BigTable Demo Project
Original data and masked results after execution of the IRI DarkShield BigTable demo
Azure CosmosDB
Below is a demonstration of the results of search and masking operations performed on CosmosDB:
CosmosDB data source explorer
Vulnerable PII in a CosmosDB collection.
CosmosDB collection item after masking
Amazon DynamoDB
Below is a demonstration of the results of search and masking operations performed on DynamoDB:
AWS NoSQL Workbench provides UI to DynamoDB
Unmasked PII in DynamoDB Collections
Masked results exported to csv format part 1:
Masked results exported to csv format part 2:
Conclusion
Finding and masking PII through the DarkShield-Files API is an “open” solution not constrained by the data source or silo. As with RDBs, files, documents and images, DarkShield’s API delivers flexible, codable solutions to detect and protect sensitive structured, semi-structured and unstructured data in almost any NoSQL database, whether it runs on-premises or in the cloud.
1. Note that the same DarkShield base API described herein can also be used on those three databases, and IRI is now also working to support Couchbase, Redis, and Solr. The DarkShield API for files finds and masks data in RDB C/BLOB columns, unstructured text and log files, semi-structured EDI files like HL7, JSON, X12 and XML, MS and PDF documents and many image formats.
2. Vanbuskirk, Mike Nov, et al. “NoSQL Databases Comparison: Cosmos DB VS DynamoDB VS Cloud Datastore and Bigtable.” A Cloud Guru, 25 June 2021, acloudguru.com/blog/engineering/comparing-cloud-nosql-databases-dynamodb-vs-cosmos-db-vs-cloud-datastore-and-bigtable
3. “What Is NoSQL? NoSQL Databases Explained.” MongoDB, www.mongodb.com/nosql-explained.
4. Shahriar, Hossain, and Hisham M Haddad. “Security Vulnerabilities of NoSQL and SQL Databases for MOOC Applications.” International Journal of Digital Society, Mar. 2017.