Finding and Masking PII in Couchbase, Redis and Solr…
Abstract: This article discusses the use of the IRI DarkShield-Files API for finding and masking PII and other sensitive data in the Couchbase, Redis, and Solr NoSQL databases. The two previous articles on NoSQL database masking via DarkShield covered Cassandra, Elasticsearch, and MongoDB, and then Bigtable, Cosmos DB, and DynamoDB.
About Couchbase
Couchbase is a multi-model NoSQL database. It should not be confused with CouchDB: Couchbase Server was originally named Membase and was renamed after the merger of CouchOne and Membase. Couchbase differs from CouchDB in several ways, including the lack of support for attaching files to documents and the addition of a SQL-like query language, N1QL (pronounced “nickel”).
In Couchbase, each document has an ID (key) and an associated value that holds the application data. Key-value operations retrieve or mutate the value stored under a given key.
Couchbase UI
Couchbase Security Concerns
Like many database platforms, Couchbase supports authentication, authorization, and encryption in transit. Couchbase servers are also exposed to the same kinds of attacks as other database services, including injection attacks, cross-site request forgery, credential leaks, and code exploits.
Furthermore, for any sensitive data stored on Couchbase servers, Couchbase strongly advises: “to secure such data, encrypt all important data and index storage-locations, using transparent data encryption, provided by 3rd party on-disk encryption software-vendors; which denies data-access to anyone who either does not possess an appropriate encryption-key, or is otherwise non-compliant with the configured security policy” [1].
Encrypting stored data is known as encryption at rest; if the server is compromised, the data residing in storage remains protected. The purpose of DarkShield is to find and protect that data in a granular, consistent way.
About Redis
Redis is an open source (BSD-licensed), in-memory NoSQL key-value data structure store. An in-memory database keeps its data in the system’s main memory (RAM) rather than on disk.
Keeping data in RAM makes access and processing faster than disk I/O. The trade-off is durability, because the data is held in volatile memory by default.
Its speed has made Redis very popular. According to Amazon, “Because of its fast performance, Redis is a popular choice for caching, session management, gaming, leaderboards, real-time analytics, geospatial, ride-hailing, chat/messaging, media streaming, and pub/sub apps” [2].
RedisInsight UI
Redis Security Concerns
Redis is designed to be used by trusted clients in a trusted environment. As such, it is not recommended to “expose the Redis instance directly to the internet or, in general, to an environment where untrusted clients can directly access the Redis TCP port or UNIX socket” [3].
Authentication is not enabled by default, but it can be turned on in the redis.conf file, which is also where the password is stored in clear text.
The Redis protocol has no concept of string escaping, so injection attacks should not occur under normal circumstances. That said, the CONFIG command currently allows clients to change the working directory and the name of the dump file.
According to Redis, “this allows clients to write RDB Redis files at random paths, that is a security issue that may easily lead to the ability to compromise the system and/or run untrusted code as the same user as Redis is running.”
About Solr
Solr is an open-source search engine written in Java and built on the Apache Lucene library. Its features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, and NoSQL features.
Although Solr is known for its search functionality, it can also be used as a document-based NoSQL database. Solr can ingest data from various sources, such as JSON, XML, PDF, and CSV files, and index it into JSON collections.
Solr Admin user interface
Solr Security Concerns
Solr supports TLS for encryption in transit, which protects traffic to and from Solr and between Solr nodes against data leaks. Solr’s security framework also supports authentication, authorization, and audit logging to identify users and restrict access to resources.
Solr does not currently support encryption at rest, so anyone who gains unauthorized access will be able to view sensitive data.
In addition, many exploitable vulnerabilities in Solr have been published over the past several years. Those gaps could ultimately have resulted in the hijacking of the entire server.
About IRI DarkShield
IRI DarkShield is a data discovery and masking package for finding and de-identifying sensitive data in semi-structured and unstructured files and databases, including the NoSQL platforms in this article. DarkShield is one of three core data masking products in the IRI Data Protector Suite which leverage graphical data classification, searching, and masking job design models in the IRI Workbench IDE, built on Eclipse.
Two powerful Remote Procedure Call (RPC) Application Programming Interface (API) versions are also provided: the “Base” DarkShield API and the DarkShield-Files API. The DarkShield APIs extend the use of DarkShield functionality outside of Workbench and leverage a plugin on top of an IRI Web Services platform named Plankton.
To find and protect sensitive data in a wide range of sources, the DarkShield APIs use specified search matchers and masking rules that follow business rules. For more information on creating search matchers and masking rules, please refer to this article.
The “Base” DarkShield API is used to search and mask unstructured text outside the context of files. Alternatively, the DarkShield-Files API provides the ability to search and mask PII in files.
With the assistance of the DarkShield-Files API, semi-structured and unstructured data like plain text files, CSV/TSV, HL7/X12, Word, Excel, PDF, JSON, XML, Parquet, BMP, GIF, JPG, PNG, TIF, and DICOM images can be searched and masked.
How the DarkShield-Files API Can Bolster Data Security
Most modern databases support in-transit encryption to protect data as it travels and at-rest encryption to protect data stored in the database. The limitation of both is that encryption applied by the service provider is undone by the time the data is presented to a user.
This means that as long as users have the access rights to see the data, neither form of encryption shields it from them. That is not a problem in itself, but what happens if access controls are improperly configured or an employee’s credentials are misused? This is where client-side encryption comes into play.
Client-side encryption is a method by which data is encrypted locally before it is placed in a database. With client-side encryption in place, the data stays encrypted even when it is viewed by someone who has gained unauthorized access.
The DarkShield-Files API can be used to implement client-side encryption: data is extracted from the database, sensitive values are found and masked, and the sanitized data is placed back into the database.
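The extract, mask, and write-back cycle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain dict stands in for the database, and the hypothetical mask_text function, which redacts anything shaped like a US Social Security number, stands in for the call to the DarkShield-Files API. None of these names come from the DarkShield demos.

```python
import re
from typing import Any, Callable, Dict

# Hypothetical stand-in for the masking service: redact SSN-shaped values.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_text(text: str) -> str:
    """Replace all but the last four digits of anything shaped like an SSN."""
    return SSN_PATTERN.sub(lambda m: "***-**-" + m.group()[-4:], text)


def mask_document(doc: Any, mask: Callable[[str], str]) -> Any:
    """Return a copy of a JSON-like document with every string value masked."""
    if isinstance(doc, dict):
        return {key: mask_document(value, mask) for key, value in doc.items()}
    if isinstance(doc, list):
        return [mask_document(value, mask) for value in doc]
    if isinstance(doc, str):
        return mask(doc)
    return doc  # numbers, booleans, and None pass through unchanged


def mask_store(store: Dict[str, Any], mask: Callable[[str], str]) -> int:
    """Extract each document, mask it, and write it back if it changed."""
    changed = 0
    for key in list(store):
        original = store[key]
        masked = mask_document(original, mask)
        if masked != original:
            store[key] = masked
            changed += 1
    return changed


# A dict standing in for a key-value or document database.
demo_store = {
    "customer::1": {"name": "Ann", "ssn": "123-45-6789",
                    "notes": ["SSN 987-65-4321 on file"]},
    "customer::2": {"name": "Bob"},
}
updated = mask_store(demo_store, mask_text)
```

Writing back only the documents that changed keeps the number of database mutations proportional to the amount of sensitive data actually found.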
DarkShield Search and Mask Contexts
The DarkShield-Files API is structured around using search contexts and mask contexts to perform search and masking operations on text parsed from the different file formats. The Files API also handles additional matchers, filters, and configuration options for specific file formats by configuring file search contexts and file mask contexts.
A search context uses matchers to designate the PII that will be annotated in the files it reads. Several matcher types are available. The DarkShield-Files API supports search matchers based on regular expressions, named entity recognition (NER) models, and ‘set’ files, which are text files containing a list of entries separated by newlines.
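To make the regular-expression and set-file matcher types more concrete, here is a minimal Python sketch of what each one does conceptually. It is not DarkShield's implementation; the function names, the email pattern, and the tuple layout are all assumptions made for illustration, and NER matching is left out because it requires a trained model.

```python
import re
from typing import Iterable, List, Set, Tuple

# (matcher name, start offset, end offset, matched text)
Match = Tuple[str, int, int, str]

# Illustrative email pattern; a production pattern would likely be stricter.
EMAIL_PATTERN = re.compile(r"\b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b")


def pattern_matches(name: str, pattern: "re.Pattern[str]", text: str) -> List[Match]:
    """Annotate every span of text that the regular expression matches."""
    return [(name, m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


def load_set_entries(lines: Iterable[str]) -> Set[str]:
    """Read a 'set' file: one entry per line, blank lines ignored."""
    return {line.strip() for line in lines if line.strip()}


def set_matches(name: str, entries: Set[str], text: str) -> List[Match]:
    """Annotate every literal occurrence of any set entry in the text."""
    found: List[Match] = []
    for entry in entries:
        for m in re.finditer(re.escape(entry), text):
            found.append((name, m.start(), m.end(), m.group()))
    return sorted(found, key=lambda match: match[1])


sample = "Contact Jane Doe at jane.doe@example.com"
names = load_set_entries(["Jane Doe\n", "\n", "John Smith\n"])
annotations = (pattern_matches("EmailMatcher", EMAIL_PATTERN, sample)
               + set_matches("NameMatcher", names, sample))
```

Each annotation records which matcher fired and where, which is the information a masking step needs in order to choose a rule for that span.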
For the API to know what to do with PII that has been discovered during search operations, a mask context must be defined. The first part of a mask context contains a list of rules that we want to apply. Each rule has an expression that dictates what masking function will be used.
These expressions are also documented in the IRI FieldShield manual and IRI Workbench. Because the same functions work in DarkShield, enterprise data integrity can be preserved after masking regardless of source. The list of available masking functions includes:
- Expressions
- Blurring (random noise)
- Deletion
- Encoding
- Scrambling
- Encryption (AES, 3DES, FPE, GPG, etc.) with multiple key management options
- Hashing
- Pseudonymization
- Redaction
- String Manipulation
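As a rough illustration of three of the function categories listed above (redaction, hashing, and pseudonymization), the following Python sketch uses only the standard library. These are generic textbook versions, not the IRI functions themselves; the function names, the key, and the replacement list are hypothetical.

```python
import hashlib
import hmac
from typing import Sequence


def redact(value: str, keep_last: int = 0, mask_char: str = "*") -> str:
    """Overwrite characters with mask_char, optionally keeping the last few."""
    visible = value[-keep_last:] if keep_last else ""
    return mask_char * (len(value) - len(visible)) + visible


def hash_value(value: str, key: bytes) -> str:
    """Keyed SHA-256 hash: one-way, and stable for the same input and key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize(value: str, replacements: Sequence[str], key: bytes) -> str:
    """Deterministically pick a replacement so equal inputs map to equal outputs."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(replacements)
    return replacements[index]


demo_key = b"demo-key-not-for-production"  # hypothetical key for illustration
fake_names = ["Alex Reed", "Sam Cole", "Kim Park", "Lee Ford"]

masked_ssn = redact("123-45-6789", keep_last=4)
hashed_email = hash_value("jane.doe@example.com", demo_key)
fake_name = pseudonymize("Jane Doe", fake_names, demo_key)
```

The hashing and pseudonymization helpers are deterministic by design: the same input always yields the same output, which is what allows masked values to stay consistent across different data sources.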
Search Context
Mask Context
File-Specific Search and Mask Contexts
In addition to the contexts mentioned above, there are file search contexts and file mask contexts, both of which are required when using the DarkShield-Files API. Because the documents accessed in a NoSQL database may contain files, the DarkShield-Files API should be used instead of the “Base” DarkShield API, which only handles text.
The file search context has a name attribute that identifies the context to use in search operations and a matchers array that contains a list of the matchers. Similarly, the file mask context has a name attribute that uniquely identifies the context when performing masking operations and a rules array that contains the list of rules used to mask the PII found during search operations.
Both the file search context and the file mask context allow for configuration options to be included. There are configuration options available based on the specific file format. The screenshot below displays the file mask context config object for JSON after setting the pretty print feature to true.
File Search Context
File Mask Context
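For readers without access to the screenshots, the four contexts can be sketched as Python dictionaries that would be serialized to JSON request bodies. The field names and values here ("type", "pattern", "expression", "ruleMatchers", "configs", "prettyPrint", and the hash_sha2 expression) are assumptions modeled on the public demos and are not verified against the API schema, so check them against the DarkShield API documentation before use. All context, matcher, and rule names are hypothetical.

```python
import json

# Hypothetical context names chosen for this sketch.
SEARCH_CONTEXT_NAME = "SearchContext"
MASK_CONTEXT_NAME = "MaskContext"
FILE_SEARCH_CONTEXT_NAME = "FileSearchContext"
FILE_MASK_CONTEXT_NAME = "FileMaskContext"

# Search context: which PII to look for (field names are assumptions).
search_context = {
    "name": SEARCH_CONTEXT_NAME,
    "matchers": [
        {
            "name": "EmailMatcher",
            "type": "pattern",
            "pattern": r"\b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b",
        }
    ],
}

# Mask context: which rule to apply to which matcher (assumed layout).
mask_context = {
    "name": MASK_CONTEXT_NAME,
    "rules": [
        {"name": "HashEmailRule", "type": "cosort",
         "expression": "hash_sha2(${EMAIL})"}
    ],
    "ruleMatchers": [
        {"name": "EmailRuleMatcher", "type": "name",
         "rule": "HashEmailRule", "pattern": "EmailMatcher"}
    ],
}

# File contexts refer to the contexts above by name (assumed layout).
file_search_context = {
    "name": FILE_SEARCH_CONTEXT_NAME,
    "matchers": [{"name": SEARCH_CONTEXT_NAME, "type": "searchContext"}],
}
file_mask_context = {
    "name": FILE_MASK_CONTEXT_NAME,
    "rules": [{"name": MASK_CONTEXT_NAME, "type": "maskContext"}],
    "configs": {"json": {"prettyPrint": True}},  # assumed option name
}

# One JSON request body per create endpoint.
request_bodies = {
    "searchContext.create": json.dumps(search_context),
    "maskContext.create": json.dumps(mask_context),
    "fileSearchContext.create": json.dumps(file_search_context),
    "fileMaskContext.create": json.dumps(file_mask_context),
}
```

The point of the layering is that the file contexts carry only file-format options and references by name, so the same search and mask contexts can be reused across different file types.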
Calls to the DarkShield-Files API
In the IRI darkshield-api-demos repository on GitHub, most of the projects use Python “glue code” to access the data source and feed the data through the DarkShield-Files API.
In the Couchbase demo, the Python program uses the Couchbase Python client to connect to a Couchbase server programmatically and manipulate the data. Similarly, the Redis demo uses a Python interface to interact with the Redis key-value store. Each Python library provides its own methods for connecting to and accessing its database.
In contrast, the Solr demo uses HTTP requests to communicate with Solr and lets the user supply parameters such as the host, port, target collection, query, and row limit. That said, Solr also has Python client libraries that would allow an implementation similar to the Couchbase and Redis demos.
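As a sketch of how those parameters might be turned into a Solr request, the Python function below builds a URL for Solr's standard select handler using only the standard library. The function name and the "customers" collection are hypothetical and are not taken from the demo; 8983 is Solr's default port.

```python
from urllib.parse import urlencode


def solr_select_url(host: str, port: int, collection: str,
                    query: str = "*:*", rows: int = 10) -> str:
    """Build a URL for Solr's /select handler with a query and a row limit."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"http://{host}:{port}/solr/{collection}/select?{params}"


# Hypothetical collection name, default Solr port.
url = solr_select_url("localhost", 8983, "customers", query="*:*", rows=5)
```

The resulting URL can be fetched with any HTTP client, and the JSON response passed on for searching and masking.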
At least nine API calls need to be made to the DarkShield-Files API in the following order, to the:
- searchContext.create endpoint to create a search context
- maskContext.create endpoint to create a mask context
- fileSearchContext.create endpoint to create a file search context
- fileMaskContext.create endpoint to create a file mask context
- fileSearchContext.mask endpoint to use the active search and mask contexts to find and mask any PII in a file (can be executed multiple times)
- fileMaskContext.destroy endpoint to destroy the current active file mask context
- fileSearchContext.destroy endpoint to destroy the current active file search context
- maskContext.destroy endpoint to destroy the current active mask context
- searchContext.destroy endpoint to destroy the current active search context
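The ordering above (create four contexts, mask one or more files, then destroy the contexts in reverse) can be sketched as follows. This is a minimal illustration rather than the demos' actual glue code: the send callable is a hypothetical transport that a real program would implement with an HTTP client, and the payload handling is simplified.

```python
from typing import Any, Callable, Dict, Iterable, List

# Endpoint names as listed above; how they map to URLs depends on the deployment.
CREATE_ORDER = [
    "searchContext.create",
    "maskContext.create",
    "fileSearchContext.create",
    "fileMaskContext.create",
]
MASK_ENDPOINT = "fileSearchContext.mask"

# Hypothetical transport signature: (endpoint name, payload) -> None.
Send = Callable[[str, Any], None]


def run_masking_job(payloads: Dict[str, Any], files: Iterable[Any], send: Send) -> None:
    """Create the contexts, mask each file, then destroy in reverse order."""
    created: List[str] = []
    try:
        for endpoint in CREATE_ORDER:
            send(endpoint, payloads[endpoint])
            created.append(endpoint)
        for file_data in files:
            send(MASK_ENDPOINT, file_data)  # may be called many times
    finally:
        # Destroy only what was created, last-created first. A real call
        # would also identify the context to destroy; None is a placeholder.
        for endpoint in reversed(created):
            send(endpoint.replace(".create", ".destroy"), None)


# Dry run with a recording transport instead of real HTTP requests.
recorded: List[str] = []
demo_payloads = {endpoint: {"name": endpoint} for endpoint in CREATE_ORDER}
run_masking_job(demo_payloads, ["a.json", "b.json"],
                lambda endpoint, payload: recorded.append(endpoint))
```

Putting the destroy calls in a finally block means the contexts are cleaned up even if a masking call fails partway through a batch.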
In the darkshield-api-demos located on GitHub, the API calls to these DarkShield-Files endpoints are made through Python glue code, but the same can be demonstrated using other methods like Postman, curl commands, etc. Here is a link to a simple demo on how to make curl commands to the DarkShield-Files API to perform file search and mask operations on a text file.
Before and After Masking Operations on Couchbase, Redis, and Solr
Below is a list of links to DarkShield demos on GitHub along with before and after screenshots of performing masking operations on Couchbase, Redis, and Solr.
Couchbase
Original documents in a Couchbase collection
The same documents after DarkShield masking
Redis
Customer fields with unprotected PII values in Redis
Customer fields masked and output written to JSON files
Solr
Original collection in Solr
DarkShield-masked collection in Solr
In Closing
Because it is programmable, the DarkShield-Files API has the flexibility to access and mask sensitive data in a variety of data sources. This article covered the Couchbase, Redis, and Solr databases and demonstrated how the DarkShield-Files API handles search and masking operations in each of them.
With the DarkShield-Files API, client-side encryption can be implemented on these NoSQL databases. If you have any questions about finding or masking PII in NoSQL databases, please contact darkshield@iri.com.
[1] “Manage Connections and Disks.” Couchbase Docs, https://docs.couchbase.com/server/current/manage/manage-security/manage-connections-and-disks.html#securing-on-disk-data
[2] “Redis.” AWS, https://aws.amazon.com/redis/
[3] “Redis Security.” Redis, https://redis.io/topics/security