
Choosing a Data Masking Tool
A data masking tool can be an effective, if not essential, way to protect sensitive information stored in files and databases. As the number of data breaches and data privacy laws continues to grow, companies must find ways to anonymize production and test data without sacrificing its utility for analytics, research, marketing, or development.
Since IRI introduced data masking for files off the mainframe in 2007 (first in CoSort, then in FieldShield, released in 2011), many other companies have entered the market and now compete in static and dynamic data masking contexts across structured, semi-structured, and unstructured sources. In this article, we break down the key considerations and recommendations for selecting a data masking tool that aligns with your organization’s data-centric security needs.
Understanding Requirements
Before diving into the market of data masking solutions, it is crucial to understand your organization’s specific goals: what data you have, why it has to be masked, how it should be masked, and in what operational context. Data masking involves replacing sensitive information with realistic but fictional data, and the requirements can differ widely based on usage. Ask yourself:
- What type of data needs masking? Consider whether you have structured data (relational databases and spreadsheets), semi-structured data (NoSQL databases and EDI files), or unstructured data (raw text, documents, images, etc.). Also, where is this data located (on-premise or in the cloud, and in what silos)? Who owns it, and where does the masked data need to go?
- How should the data be masked? Knowing the use of the masked data and the applicable business logic or privacy law requirements usually informs which masking functions should be applied to which types (or classes) of data. For example, do you need to de-identify PHI in an irreversible way (e.g., through redaction), or generalize it (e.g., through binning) to balance anonymity with utility? Or do you need a deterministic masking function that preserves consistency for referential integrity in a test database, and/or a realistic masking method like pseudonymization or format-preserving encryption, which raises further issues such as reversibility and key management? (A deterministic pseudonymization sketch follows this list.)
- When is the source data updated? Some data sets are static while others refresh regularly or need to be masked on the fly. The tool you choose should be able to mask data in a framework that makes sense for your operational or DevOps pipeline(s).
- Are there any specific compliance requirements? The tool should help you adhere to legal standards like GDPR, HIPAA, or other industry-specific guidelines. This means not only having the required masking functions, but also supporting Data Subject Access Requests (DSARs) such as the right to erasure, which requires the tool to find the applicable data first.
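To make the consistency point concrete, here is a minimal sketch of deterministic pseudonymization in Python. The key, surname pool, and function name are illustrative assumptions, not any particular product’s method; the idea is simply that the same input always masks to the same stand-in, so joins across masked tables still line up.

```python
# Minimal sketch of deterministic pseudonymization: identical inputs always
# yield the same fictitious-looking output, preserving referential integrity.
import hmac, hashlib

SECRET_KEY = b"replace-with-a-managed-key"                  # key management is a real concern
SURNAMES = ["Garcia", "Chen", "Okafor", "Novak", "Silva"]   # illustrative replacement pool

def pseudonymize(value: str) -> str:
    """Map a real value to a consistent, realistic stand-in."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()
    return SURNAMES[digest[0] % len(SURNAMES)]

# The same name masks identically wherever it appears, with the same key.
print(pseudonymize("Smith"))   # e.g. 'Novak'
print(pseudonymize("Smith"))   # same output every run
```

A real tool would draw from much larger replacement sets (or use format-preserving encryption) to avoid collisions, but the deterministic keyed lookup is the essential behavior to test for.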
Understanding these factors will help you identify the data masking tool features that matter most to you. It will also provide a framework for comparing the various products on the market, so your solution not only masks data correctly but also integrates seamlessly into your operations.
Data Masking Tool Evaluation Criteria
Once you define the requirements for data masking, the next step is to review tools on the market for their features and functions, ergonomics and efficiency, and short/long-term costs.
1. Data Discovery Capabilities
Before you can mask sensitive data, you need to find it. Look for tools that offer robust data discovery features, including pattern matching, metadata analysis, signature detection, and support for both structured and unstructured sources. The ability to scan databases, legacy files, PDF and Microsoft documents, images, and raw text or log files is essential for full coverage.
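As a rough illustration of pattern-based discovery, the sketch below scans raw text for a couple of common PII formats. The regexes and class labels are assumptions for demonstration only; production discovery typically layers on metadata analysis, lookup dictionaries, and fuzzy or named-entity matching for fuller coverage.

```python
# Minimal sketch of regex-based PII discovery over raw text.
import re

PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # US SSN format (assumed)
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),   # simple email pattern
}

def discover(text: str):
    """Return (class, match, offset) tuples for each suspected PII hit."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.group(), m.start()))
    return hits

sample = "Contact jdoe@example.com; SSN on file: 123-45-6789."
print(discover(sample))
```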
2. Masking Techniques and Flexibility
Different use cases require different masking techniques. A good tool should offer a range of methods, such as:
- Static, deterministic masking for non-production environments that preserves realism and referential integrity
- Dynamic masking for real-time applications, streaming data, and CDC
- Format-preserving scrambling or encryption to keep data structures intact (sketched after this list)
- Pseudonymization, blurring, and other anonymization options, where needed
Whichever methods you use, the tool should also be able to apply them deterministically when consistency is needed across systems or tables.
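Here is a minimal sketch of the format-preserving idea, assuming a simple seeded character substitution (digits to digits, letters to letters) as a stand-in for the keyed format-preserving encryption algorithms real tools use; lengths, delimiters, and data types survive the mask.

```python
# Minimal sketch of format-preserving scrambling via character-class substitution.
import random
import string

def build_translation(seed: int):
    """Build a repeatable substitution table: digits map to digits, letters to letters."""
    rng = random.Random(seed)                        # fixed seed => same mapping every run
    digits, letters = list(string.digits), list(string.ascii_uppercase)
    d_shuf, l_shuf = digits[:], letters[:]
    rng.shuffle(d_shuf)
    rng.shuffle(l_shuf)
    return str.maketrans("".join(digits + letters), "".join(d_shuf + l_shuf))

TABLE = build_translation(seed=42)

def scramble(value: str) -> str:
    """Mask while keeping length, delimiters, and character classes intact."""
    return value.upper().translate(TABLE)

print(scramble("4111-1111-1111-1111"))   # still reads as a 16-digit card number
print(scramble("TX-99821"))              # still two letters, a dash, and five digits
```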
3. Scalability and Performance
Whether you’re masking a few tables, big files or an entire data lake, performance matters. Test how well the solution handles large datasets, parallel processing, and different source formats. Lightweight tools may work well for small datasets but struggle at scale.
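One practical way to gauge this before committing is to time a candidate’s masking routine over a large synthetic batch and extrapolate to your production volumes. The sketch below uses a placeholder masking function and row count purely for illustration.

```python
# Rough throughput check: time a masking routine over a synthetic batch.
import time

def mask(value: str) -> str:
    return "X" * len(value)                 # stand-in for the tool's actual masking call

rows = [f"record-{i:09d}" for i in range(1_000_000)]   # synthetic test data

start = time.perf_counter()
masked = [mask(r) for r in rows]
elapsed = time.perf_counter() - start
print(f"{len(rows) / elapsed:,.0f} rows/sec on this hardware")
```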
4. Integration with Existing Systems
The ideal tool should integrate easily with your current data stack—whether that’s relational databases, Hadoop, cloud platforms, or legacy file systems. Bonus points for support across multiple data sources and the ability to automate masking workflows through scripting or APIs.
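For example, a nightly pipeline might launch a masking job from a script. The sketch below assumes a hypothetical command-line interface (masktool run job.spec); substitute your tool’s actual CLI or API calls.

```python
# Sketch of scripted automation around a hypothetical masking CLI.
import subprocess
import sys

def run_masking_job(spec_path: str) -> None:
    result = subprocess.run(
        ["masktool", "run", spec_path],      # hypothetical command and arguments
        capture_output=True, text=True
    )
    if result.returncode != 0:
        sys.exit(f"Masking job failed: {result.stderr.strip()}")
    print(result.stdout)

if __name__ == "__main__":
    run_masking_job("jobs/mask_customers.spec")   # example job specification path
```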
5. Compliance and Audit Readiness
Look for an on-premise or self-hosted tool that does not require data to leave your firewall or regional domain. Consider the tool’s search and job audit trails, re-ID risk scoring functionality, and role-based access controls. These features are vital for demonstrating compliance during audits and for enforcing policies around who can view data or change masking rules.
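As a rough illustration of what a job audit trail captures, the sketch below appends one JSON record per masking run: who ran which rules against which source, and when. The field names and log path are assumptions, not any specific tool’s schema.

```python
# Minimal sketch of an append-only masking-job audit trail.
import json
from datetime import datetime, timezone

def write_audit_record(user: str, source: str, rules: list[str],
                       path: str = "mask_audit.log") -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "source": source,
        "rules_applied": rules,
    }
    with open(path, "a", encoding="utf-8") as log:   # append-only for auditability
        log.write(json.dumps(record) + "\n")

write_audit_record("j.smith", "HR.EMPLOYEES", ["SSN->redact", "NAME->pseudonym"])
```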
6. Affordability
Finding the best technical solution to your masking requirements will not do you any good if your organization cannot fund it. It is therefore important to also consider what the licensing, support, customization, and upgrade costs will be, and whether they align with your budget and return-on-investment requirements.
The Final Decision
Choosing a data masking tool ultimately comes down to prioritizing and balancing the above evaluation criteria so you can be confident the tool will meet both current and future needs. Most sites develop a shortlist of tools from their own criteria rankings and request trial versions or demos.
In many cases, vendors offer pilot programs that allow you to mask a sample dataset. This proof-of-concept (POC) phase will give you insight into the tool’s speed and accuracy, and the practical implications of deploying it with your mix of data, hardware, and users.
You can also assess the impact each tool may have on future rollout items like installation, configuration, tuning, and auditing. Gather feedback from those who would use the tool, as well as those who would be affected by its results, costs, and total mix of features (whether limiting or extensible).
Consider as well the vendor’s experience in the data masking and data management industry, their product roadmap, and how dedicated they are to problem solving and ongoing improvement. Look for a company that not only stands behind the tool, but also continues to develop it and can customize it to meet specific objectives.
Also remember that planned updates or enhancements may not seem relevant to your immediate requirements, but they should give you insight into how the tool provider adapts to evolving market needs beyond yours. This may pay dividends later when you need to protect more data sources, follow new privacy regulations, or address new data security threats.
As a leader in data masking since 2007, IRI is one such vendor committed to the goals in this article. If you would like to exchange information or initiate a POC, contact info@iri.com or learn more at https://www.iri.com/solutions/data-masking.