Masking PHI in DICOM Files with DarkShield (API)
Abstract: With hundreds of thousands of medical imaging devices in use, DICOM is one of the most widely deployed healthcare messaging standards in the world; billions of DICOM images are currently in use for clinical care. This article describes the search and de-identification of Protected Health Information (PHI) in DICOM file metadata and imagery using the IRI DarkShield data masking tool, and its Remote Procedure Call (RPC) API for files in particular. Note that DarkShield can also perform the same data discovery and de-identification of PHI for DICOM metadata and burned-in pixels from the DarkShield GUI or CLI.
Introduction to DICOM
DICOM, or Digital Imaging and Communications in Medicine, is a standard for the communication and management of medical imaging information and related data. DICOM is implemented in almost every radiology, cardiology imaging, and radiotherapy device (X-ray, CT, MRI, PET, Ultrasound, etc.), and increasingly in devices in other medical domains such as ophthalmology and dentistry.
DICOM defines individual files (typically with a .dcm file extension) whose binary structure consists of a header followed by a data set, which is a list of attributes. The attributes carry information about the scan, such as the patient name, and the final attribute holds the actual pixel data (imagery) of the scan.
DICOM also defines a directory structure that organizes scans by patient, study, and series. Such a directory may also contain metadata in CSV format at its root.
DICOM files cannot be easily edited with a text editor nor viewed with a typical image viewer due to their unique binary structure.
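To make the binary header concrete, here is a minimal sketch of how a program might recognize a DICOM file before doing anything else with it. It relies on the standard file layout, in which a 128-byte preamble is followed by the four-byte prefix "DICM". The function name looks_like_dicom is hypothetical, and some legacy files omit the preamble, so treat this as a quick check rather than a full validator.

```python
import io

DICOM_PREAMBLE_LENGTH = 128  # bytes reserved ahead of the magic prefix
DICOM_MAGIC = b"DICM"        # four-byte prefix that follows the preamble


def looks_like_dicom(stream) -> bool:
    """Return True if a binary stream starts with a preamble plus 'DICM'."""
    header = stream.read(DICOM_PREAMBLE_LENGTH + len(DICOM_MAGIC))
    if len(header) < DICOM_PREAMBLE_LENGTH + len(DICOM_MAGIC):
        return False  # too short to hold a complete header
    return header[DICOM_PREAMBLE_LENGTH:] == DICOM_MAGIC


# In-memory stand-ins; a real caller would open the file in "rb" mode.
dicom_like = io.BytesIO(b"\x00" * 128 + b"DICM" + b"data set bytes")
plain_text = io.BytesIO(b"patient,study,series\n")
print(looks_like_dicom(dicom_like), looks_like_dicom(plain_text))
```

A check like this lets a calling program decide which files in a directory to treat as DICOM even when the .dcm extension is missing.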
Every attribute in a DICOM file can be identified by a tag. Attributes carry information pertaining to the scan, such as patient name, date of birth, and hospital name, and the pixel data of the scan is the final attribute in the sequence. Some attributes are optional.
A DICOM viewing program will typically display the pixel data as an image with the other attributes overlaid on it. Those attributes are part of the DICOM file, but they are stored separately from the pixel data. DICOM files can also have burned-in text, which is embedded in the pixel data itself.
Masking Sensitive DICOM Data with the DarkShield API
The DarkShield API for files now offers a solution for searching and masking sensitive attributes in a DICOM file. It builds on the file handling capabilities already in the API, including support for HL7 and X12 EDI files.
Pixel data is just one of many attributes that may be contained in a DICOM file, and is separate from other attributes that may contain key- or quasi-identifiers such as patient name, date of birth, and hospital. DarkShield will search through all attributes that are not a part of the pixel data.
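The separation between pixel data and the other attributes can be illustrated with a small sketch. It models a data set as a plain dictionary keyed by (group, element) tags and sets aside the Pixel Data attribute, tag (7FE0,0010), so that only the remaining attributes are searched. The dictionary model and the function names are assumptions made for illustration; this is not how DarkShield represents a file internally.

```python
PIXEL_DATA_TAG = (0x7FE0, 0x0010)  # standard tag for the Pixel Data attribute


def format_tag(tag):
    """Render a (group, element) pair in the usual (GGGG,EEEE) notation."""
    group, element = tag
    return f"({group:04X},{element:04X})"


def searchable_attributes(attributes):
    """Return every attribute except the pixel data."""
    return {tag: value for tag, value in attributes.items()
            if tag != PIXEL_DATA_TAG}


# Hypothetical data set with made-up values.
example = {
    (0x0010, 0x0010): "Doe^Jane",          # Patient's Name
    (0x0010, 0x0030): "19700101",          # Patient's Birth Date
    (0x0008, 0x0080): "Example Hospital",  # Institution Name
    PIXEL_DATA_TAG: b"\x00\x01\x02\x03",   # image bytes, not searched as text
}

for tag, value in searchable_attributes(example).items():
    print(format_tag(tag), value)
```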
A DICOM directory may contain CSV metadata at the root of the directory, as shown in the image below. A calling program can traverse this directory and send the CSV metadata as well as the DICOM files to the DarkShield Files API.
Original CSV metadata file with information about each study.
Additionally, a series of black boxes may be specified in a file mask context to redact known portions of pixel data that may have sensitive information in burned-in text. The height, width, and X and Y coordinates of each black box can be specified in the configuration or automatically discovered.
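The effect of a black box on pixel data can be sketched in a few lines. The example below treats a grayscale image as a row-major list of lists and fills one rectangle, given its X and Y coordinates, width, and height, with zeros. The function apply_black_box and its parameter names are hypothetical and only mirror the four values described above; the actual file mask context syntax is defined by the DarkShield documentation, not by this sketch.

```python
def apply_black_box(pixels, x, y, width, height, fill=0):
    """Return a copy of a row-major grayscale image with one rectangle filled.

    (x, y) is the top-left corner of the box. Parts of the box that fall
    outside the image are ignored.
    """
    redacted = [row[:] for row in pixels]  # leave the original untouched
    for row_index in range(max(y, 0), min(y + height, len(redacted))):
        row = redacted[row_index]
        for col_index in range(max(x, 0), min(x + width, len(row))):
            row[col_index] = fill
    return redacted


# A 6x4 all-white image with a 3x2 box placed at column 1, row 0.
image = [[255] * 6 for _ in range(4)]
boxed = apply_black_box(image, x=1, y=0, width=3, height=2)
for row in boxed:
    print(row)
```

Because the original pixel values inside the box are overwritten rather than hidden, the burned-in text cannot be recovered from the masked copy.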
Example of DarkShield Masking
A demo program available on GitHub demonstrates how the contents of a DICOM directory could be masked using the DarkShield API. In this program, a whole directory is traversed to have the CSV metadata, folder names, and individual DICOM files searched and masked. The resulting masked DICOM directory is written to a separate folder.
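The traversal described above might look roughly like the following sketch. It is not the demo program itself: mask_directory, mask_bytes, and mask_name are hypothetical names, and the two callables stand in for whatever code sends file contents and folder names to the DarkShield API. No API endpoints or payloads are assumed here.

```python
import tempfile
from pathlib import Path


def mask_directory(source_root, target_root, mask_bytes, mask_name):
    """Copy a directory tree, masking folder names, CSV files, and DICOM files.

    mask_bytes(file_name, data) returns masked file contents and
    mask_name(folder_name) returns a masked folder name. Both are
    placeholders for calls to a masking service.
    """
    source_root, target_root = Path(source_root), Path(target_root)
    written = []
    for path in sorted(source_root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(source_root)
        masked_folders = [mask_name(part) for part in relative.parts[:-1]]
        destination = target_root.joinpath(*masked_folders, relative.name)
        destination.parent.mkdir(parents=True, exist_ok=True)
        data = path.read_bytes()
        if path.suffix.lower() in {".dcm", ".csv"}:
            data = mask_bytes(path.name, data)
        destination.write_bytes(data)
        written.append(destination)
    return written


# Tiny demonstration with stand-in masking functions.
with tempfile.TemporaryDirectory() as tmp:
    source, target = Path(tmp, "original"), Path(tmp, "masked")
    (source / "DOE_JANE").mkdir(parents=True)
    (source / "metadata.csv").write_bytes(b"name\nJane")
    (source / "DOE_JANE" / "scan.dcm").write_bytes(b"abc")
    masked = mask_directory(
        source, target,
        mask_bytes=lambda name, data: data.upper(),  # stand-in for an API call
        mask_name=lambda part: "X" * len(part),      # stand-in for an API call
    )
    demo_paths = sorted(p.relative_to(target).as_posix() for p in masked)
    demo_csv = (target / "metadata.csv").read_bytes()
print(demo_paths)
```

Writing to a separate target folder, as the demo program does, keeps the original directory intact for comparison.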
The image shown below is one of the original DICOM files within the unmasked DICOM directory:
Original DICOM file with burned-in text contained in the pixel data, displayed in a DICOM viewer.
These DICOM files, along with others in the directory structure, the metadata in the root of the DICOM directory, and the folder names will be searched and masked. All sensitive information defined by search matchers and masking rules is treated consistently. Each search matcher is associated with a type of data, and each data type is masked based on the masking rules associated with each matcher.
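The relationship between search matchers and masking rules can be sketched as two lookup tables: one maps a matcher name to a pattern, and the other maps the same name to a masking function. Everything in this example is an assumption made for illustration, including the matcher names, the regular expressions, and the key. The keyed-hash pseudonym is a swapped-in technique chosen because it is easy to show with the standard library; it is not DarkShield's masking function.

```python
import hashlib
import hmac
import re

SECRET_KEY = b"demo-key-change-me"  # hypothetical key, for illustration only


def pseudonymize(value):
    """Deterministic keyed hash: the same input always yields the same output."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return "ID_" + digest.hexdigest()[:10]


def redact(value):
    """Replace every character with an asterisk, preserving length."""
    return "*" * len(value)


# Each matcher finds one type of data (hypothetical names and patterns).
MATCHERS = {
    "NameMatcher": re.compile(r"\b[A-Z][a-z]+\^[A-Z][a-z]+\b"),
    "DateMatcher": re.compile(r"\b(?:19|20)\d{6}\b"),
}
# Each matcher name is associated with one masking rule.
RULES = {"NameMatcher": pseudonymize, "DateMatcher": redact}


def mask_text(text):
    """Apply every matcher's rule to the text, one matcher at a time."""
    for name, pattern in MATCHERS.items():
        rule = RULES[name]
        text = pattern.sub(lambda match, rule=rule: rule(match.group(0)), text)
    return text


print(mask_text("Doe^Jane born 19700101"))
print(mask_text("folder for Doe^Jane"))
```

Because the name rule is deterministic, a patient name receives the same replacement whether it appears in a DICOM attribute, the CSV metadata, or a folder name, which is what keeps the masked directory internally consistent.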
The following image shows the same file in a DICOM viewer after the PHI attributes have been found and masked. In this case, a black box was applied to an area of the pixel data to redact specific portions of the burned-in text, permanently obfuscating the patient’s identity:
The CSV metadata in the directory was searched through and masked as well. The results are shown below.
Masked CSV metadata file — sensitive information is de-identified with format-preserving encryption
Auditing the Search and Masking Results
The DarkShield API returns search annotations as a part of its response when one of the search endpoints is called. Additionally, results of masking are returned as a part of the response to masking endpoints. The masking results and search annotations are in a friendly JSON format that can be aggregated in BI tools to gain insight from the contents.
For example, the sensitive data matches that were found, and the search matchers that found them, can be aggregated and displayed visually. Visualization is an intuitive way to gain insights from these results, especially when they are large.
Visualization of JSON results in Microsoft Power BI, returned as a part of the response from the DarkShield API when sending text or a file to mask. This image shows the most common sensitive values found, and the associated search matchers that were used to find them.
Inspecting visualizations of annotations and masked results can help confirm that the results are as expected. Other information in the search annotations and masked results includes the file name, any failed masking results, and where within the file the sensitive data was found.
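The same kind of aggregation can be done in a few lines of code before the data reaches a BI tool. The sketch below counts matches per search matcher and per matched value. The JSON shape and the field names matcher, match, and file are assumptions made for this example; the actual structure of the API response should be taken from the DarkShield documentation.

```python
import json
from collections import Counter

# Hypothetical results in an assumed shape, not the real API response.
SAMPLE_RESULTS = json.loads("""
[
  {"matcher": "NameMatcher", "match": "Doe^Jane", "file": "series1/scan1.dcm"},
  {"matcher": "NameMatcher", "match": "Doe^Jane", "file": "series1/scan2.dcm"},
  {"matcher": "DateMatcher", "match": "19700101", "file": "series1/scan1.dcm"}
]
""")


def count_by(results, field):
    """Count how often each value of one field appears in the results."""
    return Counter(item[field] for item in results)


matches_per_matcher = count_by(SAMPLE_RESULTS, "matcher")
most_common_values = count_by(SAMPLE_RESULTS, "match").most_common(2)

print(matches_per_matcher)
print(most_common_values)
```

Counts like these correspond to the two views in the chart: the most common sensitive values and the matchers that found them.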
Conclusion
DICOM is a key standard for storing personally identifying data and images in the healthcare industry, but de-identifying DICOM files can be difficult because of their complexity. Nevertheless, data privacy laws such as those promulgated under HIPAA must be followed to de-identify or anonymize the PHI held in the data sources and silos you control, including DICOM files.
PHI data breaches have resulted in fines and tarnished reputations. DarkShield offers effective, easily integrated solutions that address both the technical challenges and the regulatory requirements.