Date Format Masking
Note: This article, first published in 2014, refers to date reformatting technology in the SortCL data manipulation program central to the IRI CoSort data transformation product and larger IRI Voracity data management platform. The method below remains supported, but has also since been updated in 2017 through Boost library support to improve date calculation and reformatting. See this article for more details.
Note also: If you are looking to mask date values for data anonymization and de-identification purposes, this is performed through /FIELD data masking functions like blurring (random noise), and you should see this article instead. These functions also work in SortCL, and Voracity-including IRI FieldShield jobs, and can be coded to preserve consistent date intervals in HIPAA scenarios like admit-discharge.
It is rare that you would find an application or database that does not rely on some sort of date information. Transactions are often time/date stamped and many data entry records are comprised of information, e.g., birth dates, death dates, start and finish dates.
Date format masks provide a structure which uses specific characters and symbols that define the format of a field. The structures can be applied to both input and output data. In the input, a field must follow the specific pattern defined by the mask. In the output, the mask defines the format for converting from an input mask.
For example, if a database expects a birth date in the format where the components are a two-digit month, two-digit day, and a four-digit year (MM/DD/YYYY), and October 4, 1975 is entered, the user will be unable to save the data until the correct format is entered, which would be 10/04/1975. The exact behavior depends on how the database administrator or data architect has defined the mask.
Input masks are used for a number of entry fields, but are most commonly used for numbers (phone numbers, social security numbers, model/parts numbers, etc.) and dates. A date is a more complex data type than a string or a number. It has multiple parts (year, month, day, hour, and so on), and there are many rules for determining valid values and calculations, e.g., leap days and years, daylight saving time changes, national and company holidays, and date ranges.
For date format transformation through the IRI CoSort utility — or engine in the IRI Voracity data management (ETL) platform — we developed composite field syntax for their common SortCL data definition and manipulation program that defines and establish any date format needed. The data format mask can be expressed through a /TEMPLATE definition containing a mix of data elements and literals and used in data mapping.
Indeed, these elements can go onto be used in individual /FIELD definitions in other sections of the job (script) for manipulation and mapping purposes. When a composite field is read, each data item is extracted and assigned to an internal field. On output, a composite value is constructed from internal fields, expressions, and literals. On record input and output, date and time values are valuated and impossible combinations cause an error message.
Consider the input numeric composite dates, numeric_dates:
5 03/29/13 4 12/02/71 2 10/12/10
Use the following SortCL job script, comp_dates.ncl, to reformat the input dates as composites:
/TEMPLATE=(Dates_WMDY = "%u %m/%e/%y") # numeric Week Month/Day/Year(2) /TEMPLATE=(Dates_DMY = "%e-%m-%Y") # numeric Day-Month-Year(4) /TEMPLATE=(Dates_Spanish= "&A %d de %B de %Y") /INFILE=numeric_dates /FIELD=(In, POSITION=1, SIZE=10, STRUCTURE=Dates_WMDY(DofW,Month,Day,Year)) /REPORT /OUTFILE=out_form /FIELD=(Out, STRUCTURE=Dates_DMY(Day,Month,Year)) /OUTFILE=out_SP /LOCALE= spanish /FIELD= (Out, STRUCTURE=Dates_Spanish(DofW,Day,Month,Year))
The output of this script would result in the following:
out_form 29-03-2013 2-12-1971 12-10-2010 out_SP viernes 29 de marzo de 2013 jueves 02 de diciembre de 1971 martes 12 de octubre de 2010
The template syntax defines the parameters of the date format mask and each piece of data that is being read on input. The user can then choose whatever date format is required for the output. The user can even produce an output in multiple languages if necessary.
IRI supports other input masks* particularly for numeric formatting. Refer to section 11.12 in the SortCL program chapter of the CoSort v9.5.3 user manual for more information. See the tech tip in this newsletter for an example of the implied decimal attribute used in this context.
*Generally, data format (or input) masks facilitate data validation and prevent users from entering invalid data (such as a phone number in a date field). Such masks can also help ensure that users enter data in a consistent way. That consistency simplifies data identification and database maintenance. Note that masking the format of the data, in this case, differs from data masking in the data privacy/security context.