XLS2DDF: IRI’s Spreadsheet Metadata Conversion Utility
This article is the third of four articles discussing newly added support for Excel files in the core IRI program for structured data definition and manipulation, SortCL. We are picking up from the prior article, demonstrating typical SortCL-compatible job scripts that read, process, and/or write data in Microsoft spreadsheets.
SortCL-driven programs include IRI CoSort transformation, NextForm migration, FieldShield masking, RowGen generation, and Voracity ETL jobs. For them to parse, process and format data in Excel, flat files and RDB tables, their column input and output layouts (along with various field-specific operations) must be written in SortCL /FIELD syntax.
These /FIELD statements are inside, or saved centrally and referenced in SortCL job scripts. In the case of Excel, where there are many relevant ranges and sheets, the process of manually converting hundreds of cell ranges and layouts into SortCL /FIELD statements could be overwhelming.
For this reason, IRI has supplemented its support for XLS/X data handling with a command-line metadata conversion utility called xls2ddf. This program automatically serializes SortCL script-ready /FIELD layouts in Data Definition Format (DDF) metadata files from XLS or XLSX spreadsheets.
xls2ddf can be run directly on the command line. It is also among other supported “2DDF”’ utilities in the IRI Workbench job design IDE, built on Eclipse, for graphical execution.
Below is an example of the tool run from Workbench. One of 8 source sheets (in this case) is shown in part of the top pane, the xls2ddf program running against it (and the others) is in the console view, and the converted /FIELD statements in the DDF file are shown on the bottom:
xls2ddf operation shown in IRI Workbench to facilitate Excel metadata conversion and use in the graphical job script design and management environment for SortCL-compatible programs.
Note how each SortCL /FIELD statement is named ordinarily, and referenced with the original spreadsheet definition (SDEF). SortCL uses the declared tab to internally delimit column values.
The -p option is the default, and repeats for each xls2ddf conversion run in bulk. The options for operating the utility en masse, or with more cell range granularity, are discussed below.
Bulk Metadata Conversion
As in the case above, you can use the utility to create multiple DDF files from multiple sheets at once. Open the Import File Metadata wizard — accessible from the metadata menu under the blue IRI swirl icon in Workbench. Just select the folder and/or subfolders that contain the XLS or XLSX files (which you can filter), and check (or uncheck) those that you want to convert (or not).
When you click Finish, xls2ddf will be called for each file selected, producing several DDF files. Those will be written automatically into the metadata folder for the IRI project you specified or have highlighted (and thus active) in the Project Explorer.
Several additional options may be specified from the Advanced… button next to Source Format:
For example, a header option may be specified to get the field names from the first used row in the sheet. A default sheet name to use may be specified here as well. Note that if there is no match between the specific sheet name specified and any of the sheets in the file, the first sheet will be used by default.
There is also a manual option, which is best suited for picking a specific subset of the usable range from which to generate a DDF. For example, the usable range for a sheet may be A1:H1000, but the manual option may be used to generate a DDF that matches the bounds specified from other options.
To generate a ddf that matches the first sheet of range G25:H50 in a file called Book3.xlsx that has a used range broader than that, you could issue the command :
xls2ddf -m -r25 -l26 -b7 -c2 Book3.xlsx
Automatic Layout Detection
By default, xls2ddf will try to auto-detect the layout of the spreadsheet. This involves taking the used range, or range that contains any data, of the spreadsheet. A sheet name may be specified to take a DDF from a specific sheet.
By way of example, this command:
xls2ddf -p testdata.xlsx
automatically generates a DDF with a commented-out process line by looking in the XLSX file. Since no sheet name is specified, the first sheet in the workbook will be used. Given this sheet:
the output from the conversion would be this DDF file:
# /FILE=”'data'!A1:F1001;testdata.xlsx” # /PROCESS=XLSX /FIELD=(FIELD1, POSITION=1, SEPARATOR='\t',SDEF="=A") /FIELD=(FIELD2, POSITION=2, SEPARATOR='\t',SDEF="=B") /FIELD=(FIELD3, POSITION=3, SEPARATOR='\t',SDEF="=C") /FIELD=(FIELD4, POSITION=4, SEPARATOR='\t',SDEF="=D") /FIELD=(FIELD5, POSITION=5, SEPARATOR='\t',SDEF="=E") /FIELD=(FIELD6, POSITION=6, SEPARATOR='\t',SDEF="=F")
Given the same spreadsheet file, but in XLS format, the metadata conversion command is:
xls2ddf -p testdata.xls
which automatically generates this DDF from the XLS file (with a commented-out process line):
# /FILE=”'data'!A1:F1001;testdata.xls” # /PROCESS=XLS /FIELD=(FIELD1, POSITION=1, SEPARATOR='\t',SDEF="=A") /FIELD=(FIELD2, POSITION=2, SEPARATOR='\t',SDEF="=B") /FIELD=(FIELD3, POSITION=3, SEPARATOR='\t',SDEF="=C") /FIELD=(FIELD4, POSITION=4, SEPARATOR='\t',SDEF="=D") /FIELD=(FIELD5, POSITION=5, SEPARATOR='\t',SDEF="=E") /FIELD=(FIELD6, POSITION=6, SEPARATOR='\t',SDEF="=F")
Manual Range Conversions
This command cites a manually specified range and generates DDF fields based on that range.
xls2ddf -SfirstSheet -m -r1 -l4 -b3 -c2 testdata.xls
# /FILE=”’firstSheet’!C1:D4;testdata.xls” /FIELD=(FIELD1, POSITION=1, SEPARATOR='\t',SDEF="=C") /FIELD=(FIELD2, POSITION=2, SEPARATOR='\t',SDEF="=D") /FIELD=(FIELD3, POSITION=3, SEPARATOR='\t',SDEF="=E")
To add a /PROCESS and sheet name comment to the DDF, use the -p and -S flags; e.g.,
xls2ddf -m -p -Sdata -r1 -l4 -b3 -c2 testdata.xls
# /FILE=”'data'!C1:E5;testdata.xls” # /PROCESS=XLS /FIELD=(FIELD1, POSITION=1, SEPARATOR='\t',SDEF="=C") /FIELD=(FIELD2, POSITION=2, SEPARATOR='\t',SDEF="=D") /FIELD=(FIELD3, POSITION=3, SEPARATOR='\t',SDEF="=E")
When the -H (Header) argument is added to the xls2ddf command, SortCL field names will be taken from the header row. If one of the columns has a blank value in the header, a generic field name will be given instead.
Another option is to generate a full script. This option can be useful for testing that the generated DDF file matches the data that should be processed.
Full Script Example
Consider an XLSX file called exceltest1.xlsx that happens to have one sheet with 32 columns of data displaced significantly from the first row and column. This command:
xls2ddf -F -osampleOut1.xls exceltest1.xlsx test1.scl
will automatically use the full used range from the first sheet, since neither the manual option nor the sheet name are specified. The output file is specified as an XLS file, but note that xls2ddf also supports output to XLSX, delimited text files, XML, and JSON just by virtue of the file extension specified in the command line. The script will be saved in the file “test1.scl”.
The full script option generates a skeleton of a script, rather than a skeleton of the field layout. It is not a replacement for fully-featured script generation available in IRI Workbench; however, it may be useful for quickly verifying the field structure.
Field functions, filter statements, the script action section, etc. may be modified manually later to suit individual needs, or can be edited from the IRI Workbench using wizards if the script is imported into there. Here is the generated job script:
/INFILE="'Sheet1'!TXC360505:TYH364813;exceltest1.xlsx" /PROCESS=XLSX /FIELD=(FIELD1, POSITION=1, SEPARATOR="\t",SDEF="=TXC",TYPE=ASCII) /FIELD=(FIELD2, POSITION=2, SEPARATOR="\t",SDEF="=TXD",TYPE=ASCII) /FIELD=(FIELD3, POSITION=3, SEPARATOR="\t",SDEF="=TXE",TYPE=ASCII) /FIELD=(FIELD4, POSITION=4, SEPARATOR="\t",SDEF="=TXF",TYPE=ASCII) /FIELD=(FIELD5, POSITION=5, SEPARATOR="\t",SDEF="=TXG",TYPE=ASCII) /FIELD=(FIELD6, POSITION=6, SEPARATOR="\t",SDEF="=TXH",TYPE=ASCII) /FIELD=(FIELD7, POSITION=7, SEPARATOR="\t",SDEF="=TXI",TYPE=ASCII) /FIELD=(FIELD8, POSITION=8, SEPARATOR="\t",SDEF="=TXJ",TYPE=ASCII) /FIELD=(FIELD9, POSITION=9, SEPARATOR="\t",SDEF="=TXK",TYPE=ASCII) /FIELD=(FIELD10, POSITION=10, SEPARATOR="\t",SDEF="=TXL",TYPE=ASCII) /FIELD=(FIELD11, POSITION=11, SEPARATOR="\t",SDEF="=TXM",TYPE=ASCII) /FIELD=(FIELD12, POSITION=12, SEPARATOR="\t",SDEF="=TXN",TYPE=ASCII) /FIELD=(FIELD13, POSITION=13, SEPARATOR="\t",SDEF="=TXO",TYPE=ASCII) /FIELD=(FIELD14, POSITION=14, SEPARATOR="\t",SDEF="=TXP",TYPE=ASCII) /FIELD=(FIELD15, POSITION=15, SEPARATOR="\t",SDEF="=TXQ",TYPE=ASCII) /FIELD=(FIELD16, POSITION=16, SEPARATOR="\t",SDEF="=TXR",TYPE=ASCII) /FIELD=(FIELD17, POSITION=17, SEPARATOR="\t",SDEF="=TXS",TYPE=ASCII) /FIELD=(FIELD18, POSITION=18, SEPARATOR="\t",SDEF="=TXT",TYPE=ASCII) /FIELD=(FIELD19, POSITION=19, SEPARATOR="\t",SDEF="=TXU",TYPE=ASCII) /FIELD=(FIELD20, POSITION=20, SEPARATOR="\t",SDEF="=TXV",TYPE=ASCII) /FIELD=(FIELD21, POSITION=21, SEPARATOR="\t",SDEF="=TXW",TYPE=ASCII) /FIELD=(FIELD22, POSITION=22, SEPARATOR="\t",SDEF="=TXX",TYPE=ASCII) /FIELD=(FIELD23, POSITION=23, SEPARATOR="\t",SDEF="=TXY",TYPE=ASCII) /FIELD=(FIELD24, POSITION=24, SEPARATOR="\t",SDEF="=TXZ",TYPE=ASCII) /FIELD=(FIELD25, POSITION=25, SEPARATOR="\t",SDEF="=TYA",TYPE=ASCII) /FIELD=(FIELD26, POSITION=26, SEPARATOR="\t",SDEF="=TYB",TYPE=ASCII) /FIELD=(FIELD27, POSITION=27, SEPARATOR="\t",SDEF="=TYC",TYPE=ASCII) /FIELD=(FIELD28, POSITION=28, SEPARATOR="\t",SDEF="=TYD",TYPE=ASCII) /FIELD=(FIELD29, POSITION=29, SEPARATOR="\t",SDEF="=TYE",TYPE=ASCII) /FIELD=(FIELD30, POSITION=30, SEPARATOR="\t",SDEF="=TYF",TYPE=ASCII) /FIELD=(FIELD31, POSITION=31, SEPARATOR="\t",SDEF="=TYG",TYPE=ASCII) /FIELD=(FIELD32, POSITION=32, SEPARATOR="\t",SDEF="=TYH",TYPE=ASCII) /REPORT /OUTFILE="'Sheet1'!TXC360505:TYH364813;sampleOut1.xls" /PROCESS=XLS /FIELD=(FIELD1, POSITION=1, SEPARATOR="\t",SDEF="=TXC",TYPE=ASCII) … Same fields (2-31) shown above ... /FIELD=(FIELD32, POSITION=32, SEPARATOR="\t",SDEF="=TYH",TYPE=ASCII)
In conclusion, the dynamic generation of SortCL-compatible file structures, and even the job scripts that leverage that /FIELD metadata, is a big time saver. IRI created another command-line and Workbench-supported metadata conversion utility called xls2ddf to automate these steps to facilitate the processing of spreadsheet data in IRI Voracity platform software.
The fourth and final article will discuss support in IRI Workbench for generating SortCL and compatible scripts to transform, migrate, consolidate, mask, and cleanse — among other functions — data in Excel spreadsheets via an intuitive interface for metadata discovery.