SortCL-Compatible Excel Data Processing Examples
This article is a continuation of the previous article, which introduced IRI support for XLS and XLSX file formats in the SortCL program used for moving and manipulating spreadsheet data in both the IRI Voracity data management (ETL, data quality, etc.) platform, and its included fit-for-purpose products: IRI CoSort, NextForm, FieldShield and RowGen.
These job samples help demonstrate the breadth of what these metadata-compatible applications can now do with data directly to/from Excel sheets in either XLS or XLSX formats.
Sample Job Scripts
Note: To view XLS and XLSX files from within IRI Workbench, simply double click on the file in Project Explorer.
Example #1: Synthesizing Test Data
This script demonstrates the use of an IRI RowGen job script for generating a small amount of random test data to an XLSX file, and an XLS file with a header.
/INFILE=simple.in /INCOLLECT=5 /PROCESS=RANDOM /FIELD=(code,POSITION=1,SEPARATOR="\t",SIZE=5,TYPE=ALPHA_DIGIT) /FIELD=(value,POSITION=2,SEPARATOR="\t",SIZE=8,TYPE=WHOLE_NUMBER) /INCLUDE WHERE value >10 /REPORT /OUTFILE="'Sheet1'!A1:B5;randomnums.xlsx" /PROCESS=XLSX /FIELD=(code,POSITION=1,SEPARATOR="\t",TYPE=ALPHA_DIGIT,SDEF="=A") #format value field with a prepended string /FIELD=(valuef=format_strings("H2N3%s",value),POSITION=2,SIZE=12,SEPARATOR="\t",TYPE=ASCII,SDEF="=B") /OUTFILE="'Sheet1'!BB1:BC6,HEADER;randomnums.xls" /PROCESS=XLS /FIELD=(code,POSITION=1, SEPARATOR=",",SDEF="Code=BB") /FIELD=(value,POSITION=2,SEPARATOR=",",SDEF="Value=BC")
XLSX Output:
XLS Output:
Example #2: Synthesizing Random Data with encrypted fields
This hybrid RowGen-FieldSheld job demonstrates random selection of names and email domains from set files and the subsequent masking of those names with format-preserving AES256 encryption with an encrypted passphrase. An email address is generated using the first letter of the first name plus the last name. The output is to a specific range of an XLS sheet.
/INFILES=random_file_placeholder /PROCESS=RANDOM /INCOLLECT=100 /FIELD=(FIRSTNAME, TYPE=ALPHA_DIGIT, POSITION=1, SEPARATOR="\t", SET="C:/IRI/cosort100/sets/names/names_first.set" SELECT=ANY) /FIELD=(LASTNAME, TYPE=ALPHA_DIGIT, POSITION=2, SEPARATOR="\t", SET="C:/IRI/cosort100/sets/names/names_last.set" SELECT=ANY) /FIELD=(EMAIL, TYPE=ALPHA_DIGIT, POSITION=3, SEPARATOR="\t", SET="C:/IRI/cosort100/sets/free_email_domains.set" SELECT=ANY) /INREC /FIELD=(FIRSTNAME, TYPE=ALPHA_DIGIT, POSITION=1, SEPARATOR="\t") /FIELD=(LASTNAME, TYPE=ALPHA_DIGIT, POSITION=2, SEPARATOR="\t") /FIELD=(EMAIL, TYPE=ALPHA_DIGIT, POSITION=3, SEPARATOR="\t") /FIELD=(FIRSTINITIAL=sub_string(FIRSTNAME,1,1),TYPE=ALPHA_DIGIT, POSITION=4, SEPARATOR="\t") /REPORT /OUTFILE="'Sheet1'!B1:D100;PERINFO.xls" /PROCESS=XLS /FIELD=(FIRSTNAMEENC=enc_fp_aes256_alphanum(FIRSTNAME, "epass:mfvndoTjj8PnGCVCB9pU0Q=="), TYPE=ALPHA_DIGIT, POSITION=1, SEPARATOR="\t", SDEF="=B") /FIELD=(LASTNAMEENC=enc_fp_aes256_alphanum(LASTNAME, "epass:mfvndoTjj8PnGCVCB9pU0Q=="), TYPE=ALPHA_DIGIT, POSITION=2, SEPARATOR="\t", SDEF="=C") /FIELD=(EMAIL_FORMATTED= format_strings("%s%s@%s",FIRSTINITIAL,LASTNAME,EMAIL), TYPE=ALPHA_DIGIT, POSITION=3, SEPARATOR="\t", SDEF="=D")
Set files:
Output:
Example #3: Move flat, fixed data to a specific section of an Excel sheet
This NextForm data migration script reads a flat file with fixed position field data and maps up to 44 rows to an XLS sheet between the BA and BD column range. The header row of column names is derived from the first half of the SDEF attribute in the /OUTFILE field statement. It also demonstrates that the SDEF argument is not case sensitive.
# Three character columns are not supported by XLS- the maximum column value is IV # Header row value specified as the first part of the SDEF. After the equals sign is the column letter. /INFILE=chiefs /PROCESS=RECORD /FIELD=(name,POSITION=1,SIZE=27) /FIELD=(year,POSITION=28,SIZE=12) /FIELD=(party,POSITION=40,SIZE=5) /FIELD=(state,POSITION=45,SIZE=2) /REPORT /OUTFILE=”'Sheet1'!BA1:BD44,HEADER;chiefsAllXLSeasy.xls” /PROCESS=XLS /FIELD=(name,POSITION=1,SEPARATOR=”\t”,SDEF="NAME=BA") /FIELD=(year,POSITION=2, SEPARATOR=”\t”,SDef="YEAR=BB") /FIELD=(party,POSITION=3,SEPARATOR=”\t”,sdef="PARTY=BC") /FIELD=(state,POSITION=4,SEPARATOR=”\t”,sDEF="STATE=BD")
Input File:
Output Sheet:
Example #4: Sort data from specific range of XLS sheet, select only one field to actually be output
# CoSort SortCL Example Job # Extracts & Sorts Selected Sheet Columns, Outputs to Pipe /INFILE=”'Sheet1'!B1:E43;chiefsALLXLS.xls” /PROCESS=XLS # select only two input columns from a dataset that spans more than those columns. /FIELD=(year,POSITION=1,SEPARATOR=",",SDEF="=C") /FIELD=(party,POSITION=2,SEPARATOR=",",SDEF="=D") /SORT /KEY=(year) # sort by year field, but only output party field, with a maximum size of 5 characters. /OUTFILE=stdout /PROCESS=RECORD /FIELD=(party,POSITION=1,SIZE=5)
Result:
Example #5: Generates Sorted Test Data and Populates Excel in a “vertical” or inverted fashion.
/INFILES=random_file_placeholder /PROCESS=RANDOM /INCOLLECT=10 /FIELD=(FIELD1, TYPE=ALPHA_DIGIT, POSITION=1, SIZE=6, SEPARATOR=",") /FIELD=(FIELD2, TYPE=ASCII, POSITION=2, SIZE=7, SEPARATOR=",") /FIELD=(FIELD3, TYPE=UPPERCASE, POSITION=3, SIZE=5, SEPARATOR=",") /SORT /KEY=(FIELD1) /OUTFILE=”'Sheet1'!A1:K3,HEADER,V;rgen.xls” /PROCESS=XLS /FIELD=(FIELD1, TYPE=ALPHA_DIGIT, POSITION=1, SIZE=6, SEPARATOR=",",SDEF="FIELD1=1") /FIELD=(FIELD2, TYPE=ASCII, POSITION=2, SIZE=7, SEPARATOR=",",SDEF="FIELD2=2") /FIELD=(FIELD3, TYPE=UPPERCASE, POSITION=3, SIZE=5, SEPARATOR=",",SDEF="FIELD3=3")
Result:
Example #6: Takes PII from a text tab separated file, writes an encrypted/pseudonymized version to one .xlsx file, and an unmodified version to another .xlsx file. The epass, or encrypted password, obscures the clear text encryption passphrase in the script.
/INFILE=personal_info /FIELD=(credit_card,POSITION=1,SEPARATOR='\t') /FIELD=(driv_lic,POSITION=2,SEPARATOR='\t') /FIELD=(name,POSITION=3,SEPARATOR='\t') /REPORT /OUTFILE=”'Sheet1'!MMM1:MMO10;personal_info_encrypted.xlsx” /PROCESS=XLSX /FIELD=(credit_card1=enc_fp_aes256_alphanum(credit_card,"epass:Gg87jCTR15Jro4AGE44ENw=="),POSITION=1,SEPARATOR='\t',SDEF="=MMM") /FIELD=(driv_lic1=enc_fp_aes256_alphanum(driv_lic,"epass:Gg87jCTR15Jro4AGE44ENw=="),POSITION=2,SEPARATOR='\t',SDEF="=MMN") /FIELD=(name,POSITION=3,SEPARATOR='\t',SDEF="=MMO",SET=C:\IRI\cosort100\sets\names\names_first_last.set) /OUTFILE=”'Sheet1'!M1:O10;personal_info_unencrypted.xlsx” /PROCESS=XLSX /FIELD=(credit_card,POSITION=1,SEPARATOR='\t',SDEF="=M") /FIELD=(driv_lic,POSITION=2,SEPARATOR='\t',SDEF="=N") /FIELD=(name,POSITION=3,SEPARATOR='\t',SDEF="=O")
Input data (tab-separated text file):
9654-4338-8732-8128 W389-324-33-473-Q Jessica Steffani 2312-7218-4829-0111 H583-832-87-178-P Cody Blagg 8940-8391-9147-8291 E372-273-92-893-G Jacob Blagg 6438-8932-2284-6262 L556-731-91-842-J Just Rushlo 8291-7381-8291-7489 G803-389-53-934-J Maria Sheldon 7828-8391-7737-0822 K991-892-02-578-O Keenan Ross 7834-5445-7823-7843 F894-895-10-215-N Francesca Leonie 8383-9745-1230-4820 M352-811-49-765-N Nadia Elyse 3129-3648-3589-0848 S891-915-48-653-E Gordon Cade 0583-7290-7492-8375 Z538-482-61-543-M Hanna Fay
Output:
Encrypted/Pseudonymized:
Unencrypted:
See this article to compare this FieldShield-specific approach with other IRI tools for masking data in Excel.
Example #7: This IRI RowGen example demonstrates the synthesis of 5,000 rows of test data with a header, targeting both XLS and XLSX files based on random value lookups into static inline sets and set files.
/INFILES=random_file_placeholder /PROCESS=RANDOM /INCOLLECT=5000 /FIELD=(FIELD1, TYPE=ASCII, POSITION=1, SEPARATOR="\t",SET={Dolphin,Fish,Sloth}) /FIELD=(FIELD2, TYPE=ASCII, POSITION=2, SEPARATOR="\t",SET=”moreAnimals.txt”) /FIELD=(FIELD3, TYPE=ASCII, POSITION=3, SEPARATOR="\t",SET={Red,Orange,Yellow,Blue}) /REPORT /OUTFILE=”'Sheet1'!A1:C5001,H;rgen.xlsx” /PROCESS=XLSX /FIELD=(FIELD1, TYPE=ASCII, POSITION=1, FRAME='"',SEPARATOR="\t",SDEF="Animal=A") /FIELD=(FIELD2, TYPE=ASCII, POSITION=2, FRAME='"',SEPARATOR="\t",SDEF="Another Animal=B") /FIELD=(FIELD3, TYPE=ASCII, POSITION=3, FRAME='"',SEPARATOR="\t",SDEF="Color=C") /OUTFILE=”'Sheet1'!A1:C5001,H;rgen.xls” /PROCESS=XLS /FIELD=(FIELD1, TYPE=ASCII, POSITION=1, FRAME='"',SEPARATOR="\t",SDEF="Animal=A") /FIELD=(FIELD2, TYPE=ASCII, POSITION=2, FRAME='"',SEPARATOR="\t",SDEF="Another Animal=B") /FIELD=(FIELD3, TYPE=ASCII, POSITION=3, FRAME='"',SEPARATOR="\t",SDEF="Color=C")
Example #8: Unicode Example with no file line arguments and no SDEFs
This IRI CoSort job script demonstrates sorting Unicode data in a flat file and outputting it into an XLSX sheet with no arguments in the file line and no SDEFs:
/INFILE=chiefs10utf8.txt /PROCESS=DELIMITED /FIELD=(president,POSITION=1,SEPARATOR="|",type=UTF8) /FIELD=(term,POSITION=2,SEPARATOR="|",type=UTF8) /FIELD=(party,POSITION=3,SEPARATOR="|",type=UTF8) /FIELD=(state,POSITION=4,SEPARATOR="|",type=UTF8) /SORT /KEY=state /OUTFILE=chiefs10XLSeasyunicode.xlsx /PROCESS=XLSX /FIELD=(president,POSITION=1,SEPARATOR="|",type=UTF8) /FIELD=(term,POSITION=2,SEPARATOR="|",type=UTF8) /FIELD=(party,POSITION=3,SEPARATOR="|",type=UTF8) /FIELD=(state,POSITION=4,SEPARATOR="|",type=UTF8)
Input:
Output:
Example #9: Sorted US Presidents in Arabic
This script sorts the Arabic translation of U.S. president names and outputs to a specific column in an XLSX spreadsheet, with an English header of “NAME”.
/INFILE=chiefs_arabic /PROCESS=RECORD /FIELD=(name,POSITION=1,SIZE=100,TYPE=UTF8) /SORT /KEY=name /OUTFILE=”'Sheet1'!BA1:BD45,HEADER;chiefsAllXLSeasy.xlsx” /PROCESS=XLSX /FIELD=(name,POSITION=1,SIZE=100,SEPARATOR="\t",SDEF="NAME=BA",TYPE=UTF8)
Input data:
Output sample:
Example #10: Shift data between different sheet names and ranges in XLSX and XLS, filter one field
/INFILE="'data'!WZX1092:XAD2092;dfdfdf.xlsx" /PROCESS=XLSX /FIELD=(SSN, POSITION=1, SEPARATOR='\t',SDEF="=WZX") /FIELD=(FIRST_NAME, POSITION=2, SEPARATOR='\t',SDEF="=WZY") /FIELD=(LAST_NAME, POSITION=3, SEPARATOR='\t',SDEF="=WZZ") /FIELD=(EMAIL, POSITION=4, SEPARATOR='\t',SDEF="=XAA") /FIELD=(GENDER, POSITION=5, SEPARATOR='\t',SDEF="=XAB") /FIELD=(IP_ADDRESS, POSITION=6, SEPARATOR='\t',SDEF="=XAC") /FIELD=(URL, POSITION=7, SEPARATOR='\t',SDEF="=XAD") /INCLUDE WHERE GENDER eq "Male" /REPORT /OUTFILE="'Verylongsheetname'!EX1092:FD2092;outr.xls" /PROCESS=XLS /FIELD=(SSN, POSITION=1, SEPARATOR='\t',SDEF="=EX") # /FIELD=(FIRST_NAME, POSITION=2, SEPARATOR='\t',SDEF="=EY") # don’t include commented field in output /FIELD=(LAST_NAME, POSITION=3, SEPARATOR='\t',SDEF="=EZ") /FIELD=(EMAIL, POSITION=4, SEPARATOR='\t',SDEF="=FA") /FIELD=(GENDER, POSITION=5, SEPARATOR='\t',SDEF="=FB") /FIELD=(IP_ADDRESS, POSITION=6, SEPARATOR='\t',SDEF="=FC") /FIELD=(URL, POSITION=7, SEPARATOR='\t',SDEF="=FD")
Resulting Output:
Example # 11: This script demonstrates writing additional data to the same XLSX file as the data was read from, with an added header in the appended report. NOTE: This functionality is only available with the XLSX format and not the XLS format.
# write a report to the same file as data was read from /INFILE=test.xlsx /PROCESS=XLSX /FIELD=(FIELD1, POSITION=1, SEPARATOR='\t',SDEF="=A") /FIELD=(FIELD2, POSITION=2, SEPARATOR='\t',SDEF="=B") /FIELD=(FIELD3, POSITION=3, SEPARATOR='\t',SDEF="=C") /FIELD=(FIELD4, POSITION=4, SEPARATOR='\t',SDEF="=D") /FIELD=(FIELD5, POSITION=5, SEPARATOR='\t',SDEF="=E") /INCLUDE WHERE FIELD3 eq "December" /REPORT /OUTFILE=”G1:K351,HEADER;test.xlsx” /APPEND /PROCESS=XLSX /FIELD=(FIELD1, POSITION=1, SEPARATOR='\t',SDEF="Day=G") /FIELD=(FIELD2, POSITION=2, SEPARATOR='\t',SDEF="SSN=H") /FIELD=(FIELD3, POSITION=3, SEPARATOR='\t',SDEF="Month (December Only)=I") /FIELD=(FIELD4, POSITION=4, SEPARATOR='\t',SDEF="Integer=J") /FIELD=(FIELD5, POSITION=5, SEPARATOR='\t',SDEF="Decimal=K")
Input:
Output:
These examples demonstrate just an inkling of what is now possible against XLS and XLSX sources and targets using SortCL. In the next article, we will examine the xls2ddf utility, an executable that automatically builds SortCL data definition file (DDF) metadata files (field layouts) from existing spreadsheet headers.