CLF and ELF Web Log Processing
This article is the second in a three-part series on CLF and ELF web log data. We previously explained the CLF and ELF web log formats, and now introduce IRI solutions for manipulating and using web log data. The final article discusses web log data masking.
IRI provides a number of tools for handling CLF and ELF web log data more efficiently, especially when those files contain hundreds of thousands, or millions, of transactions:
- IRI NextForm can filter, reformat, re-map, replicate, federate (virtualize), and report from these logs
- IRI FieldShield can mask, encrypt, and otherwise de-identify personally identifiable information (PII), such as IP addresses
- IRI RowGen can create safe, realistic test data in either CLF or ELF targets, plus custom log formats you define
- The SortCL program in IRI CoSort can do all of the above, plus perform and combine fast sort, join, and aggregate transformations
All of these tools share a common 4GL metadata and Eclipse GUI (IRI Workbench). The job scripts that power these applications rely on data layouts expressed in a simple, self-documenting Data Definition File (DDF) format. The layouts can be referenced in multiple jobs, and/or pasted into individual scripts.
For CLF File Users
IRI software supports the following formats with ready-made metadata repositories:
Format | Description | Layout File |
Common (Access) Log | contains basic information from the log | CLF_Access.ddf |
Referral Log | contains corresponding referral information | CLF_Referral.ddf |
Agent Log | contains corresponding agent information | CLF_Agent.ddf |
Each DDF metadata repository template contains the /FIELD specifications that IRI software job scripts require.
For ELF File Users
ELF files have a header containing lines of comments, followed by a line naming the data fields. IRI programs will skip the header when processing source data in the log when /PROCESS=ELF is specified in the input section of the job script. To generate a header record in an ELF target that uses the file’s field names and positions, specify /PROCESS=ELF in the output section of the job script.
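To illustrate the header handling described above outside of IRI software, here is a minimal Python sketch. It assumes the W3C Extended Log File Format conventions (directive lines start with "#", and the "#Fields:" directive names the columns). The function name read_elf and the sample log text are hypothetical, and this is not how /PROCESS=ELF is implemented.

```python
import io

def read_elf(lines):
    """Yield one dict per data record, skipping '#' directive lines.

    Simplification: values are split on whitespace, so quoted strings
    that contain spaces are not handled by this sketch.
    """
    fields = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The '#Fields:' directive names the columns of the records that follow.
            if line.lower().startswith("#fields:"):
                fields = line.split(":", 1)[1].split()
            continue  # every directive line is part of the header, so skip it
        yield dict(zip(fields, line.split()))

# Hypothetical sample input in W3C extended log format
sample = io.StringIO(
    "#Version: 1.0\n"
    "#Date: 12-Jan-1996 00:00:00\n"
    "#Fields: time cs-method cs-uri\n"
    "00:34:23 GET /foo/bar.html\n"
    "12:21:16 GET /foo/bar.html\n"
)
records = list(read_elf(sample))
print(records[0])
```

Each record comes back keyed by the names found in the header, which is the same information a field layout needs: the field names and their positions in the record.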
Note that you can automatically generate the data definitions from ELF log files for use in IRI software jobs. The “ELF2DDF” (Extended Log Format-to-Data Definition File) utility is a command-line translation program for converting W3C web data descriptions to DDFs.
ELF2DDF works by scanning web log headers to produce a descriptive file name and field layout specifications. ELF2DDF is also a GUI-supported option. Select it from the drop-down menu in the IRI Workbench metadata conversion wizard.
Web Log Data Integration and Masking (Combined) Example
The web log file below records each visitor’s IP Address, User, Date, Time, Port, User Request, Method, Status, Bytes transferred, and User Agent.
A “-” in a field indicates missing data.
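The missing-data convention is easy to honor when parsing log lines in a general-purpose language. The Python sketch below assumes the standard Common Log Format layout (host, ident, user, bracketed timestamp, quoted request, status, bytes), which is simpler than the log described above. The pattern, the function name parse_clf, and the sample line are illustrative assumptions, not part of any IRI tool.

```python
import re

# Standard Common Log Format: host ident user [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\S+) (?P<bytes>\S+)'
)

def parse_clf(line):
    """Return a dict of CLF fields, or None if the line does not match.

    A lone '-' marks missing data, so it is converted to None.
    """
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    return {
        name: (None if value == "-" else value)
        for name, value in match.groupdict().items()
    }

# Hypothetical sample line using a documentation-only IP address
entry = parse_clf(
    '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 -'
)
print(entry)
```

Converting "-" to None at parse time keeps later steps, such as summing bytes transferred, from mistaking the placeholder for a real value.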
The table below contains customer information from another source, including: IP Address, User ID, Phone Number, and Name:
The job script below, written for the IRI CoSort package’s SortCL program, brings the two input sources together. In the same job script and I/O pass, the web log and customer table are sorted, joined, masked, and reformatted to produce an output report:
/INFILE=LOG
    /PROCESS=RECORD
    /ALIAS=LOGFILE
    /SPECIFICATION=metadata/logfile.ddf
/INFILE="QA.CUST;DSN=OracleTwisterQA"
    /PROCESS=ODBC
    /ALIAS=CUST
    /SPECIFICATION=metadata/cust.ddf
/JOIN INNER NOT_SORTED LOG NOT_SORTED CUST WHERE LOGFILE.remotehost == CUST.remotehost
/OUTFILE=weblognew.out
    /HEADREC="Client-IP ENC-IP USERNAME CUSTOMER NAME \n\n"
    /FIELD=(LOGFILE.REMOTEHOST, TYPE=IP_ADDRESS, POSITION=1, SIZE=13, FRAME='\"')
    /FIELD=(MASK_CUST.REMOTEHOST=replace_chars(CUST.REMOTEHOST), TYPE=IP_ADDRESS, POSITION=16, SIZE=13, FRAME='\"')
    /FIELD=(CUST.USERID, TYPE=ASCII, POSITION=32, SIZE=16, FRAME='\"')
    /FIELD=(CUST.CUSTOMER, TYPE=ASCII, POSITION=43, SIZE=15, FRAME='\"')
The sources were joined over the visitor’s IP Address (remote host), and that key field was also the one masked with the replace_chars() function. The output below shows the integrated and protected result of the consolidated operation:
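For readers who want to see the logic of that job in a general-purpose language, the Python sketch below walks through the same steps: sort the log rows, inner-join them to customers on the IP address, and emit both the clear and masked key. It is a conceptual sketch only. The function names, the sample rows, and the masking rule (every digit becomes "*") are assumptions; the actual behavior of replace_chars() is not reproduced here.

```python
def mask_ip(ip, mask_char="*"):
    """Replace every digit with mask_char and keep the dots.

    Hypothetical stand-in for a character-replacement masking function.
    """
    return "".join(mask_char if ch.isdigit() else ch for ch in ip)

def join_and_mask(log_rows, cust_rows):
    """Inner-join log rows to customer rows on 'remotehost', then mask the key."""
    # Index customers by IP address so each log row can find its matches.
    cust_by_ip = {}
    for cust in cust_rows:
        cust_by_ip.setdefault(cust["remotehost"], []).append(cust)

    report = []
    # Sort on the join key (a plain string sort, not a numeric IP sort).
    for row in sorted(log_rows, key=lambda r: r["remotehost"]):
        # Inner join: log rows without a matching customer are dropped.
        for cust in cust_by_ip.get(row["remotehost"], []):
            report.append({
                "client_ip": row["remotehost"],
                "masked_ip": mask_ip(cust["remotehost"]),
                "userid": cust["userid"],
                "customer": cust["customer"],
            })
    return report

# Hypothetical sample rows using documentation-only IP addresses
logs = [
    {"remotehost": "192.0.2.7"},
    {"remotehost": "192.0.2.1"},
    {"remotehost": "198.51.100.9"},  # no matching customer
]
customers = [
    {"remotehost": "192.0.2.1", "userid": "u1", "customer": "Alice Example"},
    {"remotehost": "192.0.2.7", "userid": "u2", "customer": "Bob Example"},
]
report = join_and_mask(logs, customers)
for line in report:
    print(line)
```

Unlike the single-pass SortCL job, this sketch holds both inputs in memory, so it is suited only to small samples.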
Results can also be sent to stdout (instead of a saved file or table); such ad hoc views are typical of data federation or virtualization projects.
See the next article on How to Mask Data in Web Logs for information on IRI solutions for protecting clickstream data at the field level.