Flat-File Profiling
Update: Q2’16: In addition to the flat-file profiling wizard in the data discovery menu group in IRI Workbench described below, IRI has introduced robust data classification that enables the application of field rules for multi-source data transformation and protection through data class libraries. Update Q2’19: IRI now provides a multi-file data class masking wizard for users IRI FieldShield (or Voracity), and just published this article to show how the DB profiling results below display in Splunk.
Data profiling is the essential discovery process that helps you analyze, classify, cleanse, integrate, mask, and report on data in your repositories. With the information profile processes produce, you can make decisions about data integration and analysis, migration and masking, etc.
As with relational and modern data sources, awareness of the content and character of structured files helps data architects assess and improve the usability, quality, and security of these less sexy, but still mission-critical collections.
IRI previously released dark (document) data and relational database profiling modules in the IRI Workbench GUI for all IRI software, built on Eclipse™. Similar profiling of data in flat files is now also in this GUI — which is free to all IRI software product users — and included in the free Lite edition of the IRI NextForm data migration product. In other words, all IRI data profiling tools are free.
Using a graphical wizard similar to the database profiler, it is now possible to scan files and forensically assess their characteristics. Key statistics — such as data counts, aggregates, field lengths, minimum and maximum values — plus fuzzy string matches and patterns, are now readily available and covered in the how-to introduction that follows.
Inputs
The first part of the wizard is the input selection page, which is comprised of a few different elements:
- The display where the selected file to profile will be shown.
- The Add Data Source and Edit Source Option buttons. These buttons allow you to select the file to profile.
- The Add Existing, Discover, and Convert Metadata buttons. These buttons allow you to select the metadata for the input file.
To choose the input file, click Add Data Source, and select the file to be profiled. Then click the appropriate button to specify metadata for the input file. You can select an existing data definition file (DDF) or auto-create one via metadata discovery or conversion. Once your input file and its metadata are displayed, click Next to move on.
Options
The second part of the wizard is the field selection page, which displays a table of profiling statistical output to produce, followed by the fields in the file on which you choose to generate the values.
The options matrix will automatically change to represent the fields of the selected file. There are several ways to handle the viewing options:
- For all options, click the top-left box All, and all the metadata will be reported.
- For basic options only (counting and values), select Basics.
- For length options only (value lengths), select Lengths.
- If you have many fields and want the same option for all of them, click the option name itself, and all the fields will have that option selected.
- Once everything is set, click Next or Finish.
Expression Search
A unique choice in the options table is the Expression Search. This option allows you to search fields against a variety of search options. These options are:
- Regular Expressions (Pattern Search). This locates and counts the number of times a value matches the format of a search pattern.
- Fuzzy String. This option allows you to search for strings similar to those you enter, and to select or specify search conditions.
- Values File. This option allows you to compare a string to every string in a set file and count each string that has a match.
The Expression Search page has 6 important sections.
- A Search Type combo box to select the type of search to perform.
- The options group that changes depending on the search type selected.
- Regular Expression: has two buttons; browse which browses existing regular expressions, and Create, which allows the creation of new regular expressions.
- Fuzzy String: has a counting box that specifies the threshold of the fuzzy search (how close the strings have to be to be considered a match), and a combo box to select the fuzzy search algorithm to use.
- Values File: has a Browse button that lets you search for the set file to use for the value search.
- A text box where you will enter the data for your search.
- A drop-down list of the input files to which you can apply the expression search.
- A drop-down list of the fields to which you can apply the expression search.
- A table that lists the searches you have created that will be performed by the profiler.
To create a regular expression filter:
- Set the Search Type combo to Regular Expression.
- Click Browse to (your library of saved expressions), or Create to specify a regular expression to use in searching for field values.
- In the File menu, select the table that contains the field to filter.
- In the Field menu, select the field to which the regular expression should be applied.
- Click Add to Table, and an item will appear in the table below that contains the file name, field name, search source, threshold, and regular expression label that make up the filter.
- Repeat this process for each field to which you want to add a filter.
To create a fuzzy search:
- Set the Search Type combo to Fuzzy Search.
- Type the string to use for fuzzy search.
- Select the number of results to return (this option will appear when Fuzzy Search is selected).
- Select the fuzzy search algorithm to use (this option will appear when Fuzzy Search is selected).
- In the File drop-down menu, select the file that contains the field to fuzzy search.
- In the Field drop-down menu, select the field to which the fuzzy search should be performed.
- Click Add to Table, and an item will appear in the table below that contains the file name, field name, search source, threshold, and the search type of the fuzzy search to be performed.
- Repeat this process for each field where you want to perform a fuzzy search.
To create a values search:
- Set the Search Type combo to Values File.
- Click Browse to select a set file that the field will be checked against.
- In the File menu, select the table that contains the field to filter.
- In the Field menu, select the field to which the regular expression should be applied.
- Click Add to Table, and an item will appear in the table below that contains the file name, field name, search source, threshold, and values search label that make up the filter.
Profiling Results
Click Finish to complete the job. The results of the profiling criteria you specified display in a CSV file.