Overview
Aggregate
Cross-Calculate
Custom Transforms
Formats/Keys
Lookup
Match/Join
Pivot/Unpivot
Data Remapping
Scrub/Cleanse
Select/Filter
Sort/Merge
Substrings
Type-Convert
Challenges
There are many lower-level, data-value-related problems that ETL architects face, such as:
Issue | Examples or Details |
---|---|
Applying masks to data formats to represent or re-interpret values |
ID'inggender, 'F' or '2' or masking dates: mm/dd/yy to |
Homogenizing inconsistent production data keys |
assigning unique numeric surrogates for recognition and faster indexing |
Pivoting and unpivoting |
transposing columns to rows and vice versa without duplicates |
Inserting constants or literal strings |
replacing or adding substitute values to protect or pre/re-label data |
Setting or labeling values |
evaluating a condition to determine a default label like 'NULL' |
Validating values |
determining if the inbound field is printable, or of a certain data type |
Re-casting values to conform to reference standards |
unifying different data representations (like country names) |
Slowly changing dimensions in source tables |
propagating new values (like a price change) in warehouse tables |
Solutions
The IRI Voracity ETL/ data management platform and the IRI CoSort tool powering Voracity address all the above data handling issues, plus:
- key and other data transformations
- data type and data format conversion
- basic and advanced report generation
- PII masking and custom test data synthesis
through their common 4GL data manipulation program called SortCL, which is also supported in their Eclipse GUI called IRI Workbench.
With SortCL, you can:
- Remap dates and other data formats using new composite values, and re-usable data types
- Re-assign source key names to adhere to new ontologies
- Pivot (denormalize) for efficient queries, and unpivot(normalize) before loading data to the warehouse
- Use in-line 'data' statements to add strings to any location, and with custom repetition counts
- Use conditional data or field logic to label qualifying values, like inserts vs. deletes vs. updates, etc.
- Use supplied data validation functions or third-party libraries to assess data characteristics or quality
- Use integrated Perl Compatible Regular Expression (PCRE) logic to match and re-assert data patterns
- Handle and report on multiple types of slowly changing dimensions and other changed data
IRI data manipulation tools like CoSortand RowGen (for synthetic test data generation) create unique initial or surrogate primary keys with features like:
- SEQUENCER, a specially named index field with custom starting and increment values
- random data generation, to create random numbers or strings, which can also be sorted and de-duplicated
- ROWID, a field with any name but specified as this attribute with features beyond SEQUENCER
- UUID (and GUID), a field-level transformation/generation function that can be used in any part of a job