Challenges
"The use of pre-sorted summaries (aggregates) is the single most effective tool the data warehouse designer has to control performance."
"The Data Warehouse Toolkit" - Dr. Ralph Kimball
The main problem with aggregate data processing is speed in volume. SQL "group by" aggregations run in databases for star schemas and ETL operations drag as data volumes increase.
Another problem in performing aggregate data analysis is that most COTS products, including ETL tools, do not support advance data aggregation techniques like runnning totals, or the combination of aggregate functions with cross-calculation or custom report formatting.
Solutions
The SortCL program in the IRI CoSort package and IRI Voracity data management (ETL +) platform computes huge fact-table, drill-down, and roll-up aggregates with extraordinary ease and efficiency. SortCL combines parallel data aggregation with high-volume sort, join, and report operations in the same job script and I/O pass.
Use SortCL to rapidly produce output values derived from accumulated detail records -- at multiple break levels (including the final aggregate). The related data aggregation techniques are:
- Summarization (Totals)
- Averaging
- Maximum
- Minimum
- Counting
- Multiplication
- Ranking
- Sequencing
- Standard Deviation
Here is a sample data aggregation job in the IRI Workbench graphical IDE for Voracity ETL operations showing the CoSort SortCL job script and its transform mapping diagram:
Here's more of what you can do:
- Display final values at the end of a file; i.e. roll-up aggregations, and exploit SortCL wherever large, simultaneous sorting and grouping work is required.
- Group data based on inter- and intra-record Boolean break conditions for EIS summary, or drill-down analyses on many types of numeric data. This is a great feature for writing detail and summary reports as well as aggregate fact table records.
- Reflect transaction aggregates based on various field combinations (e.g., total sales by SKU, employee, and location) by spinning off individual sales reports.
- Format summary records differently at each level. Write each level to a separate target table or file, and merge the sub-levels into a structured report for aggregate data analysis.
Other options include running (accumulating) aggregates, Sybase-style windowed aggregates, and aggregates on cross-calculated values. Use these features for ad hoc presentations and complex trend analyses, with (or without) all of the other transformations and formatting options.