Big Data Transformations with CoSort (Structured Data)
In 1992, Digital Equipment Corporation (DEC, long since acquired) asked IRI to develop a 4GL interface to CoSort in the syntax of the VAX VMS sort/merge utility. The result was the now widely adopted Sort Control Language (SortCL) program, which defines data layouts and manipulations that go well beyond sort/merge.
SortCL now handles everything from data transformation and reporting to data migration and protection, and is the core of multiple spin-off products and a metadata infrastructure modeled and managed in the IRI Workbench GUI, built on Eclipse™.
In 1999, Database Trends Magazine studied the data transformation functions then in SortCL and labeled CoSort “The ETL Engine” in an edition dedicated to data warehousing. Indeed, since the mid-1990s, hundreds of DW architects and thousands of EDW, ODS and database users around the world have deployed SortCL scripts directly, or within applications they use, to transform massive amounts of sequential data with built-in functions they can run alone or in combination, such as:
Sort/Merge | Match/Join |
Select/Filter | Aggregate |
Find/Replace | PCRE |
Lookup | Pivot |
Rank | Scrub/Cleanse |
Remap/Reformat | Substring |
Convert | Validate |
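To make the idea of running these functions in combination concrete, here is a minimal Python sketch (not SortCL syntax, and not how CoSort is implemented) that filters, sorts, and aggregates pipe-delimited records in one job. The field layout (region, customer, amount), the sample rows, and the helper name `transform` are all assumptions made for illustration.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Hypothetical sample input with the layout: region|customer|amount
SAMPLE = """east|acme|120.50
west|zenith|75.00
east|bolt|30.25
west|apex|-10.00
north|core|200.00
"""

def transform(text, min_amount=0.0):
    """Filter, sort, and aggregate delimited records.

    Returns (detail_rows, summary), where summary maps region -> total amount.
    """
    reader = csv.reader(io.StringIO(text), delimiter="|")
    # Select/Filter: keep rows whose amount is at least min_amount
    rows = [(r[0], r[1], float(r[2])) for r in reader
            if r and float(r[2]) >= min_amount]
    # Sort: order by region, then by customer
    rows.sort(key=itemgetter(0, 1))
    # Aggregate: sum amounts per region over the already-sorted rows
    summary = {
        region: sum(row[2] for row in group)
        for region, group in groupby(rows, key=itemgetter(0))
    }
    return rows, summary

detail, totals = transform(SAMPLE)
```

Because the rows are already in key order after the sort, the aggregation needs only one sequential pass over them, which is the general reason sorting and summarizing pair well in a single job.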
In addition to the price-performance advantages made possible with the underlying CoSort engine and its
- linearly scaling, multi-threaded, co-routine sorting algorithm
- sophisticated memory management and good neighbor I/O
- same-script/same-pass marriage of sorting to joins and aggregations
- thread-safe APIs, and custom input, compare, output, and field functions,
SortCL is also:
- cross-platform, by running on every flavor of Unix and Windows with the same scripts
- self-documenting, via a language familiar to both mainframe and SQL users
- easily invoked, and widely interconnected to third-party applications
- interchangeable, through scripts you can easily convert to and from.
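The pairing of sorting with joins mentioned in the list above can be illustrated with a generic sort-merge join. The Python sketch below is only an assumption-laden illustration of that textbook technique, not CoSort's algorithm or SortCL syntax. The function name `sort_merge_join` and the sample tuples are hypothetical, and the join key is assumed to be the first field of each record.

```python
from operator import itemgetter

def sort_merge_join(left, right, key=itemgetter(0)):
    """Inner-join two record lists by sorting both on the key and merging.

    Assumes the key is the first field, so right-side fields after the
    key are appended to each matching left-side record.
    """
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1          # left key has no match yet; advance left
        elif lk > rk:
            j += 1          # right key has no match; advance right
        else:
            # Find the run of right-side rows sharing this key
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            # Pair every left row with this key against that run
            while i < len(left) and key(left[i]) == lk:
                for r in right[j:j_end]:
                    out.append(left[i] + r[1:])
                i += 1
            j = j_end
    return out

# Hypothetical inputs: (customer_id, name) and (customer_id, amount)
customers = [(3, "core"), (1, "acme"), (2, "bolt")]
orders = [(2, 30.25), (1, 120.50), (1, 9.99), (4, 1.00)]
joined = sort_merge_join(customers, orders)
```

Once both inputs are in key order, matching rows are found in a single forward pass over each, so the join adds little work beyond the sort itself.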
IRI’s sweet spot in the market remains the integration and staging of huge flat files, which include bulk database extracts (e.g. from IRI FACT operations), mainframe datasets, web and IoT device logs, spreadsheet and application exports, PoS server and telco switch (CDR) feeds, COBOL and shell programs, and so on. With CoSort (SortCL) running in IRI Voracity workflows that include FACT (E) and table creation and bulk load (L) steps, end-to-end ETL jobs are built and run quickly in Eclipse or on the command line.
In Voracity, most SortCL jobs can run either in the default CoSort engine, or seamlessly in Hadoop MapReduce, Spark, Spark Stream, Storm, or Tez. Either option provides an extremely high-speed, simple, and low-cost approach without changing code.
More advanced users can write custom detail and summary reports and protect data at the field level in the same SortCL job script and I/O pass as their transforms. Data in HDFS, unstructured sources, or otherwise non-sequential/non-relational formats can pass through drivers, or through memory via custom input procedures that structure and feed that data to CoSort (or Hadoop) for fast transformations and hand-offs to DB loads, data marts, visualization tools, etc.
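As a rough illustration of what a custom input procedure does conceptually, the Python sketch below structures raw log lines into records in memory and feeds them directly to a sort step. It does not use the CoSort API. The log format, the regular expression, and the function name `input_procedure` are assumptions invented for this example.

```python
import re
from operator import itemgetter

# Hypothetical log format: "<ISO timestamp> <device id> temp=<value>"
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+temp=(-?\d+(?:\.\d+)?)$")

def input_procedure(lines):
    """Yield structured (timestamp, device, temp) records from raw lines.

    Lines that do not match the assumed pattern are skipped.
    """
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            yield (m.group(1), m.group(2), float(m.group(3)))

raw = [
    "2024-01-01T00:00:05Z dev-b temp=21.5",
    "garbage line without structure",
    "2024-01-01T00:00:01Z dev-a temp=19.0",
    "2024-01-01T00:00:03Z dev-a temp=19.4",
]

# Records are structured in memory and handed straight to the sort step,
# ordered by device and then by timestamp
records = sorted(input_procedure(raw), key=itemgetter(1, 0))
```

The generator keeps parsing separate from transformation, so the downstream step only ever sees fixed-layout records regardless of how irregular the source is.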