CoSort Tackles Another Major ETL Project
Category
DW Design, Administration & Management
Reviewer
Byung-Chul, Min, adviser for NACF.
Background
Korea's National Agricultural Cooperative Federation (NACF) was established in 1961 to increase agricultural productivity and enhance the welfare of five million member farmers in some 1,200 cooperatives. The NACF's business network handles marketing, processing, supply of farm inputs and consumer goods, credit and banking, insurance, warehousing, transportation, farm extension and relevant support activities such as research and publication. The NACF has a 40 percent share of Korea's agricultural market and is also one of the country's largest deposit banks.
Platforms
The NACF's data warehouse is staged on a multi-CPU IBM RS/6000 S80 server running AIX 4.3.3.
Problem Solved
The NACF's first attempt at a data warehouse project failed in 1999 because of the complexity of the database design and a tight schedule. The consultants had difficulty finding a proper ETL solution for handling the massive data volumes involved. The initial table data extracted from the main DB log was estimated above 100GB, and the data to be loaded exceeded 400GB. The total size of the data warehouse was approximately 1TB. In addition to the initial load, daily jobs needed to process 3GB of data and load 10GB to the data warehouse. NACF's second attempt at building the warehouse began in mid-2000, with successful project completion in the spring of 2001. By choosing and using CoSORT as a massive flat file manipulation solution, it took only two months to design and finish all of our ETL process after the database design was set up. Both our consultants and engineers were surprised at and grateful to CoSORT Korea for the implementation turnaround.
Product Functionality
The major features of CoSORT's sort control language (sortcl) program used in the NACF data warehouse are sorting, merging, joining (matching), data conversion, record filtering and summarizing (aggregation); in other words, flat file ETL. As the data sizes are very big, CoSORT's performance tuning options were naturally critical. Through resource control files, we can tweak how much memory each sort process will use (maximum and minimum amounts), shared memory, I/O buffer sizes, workspace assignments, and so on. User exits, as well as CoSORT's major features, were written and linked with sortcl to create tables using complex business logic.
Strengths
CoSORT's speed in sorting and joining massive, multi-gigabyte files was the key to NACF's data warehouse success. CoSORT's built-in data type conversion feature (EBCDIC to ASCII) was also helpful and reduced the number of data transformation steps we need to perform.
Weaknesses
CoSORT was unable to perform joins on more than two files at a time. Our developers would like to see a multitable join within a single sortcl script; currently they accomplish this by piping scripts.
Selection Criteria
CoSORT was chosen because of the speed given to the process. Sorts of 1GB were completed in less than four minutes on our heavily loaded system. Massive file joins at I/O speed had another very strong appeal for NACF developers. NACF developers had reviewed a popular ETL tool prior to CoSORT, but gave up on it because of its slow sort and join performance and relative inconvenience and expense.
Deliverables
Although CoSORT is actually a suite of sorting programs, third-party sort replacements (drop-ins) and API calls, we focused on its most feature-rich interface sortcl. The sort control language is a mainframe sort and SQL-like 4GL for defining and manipulating various flat file formats. Whether data definition files, job specification scripts or some combination, sortcl scripts are text files we easily create and modify using VI or CoSORT's Java GUI for sortcl called gui2scl. The! GUI allows users with no sortcl knowledge to read, write, modify and launch sortcl job scripts on the client where the GUI is and/or a remote Windows or UNIX server. The user exits (in this case, the input procedure) have to be customized by the user, but the product is flexible enough to accept them in combination with natively supported commands. The sortcl program can produce all manner of structured, summary and detail reports, as well as handoffs to the database.
Vendor Support
Both the developers at CoSORT/IRI in the USA and the assisting implementers at CoSORT Korea were highly responsive to our customization requests.
Documentation
The online (and matching hard copy) documentation was translated into Korean, including many sortcl scripting examples. Basic usage of sortcl was so easy that the CoSORT manual was rarely used.