Creating Test Data for Pentaho
This article is the third in a 3-part series on using IRI products to expand functionality and improve performance in Pentaho systems. The series first demonstrates how to improve sorting performance, then introduces ways to mask production data, and finally shows how to create test data in the Pentaho Data Integration (PDI) environment.
Abstract: IRI RowGen generates safe, realistic test data for multiple database and file targets, according to business rules. By calling RowGen jobs from Pentaho, you can supply data with the structure and relationships needed for immediate ETL and BI testing, without exposing personally identifiable information.
While Pentaho Data Integration (PDI) has a number of database tools, it does not have the native capability to create safe, intelligent test data. This becomes important when you want to prototype ETL operations, share new views or reports with co-workers, and develop new applications without relying on production data.
IRI RowGen software populates tables and flat files with benign test data for use in Pentaho and other applications. To do this, use the Shell step in Pentaho to call predefined RowGen job (or batch job) scripts.
We’ll begin the example with tables that are already defined but still empty. RowGen will rely on their DDL information to generate structurally and referentially correct test data in the next step. The Pentaho view of this starting point is shown below:
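As an aside, the idea of referentially correct test data can be illustrated with a minimal Python sketch. This is not how RowGen works internally. The customers and orders tables, their columns, and the function names are hypothetical, chosen only to show that every generated foreign key must point at an existing parent row.

```python
import random


def generate_customers(count, rng):
    """Build parent rows with sequential primary keys and synthetic names."""
    return [
        {"customer_id": i, "name": f"Customer {rng.randint(1000, 9999)}"}
        for i in range(1, count + 1)
    ]


def generate_orders(count, customers, rng):
    """Build child rows whose foreign key is always drawn from existing parents."""
    parent_ids = [c["customer_id"] for c in customers]
    return [
        {
            "order_id": i,
            "customer_id": rng.choice(parent_ids),  # guarantees a valid reference
            "amount": round(rng.uniform(5.0, 500.0), 2),
        }
        for i in range(1, count + 1)
    ]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs give the same data
    customers = generate_customers(5, rng)
    orders = generate_orders(20, customers, rng)
    print(len(customers), "customers,", len(orders), "orders")
```

Drawing each child key from the list of generated parent keys is the simplest way to keep joins in downstream ETL tests from silently dropping rows.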
The next step is to build the test data using RowGen job scripts automatically created in the IRI Workbench GUI, built on Eclipse™. The GUI’s New DB Test Data job wizard for RowGen will connect to the same tables, parse their DDL, and produce a data generation batch operation that will run in Pentaho’s Shell step:
While you can certainly add the Shell step to a larger Pentaho project, I’m only showing the steps needed to run the test data generation job. Create the job with a Start step and use the Shell step to reference the RowGen batch file created above:
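If you want to smoke-test a batch file before wiring it into a job, the basic pattern of running a script and checking its exit code can be sketched in Python. The command used here is a stand-in that only prints a message; it is not a real RowGen invocation, and you would substitute the path to the batch file the wizard produced.

```python
import subprocess
import sys


def run_batch(command):
    """Run a command and return (exit_code, stdout).

    By the usual convention, a non-zero exit code means the script failed.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Placeholder command (assumption): replace with the path to your batch file.
    exit_code, output = run_batch(
        [sys.executable, "-c", "print('test data job finished')"]
    )
    if exit_code != 0:
        raise SystemExit(f"Batch job failed with exit code {exit_code}")
    print(output.strip())
```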
After the Pentaho/RowGen job is executed, you will see your tables populated with the test data. Explore the data source again in Pentaho:
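A row-count check is a quick way to confirm that the tables were populated without browsing them by hand. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely for illustration. The table names are hypothetical, and a real check would connect to your target database through its own driver.

```python
import sqlite3


def row_counts(conn, tables):
    """Return {table_name: row_count} for each listed table."""
    counts = {}
    for table in tables:
        # Identifiers cannot be bound as SQL parameters, so the table
        # names must come from a trusted, hard-coded list.
        cursor = conn.execute(f'SELECT COUNT(*) FROM "{table}"')
        counts[table] = cursor.fetchone()[0]
    return counts


def empty_tables(conn, tables):
    """Return the names of tables that still have no rows."""
    return [name for name, count in row_counts(conn, tables).items() if count == 0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "A"), (2, "B")])
    print(row_counts(conn, ["customers", "orders"]))
    print("Still empty:", empty_tables(conn, ["customers", "orders"]))
    conn.close()
```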
For questions about the use of RowGen or about calling it from third-party applications, email rowgen@iri.com. Also see our previous article on masking production data in Pentaho.