Creating Test Data for Cassandra

by Nathan Dymora

UPDATE: Q1’17: Native JSON file generation, and available JDBC and ODBC drivers for NoSQL RDBs like Cassandra, now provide a more seamless connection than the CSV/import approach shown below. Please contact rowgen@iri.com or voracity@iri.com for details.

DataStax/Apache Cassandra cannot readily generate and populate realistic prototypes for testing queries or planning capacity; the Cassandra stress tool only inserts random values at the time of this writing. This article explains how to use the IRI RowGen test data synthesis product (or the IRI Voracity data management platform which includes RowGen and IRI FieldShield & DarkShield data masking products) to create and load artificial but realistic test data into Cassandra via a CSV file with production data characteristics.

In this example, we know that our table will contain customers with Usernames, First and Last Names, Email Addresses, and Credit Card Numbers. To create our test data, we must first generate some set files containing test values for each of those categories. A set file is a list of one or more tab-delimited values that may already exist, or need to be generated manually or automatically from database columns through the ‘Generate New Set File’ wizard in IRI RowGen.

Of course, you will need to consider the structure and content of the test data for your table needs. See this article for typical planning considerations.

Generating Names

1) Create a compound data value (first and last names combined) job script named “CreateNamesSet.rcl” that RowGen can execute to produce a set file. Call the output “User.set” because these names will also be used as the basis for our usernames.

2) Create three fields to be generated in User.set: last name, tab separator, and first name. Name the first field “LastName” and choose the method that will select values from an IRI-provided set file called “names_last.set”. Add the literal value “\t” to add a tab separator, and then repeat the process used for LastName and FirstName values using “names_first.set”.

3) Run CreateNamesSet.rcl with RowGen, either on the command line or from the IRI Workbench GUI, to produce the tab-delimited User.set file of first and last names, which will be used in both the generation of usernames and in the final test file build that populates our prototype collection.

Generating Usernames
For Usernames, we will create a set file that utilizes the Users.set file generated above. Usernames for this example will combine last name, first initial, and a randomly generated number between 100 and 999.

1) Create a new RowGen job script with the Compound Data Wizard, call it “CreateUsernamesSet.rcl”, and name the output set file “Usernames.set”.

2) Build compound username values with three components named Part1, Part2, and Part3.

3) For Part1, choose the method that will select values from (browse to) the previously-generated User.set file, and specify ‘ALL’ for the selection type to maintain the association between users, usernames, and email addresses. Set the size to 5.

4) For Part2, repeat the process used for Part1, except for Selection type, select ‘Row’ and set Column Index to 2. Set the size to 1. This guarantees all the last names will be used in the generation, and that the first letter of the first name in the same row is appended to the user name.

5) For Part3, specify the generation of a numeric value between 100 and 999 to suffix a random integer with each username.

Upon execution of CreateUsernamesSet.rcl, we see that each username contains the first five letters of their last name, then their first initial, then a random 3-digit number:

Generating Emails
Next we will create an email set file that appends the username values with randomly-selected domain names. Because some email services are more popular than others, we will also create a weighting system to reflect a higher frequency of Yahoo and Gmail domains.

1) Run RowGen’s ‘New Custom Test Data’ job wizard to create a job called “CreateEmailsSet” that produces a set file called “Emails.set”.

2) Produce the username part of the email. In the Test Data Definition dialog, click New Field, and rename the first field Usernames. Double-click on it to launch the Generation Field dialog and “Define …” its Set file as Usernames.set. Set the size to 9 and click OK.

3) Produce the domain part of the email (which includes the @ symbol). In the Layout Fields dialog, click New Field and rename it to “address” and double-click on it. In the Generation Field dialog, specify a ” ,” with a position of 10 and a size of 20. In the Data Generation / Data Distribution section below, click “Define …” to name a new data distribution of items “WeightedEmails”.

4) In the New Distribution Wizard, chose ‘Weighted Distribution of Items’ and enter these items into the ratio and literal text boxes respectively, then add each to the list.

(32 | @gmail.com), (32 | @yahoo.com), (2 | @ibm.com), (4 | @msn.com), (2 | @ymail.com), (2 | @inmail.com), (2 | @cnet.net), (2 | @chase.org), (1 | @iri.com), (1 | @gdic.com), (1 | @aci.com), (2 | @oracle.net), (1 | @gmx.org), (4 | @aol.com), (2 | @inbox.com), (2 | @hushmail.com), (2 | @outlook.com), (2 | @zoho.com), (2 | @yandex.net), (2 | @mail.com)

After you enter these values, click Next in the original wizard to move into the Data Targets dialog. Use “Add Data Target …” to specify the output file “Email.set”. This will also be used at collection-build time.

The email we set the highest weights for (Gmail and Yahoo) show up most frequently, with others showing up periodically.

Generating Credit Card Numbers
Lastly, we will create computationally valid card numbers in the format XXXX-XXXX-XXXX-XXXX. The first four digits reflect actual Issue Identifier Numbers (IIN) of various credit card companies, and the last digit verifies the cards’ authenticity.

To do this, create and run a new (empty) job. Call it “CreateCCNSet.rcl” (or .scl), and populate it with the script below to create “CCN.set”. The /INCOLLECT value in RowGen scripts determines the number of rows generated.

RowGen’s purpose-built CCN generation function, ccn_gen(“ANY, “-“) is called to populate this field. Note similar functions exist for US and Korean social security numbers, and the national IDs of Italy and The Netherlands.

Creating the Final Test File
With all set files built, it is time to use them in the test CSV file we’ll create and export to a Cassandra collection.

1) Run RowGen’s ‘New Custom Test Data’ job wizard to create a job called “CreateCassUserData.rcl” that will generate the Customers.csv file, the file we will then export to Cassandra.

2) Click “Layout Fields …” to enter the Layout Fields Dialog. Click New Field and rename the first field to Usernames. Double-click on it to launch the Generation Field dialog and “Define …” its Set file as Usernames.set; then select ALL for its selection type.

3) Click New Field and rename the second field to LastNames. Double-click on it to launch the Generation Field dialog and “Define …” its Set file as Users.set; then select ALL for its selection type.

4) Click New Field and rename the third field to FirstNames. Double-click on it to launch the Generation Field dialog and “Define …” its Set file as Users.set; then select ROWS for its selection type and set the column index to 2.

5) Click New Field and rename the fourth field to Email. Double-click on it to launch the Generation Field dialog and “Define …” its Set file as Emails.set; then select ALL for its selection type.

6) Click New Field and rename the fifth field to CreditCardNumbers. Double-click on it to launch the Generation Field dialog and “Define …” its Set file as CCN.set; then select ALL for its selection type.

7) After you enter these values, click Next in the original wizard to move into the Data Targets dialog. Use “Add Data Target …” to specify the output file Customers.csv; then run the script in the Workbench or on the command line to generate that file:

rowgen /spec=CreateCassUserData.rcl

Note that RowGen, in addition to producing this CSV file at runtime, could have also produced multiple other file, database, formatted-report, named-pipe, procedural, and even real-time BIRT display, with fields from the generated test data, all at the same time.

Importing to Cassandra
To import the CSV file into your Cassandra Database, call the following COPY command:

COPY <Table you are importing data to> (field1fromCSV, field2fromCSV, ...) FROM '<Path to CSV>';

Here are the records in the test table

For more information on the generation options available, see the Test File Targets section at: http://www.iri.com/products/rowgen/technical-details.

Generating Test Credit Card Numbers

PII Masking in Cassandra with IRI FieldShield