Managing MarkLogic Data in IRI Voracity
The IRI Voracity data management platform now supports the MarkLogic NoSQL database as a source for structured data discovery (classification, profiling, and search), integration (ETL, CDC, SCD), migration (conversion and replication , governance (data cleansing and masking), and analytic (reporting and wrangling) jobs. In this article, I explain how to set up the MarkLogic server for SQL operations, and configure Voracity to source MarkLogic data via ODBC in IRI Workbench.
MarkLogic Server is the Enterprise NoSQL Database that combines database internals, search-style indexing, and application server behaviors. It uses XML documents as its data model, and stores the documents within a fully ACID compliant transactional repository. It indexes the words and values from each of the loaded documents, as well as the document structure. And, because of its unique Universal Index, MarkLogic doesn’t require advance knowledge of the document structure (its “schema”), nor complete adherence to a particular schema. Through its application server capabilities, it is programmable and extensible.
To set up the MarkLogic server for ODBC access, I need to create:
- a SQL database
- range indexes for the database
- data fields
- an ODBC apps server
Set up MarkLogic Server
Install MarkLogic on your network and reach it through the browser in IRI Workbench for convenience. Select Windows=>Show View=>Other=>Internal Web browser and navigate to http://hostname:8001:
Configure ODBC Server & Create the Database
To create a SQL-ready database in MarkLogic, the first Configure tab step is to create a “forest” and attach it to the database, which I named SQLData1. I then created an ODBC Server (shown below, via Groups, Apps Server) named SQL with Port number 5432. In the modules field, select (file system) to store MarkLogic documents, and in the database field, select the SQLData1 database we created.
Click OK to save these settings. Next, click to expand Databases in the explorer pane, and under SQLData1, create range element indexes to define each column name and data type for use in multiple tables within a schema we will later call “main”:
Creating Tables (Views) in MarkLogic
Given that we have previously defined columns for use, we can assign them to a new schema which will have a series of defined tables or views. To create the schema, use a Curl command like this:
curl -X POST –anyauth –user admin:admin –header “Content-Type:application/json” -d ‘{“view-schema-name”: “main”}’ http://localhost:8002/manage/v2/databases/SQLData1/view-schemas/?format=json
Once I create the schema ‘main’ I will create a view called ‘emps’, which contain some of the previously defined range element index IDs (or columns); e.g., ‘firstname’, ‘lastname’, and ‘employeeid’ range indexes. Employeeid uses the integer data type, while FirstName, LastName use a string.
Curl Code in Cygwin prompt
Through these views, SQL inserts and queries via ODBC will work in MarkLogic’s Query Console (below), and thus, operations on this data in IRI Voracity as well. For more detailed instructions in this area, refer to https://docs.marklogic.com/guide/sql/setup.
Loading & displaying data in MarkLogic Query Console
In the IRI Workbench internal browser, I can access the MarkLogic Query Console to do ad hoc queries, insert XML or JSON documents, or RDF Triples. In this case, I will use it to enter (load / insert) the actual data elements into my now SQL/ODBC-ready view, emps, via JavaScript. Each row is stored as a JSON document in this case, and can be queried with SQL syntax.
ODBC connection in IRI Workbench
Once the backend DB is configured, we must configure its ODBC driver for use with Voracity. From the IRI Workbench, I click on the toolbar’s IRI icon, and select Data Connection Registry. From there, click add:
From the ODBC Data Source Administrator window, use the System DSN tab and Configure … to enter the connection parameters to MarkLogic. In the MarkLogic SQL ODBC Driver Setup window, enter the database name we created (in this case SQLData1). The server name is localhost, and username and password match what’s in use with the MarkLogic server and port (5432). Test and save the connection.
Retrieve data from MarkLogic (View) & Load in Oracle
I next need to create an IRI data definition file (DDF) to make use of the MarkLogic data in each view. To do this in the IRI Workbench GUI for Voracity (or other IRI products using DDF), I will use the Import Metadata Wizard. First, I create a New IRI Project in the Workbench Project Explorer to hold my work:
Next, from the IRI Menu=> Select Import Table metadata. Select the Data Source Name (DSN) “MarkLogicSQL” and the table “main.emps”:
The resulting DDF file is shown below; note that my connection to MarkLogic must remain open while I’m interfacing with it:
This DDF is now available for use in any IRI job script sourcing this table. I will use it in my sort and mask application below.
From the CoSort toolbar menu (stopwatch icon), select New Sort Job. After naming my job script, I am taken to the data source specification. I locate my ODBC source for the MarkLogic table, and then select Add Existing Metadata to provide the necessary field layouts for the CoSort program. Voracity uses SortCL to manipulate, mask, and report on MarkLogic and other ODBC and file-based data sources.
I can then specify one or more sort keys:
In the next screen I define and format my target(s), where I also specified a redaction rule to mask sensitive portions of the column values on output:
I also redacted the FIRSTNAME and LASTNAME column values with the replace_chars(FIRSTNAME, “*”). Protection rule. See this video on how to use IRI Workbench dialogs (or wizards) to redact data and otherwise mask sensitive data in your target fields.
The code and the output produced in IRI Workbench
The job produced in the wizard connects to MarkLogic, sorts and masks the data in the main.emps view, and sends the output to both an Oracle DB and flat-file (standard output) target, both shown below:
See this article if you are interested in loading bulk (test) data files generated by Voracity (or IRI RowGen) into MarkLogic. If your data in MarkLogic is unstructured and you need to search, extract or mask its XML (or JSON) exports, you’d use DarkShield (also in Voracity) per this article. If you have any questions or need help using MarkLogic as a Voracity data source, contact voracity@iri.com.