JSON and the Truth it Captures

by Paul Friedland

“Have you stopped speeding?” You could probably object to a leading question like this in court, but what happens when an important question with only a yes or no answer is solicited on a mandatory form, and the response becomes part of an actionable database record?

Not So Funny

Genuine questions are usually offered with more than a “yes” or “no” option and some even have room for free-form text responses. As leading questions go, no reply would be best. But whether innocently posed or not, a yes or no answer gives validity to the question. And someday, the context could be forgotten and only the answer will be remembered.

Information collected about people usually winds up being processed by computers, where the stored data can be searched and analyzed. When the subject is narrow, the captured data can be limited. For statistical analysis (counts, averages, trends), personalized information is not needed and analysts can make decisions based on computed aggregates.

When topics are broader and subjects are more complex, however (say, entries on a job or loan application, or in a match-making situation), you do not want to rely on a limited algorithm to approve or disapprove someone.

This article identifies a fast and flexible data and processing system for handling this kind of data and making better use of it. It combines industry-standard, semi-structured JSON data, which is common in representing form entries, with IRI’s proven big data manipulation engine, which can now process JSON and help find the truth.

Why JSON?

JavaScript Object Notation (JSON) is a text format that describes data. It can be generated from Java, C, COBOL, etc. The data are self-documenting and human-readable, which is very important for updates and corrections.

Data are carried in pairs (name : value). Structures (parent-child relationships) are expressed with curly braces for objects, square brackets for arrays, and commas between items. The format is defined recursively so that very complex tree-like relationships are possible.

JSON data is in the following forms:

numeric – integers and decimals used in expression evaluation, sums, averages, etc.
logical – true, false, null … for decision making; e.g., we might see “Speeding” : null
string – UTF-8 characters used in searches, mapping, masking, reporting, etc.
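
As a rough illustration of these three forms, the following Python sketch parses a made-up form entry with the standard json module. The field names and values (Name, Age, Fine, Licensed, Speeding) are hypothetical and do not come from any real form or IRI example.

```python
import json

# Hypothetical form entry; the field names are invented for illustration.
raw = '{"Name": "Pat", "Age": 42, "Fine": 12.5, "Licensed": true, "Speeding": null}'

record = json.loads(raw)

# Python's json module maps each JSON form to a native type:
# numbers -> int/float, true/false -> bool, null -> None, strings -> str.
for name, value in record.items():
    print(f"{name}: {value!r} ({type(value).__name__})")
```

Note that "Speeding" : null survives parsing as a distinct value (None in Python) rather than being forced into a yes or a no.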

The metadata inherently associated with the incoming data indicate what may be present. And because JSON data are self-describing, they need not appear in a specific order from record to record. In fact, some items need not appear at all.

Metadata only describe what might be present in the current execution, and there is no penalty for defining data that may never appear in any record. But the fact that an item does not appear in the current record can be tested and acted upon.
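
The idea of testing for an item that is not there can be sketched in Python as follows. This is only an illustration of the concept, not how SortCL implements it; the function name speeding_status, the status labels, and the sample records are all assumptions made for the example.

```python
import json

def speeding_status(record):
    """Classify the hypothetical "Speeding" item in one parsed record."""
    if "Speeding" not in record:
        return "absent"        # the key never appeared in this record
    if record["Speeding"] is None:
        return "unanswered"    # the key appeared with a JSON null
    return "yes" if record["Speeding"] else "no"

# Three made-up records: answered, explicitly null, and missing the item.
records = [json.loads(line) for line in (
    '{"Name": "Pat", "Speeding": false}',
    '{"Name": "Lee", "Speeding": null}',
    '{"Name": "Sam"}',
)]

statuses = [speeding_status(r) for r in records]
print(statuses)
```

Keeping "absent" and "unanswered" apart from "no" is the point: a missing answer stays a missing answer instead of being read as a denial.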

Finer Points

The creation of JSON data is not constrained to specific content. JSON-producing applications can generate different content at different times and feed the same receiver. Answers on a form can be empty. Or there could be multiple entries (arrays) in one record and only one in another. Some records can have special comments.
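
A receiver that accepts such varying records has to tolerate an item being missing, single, or repeated. The Python sketch below shows one possible way to do that; the Employer and Comment items and the helper as_list are hypothetical names chosen for the example.

```python
import json

def as_list(value):
    """Return a list whether the item was missing, a single value, or an array."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

# Made-up records: an array, a single entry, and no entry at all.
lines = [
    '{"Name": "Pat", "Employer": ["Acme", "Globex"], "Comment": "Contract work"}',
    '{"Name": "Lee", "Employer": "Initech"}',
    '{"Name": "Sam"}',
]

employers = {}
for line in lines:
    rec = json.loads(line)
    # dict.get returns None when the key is absent, which as_list maps to [].
    employers[rec["Name"]] = as_list(rec.get("Employer"))

print(employers)
```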

Metadata from JSON sources can be used by the SortCL data manipulation program in the IRI CoSort product or IRI Voracity platform to evaluate data in the current record, determine what is missing, or map items to different outputs on a conditional basis.

SortCL can also perform other functions on JSON data, including transformation, cleansing, anonymization, and calculation. It can also create actionable information in custom reports, or hand off prepared data subsets for analysis and display in other software.

What does this latter capability mean to the person who never started speeding in the first place, but had to put ‘no’ on the fixed form? It means an opportunity to respond correctly to the questioner. By way of example in SortCL, we might have a JSON key-value pair called Admission, represented as a conditional field that displays an action to take based on a pre-defined test called Speeding:

/FIELD = (Admission, IF Speeding THEN "Never" ELSE "Call the Cops")
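
For readers who do not use SortCL, the shape of that conditional field can be approximated in Python. This is a loose analogy and not a description of SortCL semantics. The article does not say what the Speeding test checks, so the predicate declined_to_answer below is a hypothetical stand-in, as are the sample records.

```python
def admission(record, speeding_test):
    """Mirror IF Speeding THEN "Never" ELSE "Call the Cops" for one record."""
    return "Never" if speeding_test(record) else "Call the Cops"

# Hypothetical stand-in for the pre-defined Speeding test: it passes when the
# "Speeding" item is absent or null, i.e. the yes/no framing was not accepted.
def declined_to_answer(record):
    return record.get("Speeding") is None

first = admission({"Name": "Pat", "Speeding": None}, declined_to_answer)
second = admission({"Name": "Lee", "Speeding": True}, declined_to_answer)
print(first)
print(second)
```

Passing the test in as a function keeps the field definition separate from the rule it depends on, so the rule can change without touching the output mapping.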

Naming fields also allows you to join someone’s original record with a key value in common in another source to create or update that person’s profile in a results table. SortCL users can also capture changes, further modify the records, generalize quasi-identifying traits for research, etc.
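
Joining on a common key value can be pictured with a small Python sketch. It performs a simple inner join in memory and is not meant to represent SortCL's join syntax or performance; the key name ApplicantID and both record sets are invented for the example.

```python
# Hypothetical form entries and a second source sharing the ApplicantID key.
applications = [
    {"ApplicantID": 101, "Name": "Pat", "Speeding": None},
    {"ApplicantID": 102, "Name": "Lee", "Speeding": False},
]
profiles = [
    {"ApplicantID": 101, "Region": "East"},
    {"ApplicantID": 103, "Region": "West"},
]

# Index the second source by key so each lookup is a dictionary access.
profile_by_id = {p["ApplicantID"]: p for p in profiles}

# Inner join: keep only applications whose key also appears in profiles.
joined = [
    {**app, **profile_by_id[app["ApplicantID"]]}
    for app in applications
    if app["ApplicantID"] in profile_by_id
]

print(joined)
```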

IoT data collectors and processors should also take note; they can join and otherwise transform and map JSON and non-JSON records in SortCL for aggregation on the edge or data mashups.

More on IRI, SortCL & JSON

Founded in 1978, IRI develops and supports a wide range of fast data manipulation and management software for use cases ranging from ETL and analytics to data quality and de-identification. IRI’s mainline data processing product, CoSort, transforms huge files and tables as part of DW ETL and data wrangling operations, DB-load pre-sorts, and legacy sort and data migration jobs.

JSON sources and targets are supported in the CoSort version 10 SortCL program, and the larger IRI Voracity platform for data discovery, integration, migration, governance and analytics. SortCL jobs all share metadata descriptions with syntax and semantics for data definition and manipulation.

More specifically, SortCL supports these JSON data handling features:

finding and extracting keys and values from unstructured text
filtering, sorting, joining, aggregating, and other transformations
expression evaluation (numeric and logical) across and down for BI
data validation, de-duplication, filtering, and templating for data quality
masking (encryption, redaction, pseudonymization, etc.) for privacy law compliance
data type, record layout, and file-format conversion for data and DB migration needs
JSON test data (file) generation for DevOps and NoSQL DB prototyping

For ETL and data migration in particular, you can use SortCL to move data between JSON and:

ODBC
CSV
LDIF
COBOL
XML and others

As a result:

structured data can be read and converted to JSON; and/or,
JSON data can be read and recast in other file formats.
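
To make the two directions concrete, here is a minimal Python sketch that converts a small CSV sample to JSON and back using only the standard library. It illustrates the general idea of format conversion and says nothing about how SortCL does it; the Name and Age columns are made up.

```python
import csv
import io
import json

csv_text = "Name,Age\nPat,42\nLee,37\n"

# Structured (CSV) to JSON: CSV values arrive as strings, so Age is converted
# to an integer before serializing.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    row["Age"] = int(row["Age"])
as_json = json.dumps(rows)

# JSON back to CSV, writing the same columns in the same order.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Name", "Age"], lineterminator="\n")
writer.writeheader()
writer.writerows(json.loads(as_json))
round_trip = out.getvalue()

print(as_json)
print(round_trip)
```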

Conclusion

JSON and SortCL are a powerful combination for transforming pure data into information and knowledge. SortCL directly processes JSON data for production ETL and analytics, and can extract and govern it for informational value, and to improve data quality, privacy and lineage.

Support for JSON is available in all IRI products using the SortCL executable released with CoSort v10, including FieldShield, NextForm, RowGen, and Voracity.
