Challenges
Even after consulting and tuning, big data volumes (i.e., more than one million rows) can be slow to transform in Talend, particularly without an expensive hardware or version upgrade.
The usual big data bottlenecks are large sorts, joins, aggregations, loads, and sometimes unloads. Parallelization or optimization in other layers or tools can be unwieldy, if not expensive, and may adversely affect performance for other users.
From a security standpoint, Talend may not provide the data discovery, classification, masking, or test data capabilities that data governance officials and application developers need. Finally, cost and complexity can grow as you add more users and modules, respectively.
Solutions
Speed Talend Transforms
Speed sorts, joins, and aggregations for Talend ETL flows using a tSystem call to CoSort Sort Control Language (SortCL) programs. Run large data transformations multiple times faster, without crashing for lack of memory, and without encumbering other jobs in Talend, your DB, or BI tool. You can also specify file-format and data-type conversions, field-level masking and cleansing functions, custom report layouts, and pre-sorted load files.
See this article for benchmarks and advice around improved performance via tSystem calls available to CoSort or Voracity users. Combine transformation and other big data staging jobs, and run them in much less time and memory.
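As a rough illustration of what such an external call amounts to, the sketch below builds and runs a SortCL command line the way a tSystem component would shell out to it. It is a minimal sketch under stated assumptions: the `/SPECIFICATION=` argument form should be checked against your CoSort documentation, the script name `sort_orders.scl` is hypothetical, and the helper names are illustrative only and not part of any Talend or IRI API.

```python
import shutil
import subprocess


def build_sortcl_command(script_path, executable="sortcl"):
    """Return the argument list for one SortCL job.

    Assumption: SortCL accepts its job script as /SPECIFICATION=<file>.
    """
    return [executable, f"/SPECIFICATION={script_path}"]


def run_sortcl(script_path, executable="sortcl"):
    """Run the job and return its exit code, as a tSystem call would."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} was not found on PATH")
    completed = subprocess.run(
        build_sortcl_command(script_path, executable),
        capture_output=True,  # keep stdout/stderr for the calling job's log
        text=True,
        check=False,          # let the caller decide how to treat failures
    )
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical job script name; print the command rather than run it.
    print(" ".join(build_sortcl_command("sort_orders.scl")))
```

In a Talend job, the same command string would go into the tSystem component, with the exit code deciding whether downstream components run.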
Mask PII in Talend
Data at rest in tables and flat files within Talend can be sensitive, containing personally identifiable information that is subject to confidentiality restrictions and data privacy laws. Both IRI Voracity and IRI FieldShield can find and protect PII in structured data sets in any ODBC-connected database or flat-file format.
Your business rules dictate the masking function you use to de-ID each column; e.g., format-preserving AES-256, FIPS-compliant OpenSSL, 3DES, and/or GPG encryption, lookup-value substitution (pseudonymization), string shifting, blurring, hashing, redaction, custom expression logic, or user-defined field functions. You can also score re-ID risk.
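To make the idea of per-column masking rules concrete, here is a generic sketch of two of the simpler function types named above, redaction and hashing, applied by column. It does not show how FieldShield or Voracity implement or configure these functions; the column names, the salt, and the helper names are hypothetical.

```python
import hashlib


def redact(value, keep_last=4, mask_char="*"):
    """Replace all but the last keep_last characters with mask_char."""
    if keep_last <= 0:
        return mask_char * len(value)
    visible = value[-keep_last:]
    return mask_char * (len(value) - len(visible)) + visible


def hash_value(value, salt):
    """Return a salted SHA-256 hex digest (one-way, deterministic)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask_row(row, rules):
    """Apply the rule mapped to each column; pass other columns through."""
    return {col: rules[col](val) if col in rules else val
            for col, val in row.items()}


if __name__ == "__main__":
    # Hypothetical business rules: one masking function per sensitive column.
    rules = {
        "ssn": redact,
        "email": lambda v: hash_value(v, salt="demo-salt"),
    }
    row = {"id": "17", "ssn": "123-45-6789", "email": "ana@example.com"}
    print(mask_row(row, rules))
```

Keeping the rules in a single column-to-function mapping means the same rule is applied wherever that column appears, which is what preserves consistency across tables.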
Build Talend Test Data
IRI Voracity, through its constituent (or standalone) IRI RowGen software product, generates safe, realistic test data using COBOL or CoSort metadata and any RDB data models connected through JDBC. Use RowGen to create compliant, realistic test data from random generation and/or set-file selection, and customize it even further with built-in data manipulation and formatting functionality.
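The difference between random generation and set-file selection can be sketched generically as follows. This is not RowGen syntax or behavior; it only illustrates the two sources of values, assuming a set file is a plain list of allowed values, one per line. The field names and helper names are hypothetical.

```python
import random


def load_set_file(lines):
    """Return the non-empty, stripped values from set-file lines."""
    return [line.strip() for line in lines if line.strip()]


def generate_rows(count, first_names, seed=None):
    """Build test rows: names picked from the set, balances randomly generated."""
    rng = random.Random(seed)  # a fixed seed makes the test data repeatable
    rows = []
    for row_id in range(1, count + 1):
        rows.append({
            "id": row_id,
            "first_name": rng.choice(first_names),       # set-file selection
            "balance": round(rng.uniform(0, 10000), 2),   # random generation
        })
    return rows


if __name__ == "__main__":
    names = load_set_file(["Ana\n", "Ben\n", "Chen\n"])
    for row in generate_rows(3, names, seed=42):
        print(row)
```

Values drawn from a set file look realistic because they come from a curated list, while randomly generated values contain no production data at all.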
Re-Platform Talend Jobs
Automatically convert mappings in Talend to faster, simpler, and less expensive ETL operations in IRI Voracity using erwin (formerly AnalytiX DS) Mapping Manager or Code-Automation Frameworks (CATfx). This proven technology, along with erwin Lite Speed Conversion services, routinely gives ETL architects and the CIO/CFO suite the ability to save tens or hundreds of thousands of dollars per year.