

DataZen Hybrid Data Integration and Ingestion Pipelines - Overview

March 2022 DataZen




DataZen allows you to integrate all your data in a single platform that you can host on your own virtual machines or in the cloud. With DataZen, you can quickly connect to hundreds of applications, HTTP REST API endpoints, CRM/ERP/HR platforms, relational databases, flat files (including Parquet files), and messaging hubs without complex design surfaces. DataZen also makes it easy to publish data to cloud data lakes, including AWS S3 and Azure ADLS / Blob storage, so you can seamlessly leverage your existing cloud-based ETL platforms.

From Complex ETL to Dumb Pipe, Smart Endpoint



Modern integration architectures favor a pattern called "Dumb Pipe, Smart Endpoint", borrowed from microservice architecture, in which the "Transformation" of the data happens at the target system. In the data integration world this is known as ELT (Extract, Load, Transform), as opposed to ETL (Extract, Transform, Load). The ELT pattern is favored because traditional ETL workflows are significantly more complex to maintain, and modifying the data as part of the ingestion process makes it less reusable. For example, SQL Server Integration Services (SSIS) packages and BizTalk Orchestrations are notoriously hard to maintain and require highly specialized skills whenever the data flows need to change.
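To make the contrast concrete, here is a minimal sketch of the two patterns in Python. The helper names (extract_rows, load_rows) and the sample transformation are hypothetical stand-ins for any source and target; this is not DataZen code.

```python
# Hypothetical helpers: extract_rows() reads from any source,
# load_rows() writes to any target.

def etl(extract_rows, load_rows):
    # ETL: the pipe transforms data in flight, so the raw values are
    # lost and the business logic lives inside the pipeline itself.
    rows = extract_rows()
    transformed = [{**r, "full_name": f"{r['first_name']} {r['last_name']}"}
                   for r in rows]
    load_rows(transformed)

def elt(extract_rows, load_rows):
    # ELT ("Dumb Pipe, Smart Endpoint"): the pipe only moves raw data.
    load_rows(extract_rows())

# With ELT, the transformation lives at the smart endpoint instead,
# for example as a SQL view over the raw landing table:
TARGET_VIEW = """
CREATE VIEW customers_clean AS
SELECT first_name || ' ' || last_name AS full_name, *
FROM customers_raw;
"""
```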



DataZen offers a simple yet powerful data ingestion pipeline that follows the Dumb Pipe, Smart Endpoint architecture pattern. The data ingestion pipeline first reads data from any source system, including any database platform, ODBC driver, HTTP/S REST API endpoint, flat file, Parquet file, or individual messages from virtually any messaging platform (Kafka, MSMQ, RabbitMQ, EventHub...).

DataZen's ingestion process provides data aggregation, change filtering, enrichment, and partitioning before the data is sent to its target.





DataZen Data Ingestion Pipeline Overview

Data Ingestion Pipeline Capabilities



Let's take a deeper look at DataZen's ingestion engine.





DataZen Ingestion Pipeline

Source

DataZen can ingest data from virtually any source system, including HTTP/S REST APIs, relational databases, ODBC drivers, flat files (JSON/XML/CSV and Parquet) stored on FTP sites, UNC paths, Azure ADLS, or AWS S3 buckets, and messaging platforms (including Azure Event Hubs, Azure Service Bus, AWS SQS, RabbitMQ, MSMQ, Kafka, and Google Pub/Sub). Because DataZen also supports Enzo Server and ODBC drivers, it can ingest data from hundreds of data sources.

HTTP/S REST API data sources support various authentication schemes, including OAuth 2.0 with the automatic refresh token renewal that many modern APIs require. DataZen works with virtually any HTTP REST API service, including Twitter, YouTube, Facebook, LinkedIn, SEMRush, Ceridian Dayforce, MS Dynamics, SalesForce, and many others. In addition, DataZen supports the various paging methods that some HTTP REST APIs require.
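To illustrate what the engine handles for you, here is a minimal Python sketch of OAuth 2.0 refresh-token renewal and offset-based paging. The token endpoint, parameter names, and response fields are assumptions made for the example; real APIs vary, and DataZen performs these steps through configuration rather than code.

```python
import requests

TOKEN_URL = "https://example.com/oauth/token"  # hypothetical endpoint

def refresh_access_token(client_id, client_secret, refresh_token):
    # Standard OAuth 2.0 refresh_token grant (RFC 6749, section 6).
    resp = requests.post(TOKEN_URL, data={
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    })
    resp.raise_for_status()
    return resp.json()["access_token"]

def fetch_all_pages(url, access_token, page_size=100):
    # Offset-based paging; other APIs use cursors or "next" links.
    rows, offset = [], 0
    while True:
        resp = requests.get(
            url,
            headers={"Authorization": f"Bearer {access_token}"},
            params={"limit": page_size, "offset": offset})
        resp.raise_for_status()
        batch = resp.json()["items"]  # response field varies by API
        if not batch:
            return rows
        rows.extend(batch)
        offset += page_size
```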

Aggregation

For most data sources, DataZen offers an aggregation mechanism. For example, you may need to ingest 100 Parquet files in a single operation (for instance, using a list of files matched by a wildcard search), or call an HTTP/S endpoint in a loop for 50 individual IDs or account numbers and treat the output as a single data set before proceeding to the next phase of the ingestion engine. When the source is a queuing system, you can group the individual messages received within a time window and send the payload as one data set.
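The wildcard-based Parquet case could be sketched as follows. This is illustrative Python using pyarrow; the path pattern is hypothetical, and DataZen performs this step internally without code.

```python
import glob
import pyarrow as pa
import pyarrow.parquet as pq

def read_parquet_set(pattern="landing/orders_*.parquet"):
    # Read every file matched by the wildcard; assumes a shared schema.
    tables = [pq.read_table(path) for path in sorted(glob.glob(pattern))]
    # Combine into one table, treated as a single data set downstream.
    return pa.concat_tables(tables)
```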

Change Capture

The Change Capture layer of the ingestion engine allows you to identify which records should be ignored. Two options are available and can be used jointly: High Watermark tracking and automated Change Data Capture (CDC). A High Watermark value is a pointer (such as a date) that is used as the starting point for detecting changes in a source system, including HTTP/S REST API endpoints.

In addition, DataZen can detect changes through an internal differential engine that automatically identifies which records were created, modified, or deleted, regardless of the source system. This allows DataZen to implement a CDC mechanism transparently on any source system, even one that doesn't natively provide a change capture mechanism.
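The two options can be outlined as follows. This is an illustrative Python sketch, not DataZen's implementation; the row-hash comparison stands in for the internal differential engine.

```python
import hashlib
import json

def read_since(rows, watermark_field, watermark):
    # High Watermark: only keep rows past the stored pointer,
    # e.g. a last-modified date remembered from the previous run.
    return [r for r in rows if r[watermark_field] > watermark]

def diff_changes(previous, current, key_field):
    # Differential CDC: fingerprint every row and compare against the
    # last captured snapshot, independent of source-system support.
    def fingerprint(r):
        return hashlib.sha256(
            json.dumps(r, sort_keys=True, default=str).encode()).hexdigest()
    prev = {r[key_field]: fingerprint(r) for r in previous}
    curr = {r[key_field]: fingerprint(r) for r in current}
    inserted = [k for k in curr if k not in prev]
    deleted = [k for k in prev if k not in curr]
    updated = [k for k in curr if k in prev and curr[k] != prev[k]]
    return inserted, updated, deleted
```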

Enrichment

Enrichment is a form of "Transformation" meant to modify the selected data only when absolutely necessary, or to add calculated fields that are not part of the source system but are considered essential to the output. This phase also includes the ability to automatically flatten JSON/XML documents into rows and columns, perform HTTP/S lookups, and hash or mask sensitive data for security reasons before it is saved.
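A minimal Python sketch of these enrichment steps follows; the field names and the SHA-256 hashing choice are assumptions made for the example.

```python
import hashlib

def flatten(doc, prefix=""):
    # Recursively turn nested objects into dotted column names.
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

def enrich(doc):
    row = flatten(doc)
    # Calculated field not present in the source system.
    row["full_name"] = f"{row['name.first']} {row['name.last']}"
    # Hash the sensitive value before it is saved anywhere.
    row["ssn"] = hashlib.sha256(row["ssn"].encode()).hexdigest()
    return row

print(enrich({"name": {"first": "Ada", "last": "Lovelace"},
              "ssn": "000-00-0000"}))
```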

After the enrichment phase, DataZen offers the ability to keep its change logs so they can be replayed at any time or sent to another DataZen server anywhere. This allows you to fully decouple the source system from the target platforms and forward changes to any number of targets; the change logs can also be encrypted and distributed anywhere.

Partitioning

In some cases, the data should be partitioned using a desired distribution pattern before it is sent to the target system. Partitioning can be based on a hash, a key value, or a value calculated dynamically from the content of the data itself. For example, you may decide to partition the data by Country Code so that each file created contains information specific to a single country, or create one database table per Customer ID.
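The content-driven case can be sketched as follows (illustrative Python; the country_code key and output paths are made up for the example):

```python
import csv
import os
from collections import defaultdict

def partition_by(rows, key="country_code"):
    # The partition key is evaluated against the content of each row.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return partitions

def write_partitions(partitions, fieldnames):
    os.makedirs("out", exist_ok=True)
    for country, rows in partitions.items():
        # One file per partition value, e.g. out/orders_US.csv
        with open(f"out/orders_{country}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
```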

Target

The target system can be virtually any platform, including all those identified in the Source section. Because DataZen manages the schema of the data automatically, you can send the output to any target system regardless of where the data came from. For example, you can easily load Parquet files into a database table, or vice versa. You could just as easily configure DataZen to automatically detect changes in a database table and send corresponding messages to a queue.
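For instance, landing rows read from a Parquet file in a database table could look like the sketch below (illustrative Python using pyarrow and SQLite for portability; the file path is hypothetical, and DataZen handles the schema mapping through configuration):

```python
import sqlite3
import pyarrow.parquet as pq

# Read the source Parquet file (hypothetical path).
table = pq.read_table("landing/orders.parquet")
rows = table.to_pylist()

# Create the target table from the Parquet column names and insert.
conn = sqlite3.connect("target.db")
cols = ", ".join(table.column_names)
marks = ", ".join("?" for _ in table.column_names)
conn.execute(f"CREATE TABLE IF NOT EXISTS orders ({cols})")
conn.executemany(
    f"INSERT INTO orders ({cols}) VALUES ({marks})",
    [tuple(r[c] for c in table.column_names) for r in rows])
conn.commit()
```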



Conclusion

DataZen offers an easy-to-use hybrid data ingestion engine with a flexible, code-free architecture that follows the "Dumb Pipe, Smart Endpoint" principle. It supports hundreds of source and target systems, including HTTP REST API endpoints, CRM/ERP/HR platforms, relational databases, flat files (including Parquet files), and messaging hubs.









