
Architecture Overview

Data-Genie is built on three core pillars: Readers, Transformers, and Writers, all connected by Node.js streams.

The Pipeline Flow
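
Records flow from a Reader, through zero or more Transformers, into a Writer. A minimal end-to-end sketch of that flow; the import path, constructor shapes, and the run() entry point here are assumptions, not the verbatim API:

```ts
import { Job, DataReader, FileSource, SQLWriter } from "data-genie"; // hypothetical import path

// Reader -> (Transformers) -> Writer, connected by Node.js streams.
const reader = new DataReader(new FileSource("./orders.csv")); // constructor shape assumed
const writer = new SQLWriter({ table: "orders" });             // options shape assumed

await new Job(reader, writer).run(); // the Job drives the loop; run() is assumed
```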

Core Components

1. DataReader

Responsible for fetching data from a source and parsing it into a stream of JavaScript objects (DataRecord). The built-in sources are listed below; a construction sketch follows the list.

  • FileSource: Reads from local disk.
  • S3Source: Streams directly from AWS S3 without downloading the whole file.
  • HttpSource: Fetches and streams from REST APIs.
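
A sketch constructing each of the three sources; the option names and constructor shapes are assumptions:

```ts
import { FileSource, S3Source, HttpSource } from "data-genie"; // hypothetical exports

const local  = new FileSource("./data/users.jsonl");                            // local disk
const bucket = new S3Source({ bucket: "my-bucket", key: "exports/users.csv" }); // streamed from S3, never fully downloaded
const api    = new HttpSource("https://api.example.com/users");                 // streamed from a REST endpoint
```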

2. Transformers & Filters

Transformers and filters act as middleware in the pipeline: each receives a record, modifies it (or discards it), and passes it along. They wrap an upstream reader and can be chained, as sketched after the list.

  • TransformingReader: Applies mappings, renames, or calculated fields.
  • FilteringReader: Keeps only records that match specific rules.
  • SchemaValidatingReader: Ensures records match a Zod schema.
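
A sketch chaining the three wrappers over a file-backed reader; the decorator-style constructor shapes are assumed:

```ts
import { z } from "zod";
import {
  DataReader,
  FileSource,
  TransformingReader,
  FilteringReader,
  SchemaValidatingReader,
} from "data-genie"; // hypothetical exports; constructor shapes assumed

const User = z.object({ id: z.number(), email: z.string().email(), active: z.boolean() });

// Each wrapper decorates the reader beneath it, middleware-style.
const source    = new DataReader(new FileSource("./users.jsonl"));
const validated = new SchemaValidatingReader(source, User);        // enforce the Zod schema
const active    = new FilteringReader(validated, (u) => u.active); // keep matching records
const shaped    = new TransformingReader(active, (u) => ({ userId: u.id, email: u.email })); // map/rename fields
```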

3. Resilience & Dead Letter Queues (DLQ)

Data-Genie is designed for "dirty" real-world data. Components like ValidatingReader support a Dead Letter Queue: instead of throwing an error and stopping the job, invalid records are diverted to a separate DataWriter (such as a JSON file or a secondary table), letting the main pipeline continue uninterrupted, as sketched below.
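
A sketch of wiring a DLQ onto a validating reader; the JsonWriter signature and the deadLetterWriter option name are assumptions:

```ts
import { z } from "zod";
import { DataReader, FileSource, SchemaValidatingReader, JsonWriter } from "data-genie"; // hypothetical exports

const User = z.object({ id: z.number(), email: z.string().email() });

// Invalid records are diverted to the DLQ writer; valid ones flow on.
const dlq    = new JsonWriter("./rejected.json"); // writer signature assumed
const source = new DataReader(new FileSource("./users.jsonl"));
const safe   = new SchemaValidatingReader(source, User, {
  deadLetterWriter: dlq, // option name is an assumption; check the API reference
});
```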

4. DataWriter

Takes the final records and persists them to a destination. The writers below can be combined, as sketched after the list.

  • JSON/CSV Writers: Write to disk.
  • SQLWriter: Bulk-inserts into databases.
  • CallbackWriter: Executes custom code for every record (e.g., pushing to Kafka).
  • MultiWriter: Fans data out to multiple destinations simultaneously.
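
A fan-out sketch combining several writers; kafkaProducer is a hypothetical stand-in for a real Kafka client, and all constructor shapes are assumed:

```ts
import { CsvWriter, SQLWriter, CallbackWriter, MultiWriter } from "data-genie"; // hypothetical exports

declare const kafkaProducer: { send(msg: object): Promise<void> }; // stand-in for a real Kafka client

// Fan each record out to disk, a database, and Kafka simultaneously.
const writer = new MultiWriter([
  new CsvWriter("./out/records.csv"),  // path signature assumed
  new SQLWriter({ table: "records" }), // options shape assumed
  new CallbackWriter(async (record) => {
    await kafkaProducer.send({ topic: "records", value: record }); // custom per-record code
  }),
]);
```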

5. Job Orchestrator

The Job class binds everything together. It handles the iteration loop, tracks performance metrics, and provides observability through an event-driven API; a usage sketch follows the list.

  • Event-Driven: Emits lifecycle events (start, progress, record, error, complete).
  • Progress Tracking: Built-in support for TTY progress bars.
  • Dry Runs: Job.preview() allows inspecting data without writing it.
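
A sketch of the event API and a dry run, using the event names listed above; the constructor, run(), and payload shapes are assumptions:

```ts
import { Job, DataReader, FileSource, JsonWriter } from "data-genie"; // hypothetical exports

const job = new Job(new DataReader(new FileSource("./in.jsonl")), new JsonWriter("./out.json")); // constructor assumed

// Lifecycle events named in the docs: start, progress, record, error, complete.
job.on("start", () => console.log("job started"));
job.on("progress", (p) => console.log(`processed so far: ${p.count}`)); // payload shape assumed
job.on("error", (err) => console.error("record failed:", err));
job.on("complete", (stats) => console.log("done", stats));

// Dry run: Job.preview() inspects data without writing it (return shape assumed).
console.table(await job.preview());

await job.run(); // run() entry point assumed
```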

The "Constant Memory" Secret

Unlike code that loads an entire file with fs.readFileSync or parses it in one go with JSON.parse, Data-Genie iterates with Async Iterators, so only a handful of records exist in memory at any given time. As soon as a record has been written to the sink, nothing references it and it can be garbage-collected, which lets you process arbitrarily large, even unbounded, streams in constant memory.
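
A plain-Node illustration of the same pattern, independent of Data-Genie, using only node:fs and node:readline (the JSONL file name is hypothetical):

```ts
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// An async iterator pulls one line at a time, so memory use stays flat
// regardless of file size.
async function* records(path: string) {
  const lines = createInterface({ input: createReadStream(path), crlfDelay: Infinity });
  for await (const line of lines) {
    if (line.trim()) yield JSON.parse(line); // one record in memory at a time
  }
}

for await (const record of records("./huge.jsonl")) {
  process.stdout.write(JSON.stringify(record) + "\n"); // stand-in for any sink; the record is GC-able after this
}
```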
