
Architecture Overview

Data-Genie is built on three core pillars: Readers, Transformers, and Writers, all connected by Node.js streams.

The Pipeline Flow
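
Records flow from a Reader, through zero or more Transformers, into a Writer. A minimal end-to-end sketch of that flow; the import path, constructor shapes, and the run() entry point here are assumptions, not the verbatim API:

```ts
import { Job, DataReader, FileSource, SQLWriter } from "data-genie"; // hypothetical import path

// Reader -> (Transformers) -> Writer, connected by Node.js streams.
const reader = new DataReader(new FileSource("./orders.csv")); // constructor shape assumed
const writer = new SQLWriter({ table: "orders" });             // options shape assumed

await new Job(reader, writer).run(); // the Job drives the loop; run() is assumed
```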

Core Components

1. DataReader

Responsible for fetching data from a source and parsing it into a stream of JavaScript objects (DataRecord). The built-in sources are listed below; a construction sketch follows the list.

  • FileSource: Reads from local disk.
  • S3Source: Streams directly from AWS S3 without downloading the whole file.
  • HttpSource: Fetches and streams from REST APIs.
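
A sketch constructing each of the three sources; the option names and constructor shapes are assumptions:

```ts
import { FileSource, S3Source, HttpSource } from "data-genie"; // hypothetical exports

const local  = new FileSource("./data/users.jsonl");                            // local disk
const bucket = new S3Source({ bucket: "my-bucket", key: "exports/users.csv" }); // streamed from S3, never fully downloaded
const api    = new HttpSource("https://api.example.com/users");                 // streamed from a REST endpoint
```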

2. Transformers & Filters

Transformers and filters act as middleware in the pipeline: each receives a record, modifies it (or discards it), and passes it along. They wrap an upstream reader and can be chained, as sketched after the list.

  • TransformingReader: Applies mappings, renames, or calculated fields.
  • FilteringReader: Keeps only records that match specific rules.
  • SchemaValidatingReader: Ensures records match a Zod schema.
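
A sketch chaining the three wrappers over a file-backed reader; the decorator-style constructor shapes are assumed:

```ts
import { z } from "zod";
import {
  DataReader,
  FileSource,
  TransformingReader,
  FilteringReader,
  SchemaValidatingReader,
} from "data-genie"; // hypothetical exports; constructor shapes assumed

const User = z.object({ id: z.number(), email: z.string().email(), active: z.boolean() });

// Each wrapper decorates the reader beneath it, middleware-style.
const source    = new DataReader(new FileSource("./users.jsonl"));
const validated = new SchemaValidatingReader(source, User);        // enforce the Zod schema
const active    = new FilteringReader(validated, (u) => u.active); // keep matching records
const shaped    = new TransformingReader(active, (u) => ({ userId: u.id, email: u.email })); // map/rename fields
```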

3. Resilience & Dead Letter Queues (DLQ)

Data-Genie is designed for "dirty" real-world data. Components like ValidatingReader support a Dead Letter Queue: instead of throwing an error and stopping the job, invalid records are diverted to a separate DataWriter (such as a JSON file or a secondary table), letting the main pipeline continue uninterrupted, as sketched below.
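
A sketch of wiring a DLQ onto a validating reader; the JsonWriter signature and the deadLetterWriter option name are assumptions:

```ts
import { z } from "zod";
import { DataReader, FileSource, SchemaValidatingReader, JsonWriter } from "data-genie"; // hypothetical exports

const User = z.object({ id: z.number(), email: z.string().email() });

// Invalid records are diverted to the DLQ writer; valid ones flow on.
const dlq    = new JsonWriter("./rejected.json"); // writer signature assumed
const source = new DataReader(new FileSource("./users.jsonl"));
const safe   = new SchemaValidatingReader(source, User, {
  deadLetterWriter: dlq, // option name is an assumption; check the API reference
});
```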

4. DataWriter

Takes the final records and persists them to a destination. The writers below can be combined, as sketched after the list.

  • JSON/CSV Writers: Write to disk.
  • SQLWriter: Bulk-inserts into databases.
  • CallbackWriter: Executes custom code for every record (e.g., pushing to Kafka).
  • MultiWriter: Fans data out to multiple destinations simultaneously.
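
A fan-out sketch combining several writers; kafkaProducer is a hypothetical stand-in for a real Kafka client, and all constructor shapes are assumed:

```ts
import { CsvWriter, SQLWriter, CallbackWriter, MultiWriter } from "data-genie"; // hypothetical exports

declare const kafkaProducer: { send(msg: object): Promise<void> }; // stand-in for a real Kafka client

// Fan each record out to disk, a database, and Kafka simultaneously.
const writer = new MultiWriter([
  new CsvWriter("./out/records.csv"),  // path signature assumed
  new SQLWriter({ table: "records" }), // options shape assumed
  new CallbackWriter(async (record) => {
    await kafkaProducer.send({ topic: "records", value: record }); // custom per-record code
  }),
]);
```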

5. Job Orchestrator

The Job class binds everything together. It handles the iteration loop, tracks performance metrics, and provides observability through an event-driven API; a usage sketch follows the list.

  • Event-Driven: Emits lifecycle events (start, progress, record, error, complete).
  • Progress Tracking: Built-in support for TTY progress bars.
  • Dry Runs: Job.preview() allows inspecting data without writing it.
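
A sketch of the event API and a dry run, using the event names listed above; the constructor, run(), and payload shapes are assumptions:

```ts
import { Job, DataReader, FileSource, JsonWriter } from "data-genie"; // hypothetical exports

const job = new Job(new DataReader(new FileSource("./in.jsonl")), new JsonWriter("./out.json")); // constructor assumed

// Lifecycle events named in the docs: start, progress, record, error, complete.
job.on("start", () => console.log("job started"));
job.on("progress", (p) => console.log(`processed so far: ${p.count}`)); // payload shape assumed
job.on("error", (err) => console.error("record failed:", err));
job.on("complete", (stats) => console.log("done", stats));

// Dry run: Job.preview() inspects data without writing it (return shape assumed).
console.table(await job.preview());

await job.run(); // run() entry point assumed
```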

The "Constant Memory" Secret

Unlike code that loads an entire file with fs.readFileSync or parses it in one go with JSON.parse, Data-Genie iterates with Async Iterators, so only a handful of records exist in memory at any given time. As soon as a record has been written to the sink, nothing references it and it can be garbage-collected, which lets you process arbitrarily large, even unbounded, streams in constant memory.
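
A plain-Node illustration of the same pattern, independent of Data-Genie, using only node:fs and node:readline (the JSONL file name is hypothetical):

```ts
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// An async iterator pulls one line at a time, so memory use stays flat
// regardless of file size.
async function* records(path: string) {
  const lines = createInterface({ input: createReadStream(path), crlfDelay: Infinity });
  for await (const line of lines) {
    if (line.trim()) yield JSON.parse(line); // one record in memory at a time
  }
}

for await (const record of records("./huge.jsonl")) {
  process.stdout.write(JSON.stringify(record) + "\n"); // stand-in for any sink; the record is GC-able after this
}
```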
