Architecture Overview
Data-Genie is built on three core pillars: Readers, Transformers, and Writers, all connected by Node.js streams.
The Pipeline Flow
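Conceptually, records move left to right through the three pillars, with invalid records optionally branching off to a Dead Letter Queue (a simplified view, reconstructed from the components described below):

```
DataReader (source) --> Transformers / Filters --> DataWriter (sink)
                               |
                               +--> Dead Letter Queue (invalid records)
```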
Core Components
1. DataReader
Responsible for fetching data from a source and parsing it into a stream of JavaScript objects (DataRecord).
- FileSource: Reads from local disk.
- S3Source: Streams directly from AWS S3 without downloading the whole file.
- HttpSource: Fetches and streams from REST APIs.
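A minimal usage sketch: the class names come from the list above, but the constructor option names (path, bucket, key, url) are illustrative assumptions, not documented signatures.

```ts
// Illustrative only: option shapes are assumed, not Data-Genie's documented API.
import { FileSource, S3Source, HttpSource } from "data-genie";

const fromDisk = new FileSource({ path: "./input/events.ndjson" });        // local file
const fromS3 = new S3Source({ bucket: "raw-data", key: "2024/events.csv" }); // streamed, never fully downloaded
const fromApi = new HttpSource({ url: "https://api.example.com/v1/events" }); // REST endpoint
```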
2. Transformers & Filters
Act as middleware in the pipeline. They receive a record, modify it (or discard it), and pass it along.
- TransformingReader: Applies mappings, renames fields, or adds calculated fields.
- FilteringReader: Keeps only records that match specific rules.
- SchemaValidatingReader: Ensures records match a Zod schema (shown in the DLQ sketch below).
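A hedged sketch of chaining the first two, building on the fromDisk source from the earlier example; the wrapping pattern and callback signatures are assumptions, and validation appears in the DLQ sketch below.

```ts
// Illustrative composition: class names are from the list above,
// but constructor shapes are assumptions.
import { TransformingReader, FilteringReader } from "data-genie";

// Add a calculated field to every record.
const enriched = new TransformingReader(fromDisk, (record: any) => ({
  ...record,
  amountCents: Math.round(record.amount * 100),
}));

// Keep only records that match a rule.
const positives = new FilteringReader(enriched, (record: any) => record.amountCents > 0);
```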
3. Resilience & Dead Letter Queues (DLQ)
Data-Genie is designed for "dirty" real-world data. Components like the SchemaValidatingReader support a Dead Letter Queue: instead of throwing an error and stopping the job, invalid records are diverted to a separate DataWriter (such as a JSON file or a secondary table), and the main pipeline continues uninterrupted.
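For example, a validating reader might be pointed at a secondary writer for rejects. In this sketch the deadLetterQueue option name, the JSONWriter constructor shape, and the third argument are assumptions for illustration:

```ts
// Hypothetical wiring: the deadLetterQueue option is assumed, not documented.
import { z } from "zod";
import { SchemaValidatingReader, JSONWriter } from "data-genie";

const EventSchema = z.object({ id: z.string(), amountCents: z.number().int() });

// Invalid records are diverted to this writer instead of aborting the job.
const rejects = new JSONWriter({ path: "./rejects.ndjson" });

const validated = new SchemaValidatingReader(positives, EventSchema, {
  deadLetterQueue: rejects,
});
```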
4. DataWriter
Takes the final objects and persists them to a destination.
- JSON/CSV Writers: Write to disk.
- SQLWriter: Bulk insert into databases.
- CallbackWriter: Execute custom code for every record (e.g., push to Kafka).
- MultiWriter: Fan-out data to multiple destinations simultaneously.
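A sketch of fanning out to several sinks at once; the constructor shapes are assumptions, and db and kafkaProducer stand in for your own database and messaging clients:

```ts
// Illustrative fan-out: names follow the list above, signatures are assumptions.
import { MultiWriter, CSVWriter, SQLWriter, CallbackWriter } from "data-genie";

const sink = new MultiWriter([
  new CSVWriter({ path: "./out/events.csv" }),        // flat file on disk
  new SQLWriter({ table: "events", connection: db }), // bulk insert; db is your own client
  new CallbackWriter(async (record: any) => {
    // kafkaProducer stands in for your own messaging client
    await kafkaProducer.send({
      topic: "events",
      messages: [{ value: JSON.stringify(record) }],
    });
  }),
]);
```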
5. Job Orchestrator
The Job class binds everything together. It handles the iteration loop, tracks performance metrics, and provides observability through an event-driven API.
- Event-Driven: Emits lifecycle events (start, progress, record, error, complete).
- Progress Tracking: Built-in support for TTY progress bars.
- Dry Runs: Job.preview() allows inspecting data without writing it.
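Putting it together, a run might look like this. The event names come from the list above; the Job constructor shape, the event payloads, and the run() method are assumptions:

```ts
// Hypothetical sketch: Job wiring is assumed; event names are from the list above.
import { Job } from "data-genie";

const job = new Job(validated, sink);

job.on("start", () => console.log("job started"));
job.on("progress", (p: any) => console.log(`processed ${p.processed} records`));
job.on("error", (err: Error) => console.error("record failed:", err));
job.on("complete", (stats: any) => console.log("done:", stats));

await job.preview(); // dry run: inspect records without writing them
await job.run();     // assumed method that kicks off the real pipeline
```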
The "Constant Memory" Secret
Unlike libraries that load an entire file with fs.readFileSync or JSON.parse, Data-Genie is built on Async Iterators. Only a handful of records exist in memory at any moment: as soon as a record is written to the sink, it becomes eligible for garbage collection, letting you process datasets far larger than RAM, including effectively unbounded streams.
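The underlying pattern is standard Node.js: an async generator yields one parsed record at a time, and a for await loop writes each record before pulling the next. A self-contained sketch using only built-in modules (run as an ES module; writeToSink is a placeholder for a real DataWriter):

```ts
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// Yield one parsed record per line; only the current line is ever buffered.
async function* readNdjson(path: string): AsyncGenerator<unknown> {
  const lines = createInterface({ input: createReadStream(path), crlfDelay: Infinity });
  for await (const line of lines) {
    if (line.trim()) yield JSON.parse(line);
  }
}

// Placeholder sink; in Data-Genie this role is played by a DataWriter.
async function writeToSink(record: unknown): Promise<void> {
  process.stdout.write(JSON.stringify(record) + "\n");
}

// The next record is not pulled until the current one is written,
// so memory use stays flat regardless of file size.
for await (const record of readNdjson("./huge.ndjson")) {
  await writeToSink(record); // after this, the record is garbage-collectable
}
```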