# CSV to S3 (JSON or Parquet)
Uploading large datasets to AWS S3 can be memory-intensive if you load the whole file into memory first. Data-Genie instead streams data directly to S3 using the AWS SDK's multipart upload capability.
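For context, the streaming itself is a standard AWS SDK capability: the `Upload` helper from `@aws-sdk/lib-storage` performs a multipart upload from a readable stream. Below is a minimal sketch of that primitive; it is plain AWS SDK code, not the `S3Sink` API (which presumably wraps something similar), and the bucket, key, and file name are placeholders.

```typescript
import { createReadStream } from 'node:fs';
import { S3Client } from '@aws-sdk/client-s3';
import { Upload } from '@aws-sdk/lib-storage';

// Pipe a local file to S3 as it is read; it is never buffered whole in memory.
// The helper splits the stream into parts (5 MB each by default) and uploads
// them with limited concurrency.
const upload = new Upload({
  client: new S3Client({ region: 'us-east-1' }),
  params: {
    Bucket: 'my-bucket',                       // placeholder bucket
    Key: 'exports/raw.csv',                    // placeholder key
    Body: createReadStream('local_data.csv'),  // streamed, not read into memory
  },
});

await upload.done();
```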
## The Strategy
- Use `CSVReader` for the source.
- Use `S3Sink` to manage the S3 connection.
- Use `JsonWriter` or `ParquetWriter` for the format.
## Implementation (CSV to S3 JSON)
```typescript
import { CSVReader, JsonWriter, S3Sink, Job } from '@pujansrt/data-genie';
import { S3Client } from '@aws-sdk/client-s3';

const s3Client = new S3Client({ region: 'us-east-1' });

async function run() {
  const reader = new CSVReader('local_data.csv');

  // S3Sink streams data directly to the bucket
  const sink = new S3Sink(s3Client, 'my-bucket', 'exports/data.json');
  const writer = new JsonWriter(sink);

  await Job.run(reader, writer);
}

run().catch(console.error);
```
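If you want to confirm the upload afterwards, a plain AWS SDK call is enough; nothing data-genie-specific is involved. For example:

```typescript
import { S3Client, HeadObjectCommand } from '@aws-sdk/client-s3';

const s3Client = new S3Client({ region: 'us-east-1' });

// Throws NotFound if the object is missing; otherwise returns its metadata.
const head = await s3Client.send(
  new HeadObjectCommand({ Bucket: 'my-bucket', Key: 'exports/data.json' })
);
console.log(`Uploaded ${head.ContentLength} bytes`);
```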
## Implementation (CSV to S3 Parquet)
Parquet is a columnar format, which makes it far more efficient than row-oriented JSON for big data analysis (e.g., in AWS Athena).
```typescript
import { CSVReader, ParquetWriter, S3Sink, Job } from '@pujansrt/data-genie';
import { S3Client } from '@aws-sdk/client-s3';

const s3Client = new S3Client({ region: 'us-east-1' });

async function run() {
  const reader = new CSVReader('local_data.csv');
  const writer = new ParquetWriter(new S3Sink(s3Client, 'my-bucket', 'exports/data.parquet'));

  // We must define the schema for Parquet
  writer.setSchema({
    name: { type: 'UTF8' },
    age: { type: 'INT64' },
    active: { type: 'BOOLEAN' }
  });

  await Job.run(reader, writer);
}

run().catch(console.error);
```
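As a follow-up to the Athena mention above: once an external table has been defined over the `exports/` prefix, the Parquet output can be queried in place. The sketch below uses `@aws-sdk/client-athena`; the database, table, and results location are placeholders and are not part of data-genie.

```typescript
import { AthenaClient, StartQueryExecutionCommand } from '@aws-sdk/client-athena';

const athena = new AthenaClient({ region: 'us-east-1' });

// Assumes an external table "exports_data" was created over
// s3://my-bucket/exports/ in the "analytics" database (both placeholders).
await athena.send(new StartQueryExecutionCommand({
  QueryString: 'SELECT name, age FROM exports_data WHERE active = true',
  QueryExecutionContext: { Database: 'analytics' },
  ResultConfiguration: { OutputLocation: 's3://my-bucket/athena-query-results/' },
}));
```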
## Key Benefits
- Direct Streaming: No temporary files are created on your local disk.
- Multipart Uploads: `S3Sink` automatically handles splitting large streams into parts for reliability and performance.