# 🦉 Owl-Watch
Owl-Watch is an AWS-native data engineering pipeline for ingesting, processing, and curating data using AWS Glue, Bedrock, and ML techniques.
It handles data extraction, scalable PySpark ETL jobs, machine learning curation, and automated infrastructure deployment through AWS CDK.
## Key Features
- 🏗️ Infrastructure as Code - Automated infrastructure deployment with AWS CDK (TypeScript)
- 📈 Scalable ETL - PySpark ETL jobs for data cleaning and transformation
- 🧠 ML Integration - Machine learning processing with AWS Bedrock and Lambda for sentiment analysis and data curation
- 🧪 Testing & Validation - End-to-end integration tests with mocked AWS services
- 💻 Local Development - Built-in local Spark runner for fast job iteration without cloud overhead
## Quick Start
### Prerequisites
- Python 3.11+
- Node.js 20+
- AWS CLI configured
- Hatch (Python build tool)
### Deploy Infrastructure
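A typical deployment follows the standard AWS CDK CLI workflow. The commands below are a sketch assuming a conventional TypeScript CDK app at the repository root; the actual stack names and npm scripts in this project may differ:

```shell
# Install Node.js dependencies for the CDK app
npm install

# Bootstrap the target account/region (first deployment only)
npx cdk bootstrap

# Synthesize and deploy all stacks
npx cdk deploy --all
```

`cdk deploy` will prompt for approval of any IAM or security-group changes; pass `--require-approval never` in CI pipelines where interactive prompts are not possible.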
## Running ETL Jobs Locally
You can run ETL jobs locally for development and testing using the built-in Hatch scripts. The local ETL runner uses local Spark instead of AWS Glue, creates sample data, outputs results to your local filesystem, and shows a transformed data preview.
```shell
# Run any ETL job with custom arguments
hatch run run-etl --job-type email_communication --output-path ./my-output

# Run specific ETL jobs with defaults
hatch run run-email-etl
hatch run run-slack-etl

# Run with custom output path
hatch run run-email-etl --output-path ./email-results
```
## Testing
Integration tests use pytest and moto to mock AWS services.