Architecture
Owl-Watch uses a fully AWS-native serverless architecture to orchestrate large-scale data engineering workloads. It separates data into raw, cleaned, and curated zones, each backed by its own S3 bucket, and uses AWS Glue for ETL processing and Amazon Bedrock, invoked from Lambda functions, for ML curation.
High-Level Flow
- Ingestion: Raw data is uploaded or streamed into the Raw S3 bucket.
- ETL Processing: AWS Glue jobs perform distributed data cleaning and transformation using PySpark.
- Storage: Cleaned data is stored in the Cleaned S3 bucket.
- ML Curation: Lambda functions call Amazon Bedrock to refine and analyze the cleaned data, producing sentiment analysis and other insights.
- Output: The final curated data is stored in the Curated S3 bucket for downstream consumption.
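The flow above can be sketched in plain Python to show how a record moves through the three zones. This is a minimal illustration, not the project's implementation: the bucket names, the `clean` and `curate` functions, and the injected `classify` callable are all assumptions standing in for the Glue job and the Lambda/Bedrock step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Zones:
    # Hypothetical bucket names; the real ones are defined by the CDK stacks.
    raw: str = "owl-watch-raw"
    cleaned: str = "owl-watch-cleaned"
    curated: str = "owl-watch-curated"


def clean(record: dict) -> dict:
    """Stand-in for the Glue/PySpark step: trim strings, drop empty fields."""
    out = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value not in ("", None):
            out[key] = value
    return out


def curate(record: dict, classify: Callable[[str], str]) -> dict:
    """Stand-in for the Lambda/Bedrock step: attach a sentiment label.

    `classify` is injected so the sketch runs without AWS access.
    """
    return {**record, "sentiment": classify(record.get("text", ""))}


def run_flow(key: str, record: dict, classify: Callable[[str], str],
             zones: Zones = Zones()) -> dict:
    """Return what each zone would hold for one object key."""
    cleaned = clean(record)
    curated = curate(cleaned, classify)
    return {
        f"s3://{zones.raw}/{key}": record,
        f"s3://{zones.cleaned}/{key}": cleaned,
        f"s3://{zones.curated}/{key}": curated,
    }
```

Calling `run_flow("a.json", {"text": "  great  ", "note": ""}, lambda text: "positive")` returns one entry per zone. The point of the sketch is that each stage reads from the previous zone and writes to the next, so the raw object is never modified in place.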
Project Structure
- cdk/ - AWS CDK infrastructure (TypeScript)
    - lib/stacks/ - Data, Glue, and Monitoring stacks
    - lib/utils/ - Asset and resource creation utilities
- execution/ - PySpark ETL and ML code (Python)
    - core/ - Job runners, config managers, and factory patterns
    - jobs/ - Specialized ETL and ML jobs
    - models/ - Data models and quality constraints
    - schemas/ - Standardized schemas for communication data
- integration_tests/ - Integration tests (Python, pytest)
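As an illustration of the factory pattern mentioned for core/, here is one way a job factory could be organized. It is a sketch under stated assumptions: the `Job` base class, the `register` decorator, the job names, and `create_job` are hypothetical and are not taken from the Owl-Watch code.

```python
from typing import Dict, Type


class Job:
    """Minimal job interface (hypothetical)."""

    def __init__(self, config: dict):
        self.config = config

    def run(self) -> str:
        raise NotImplementedError


# Maps a job type name to the class that implements it.
_REGISTRY: Dict[str, Type[Job]] = {}


def register(name: str):
    """Class decorator that adds a job class to the registry under `name`."""
    def decorator(cls: Type[Job]) -> Type[Job]:
        _REGISTRY[name] = cls
        return cls
    return decorator


@register("etl")
class EtlJob(Job):
    def run(self) -> str:
        return f"cleaned {self.config['source']}"


@register("sentiment")
class SentimentJob(Job):
    def run(self) -> str:
        return f"scored {self.config['source']}"


def create_job(job_type: str, config: dict) -> Job:
    """Look up the job class by name and instantiate it with its config."""
    try:
        cls = _REGISTRY[job_type]
    except KeyError:
        known = sorted(_REGISTRY)
        raise ValueError(f"unknown job type {job_type!r}; known: {known}") from None
    return cls(config)
```

With a registry like this, a job runner only needs the job type string and a config dictionary, for example `create_job("etl", {"source": "raw/a.json"}).run()`, and new jobs can be added without editing the factory function.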