Projects
Data engineering projects demonstrating end-to-end pipelines, cloud warehousing, and database optimization
Projects In Development
Urban Mobility & Transportation Optimizer
Building an end-to-end real-time data pipeline that processes live transportation data to help optimize commute decisions. The system integrates multiple data sources, including public transit GPS locations, bike-share availability APIs, and weather data, through a Lambda Architecture that combines streaming and batch processing layers.
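As a rough illustration of the Lambda Architecture idea, the sketch below shows one way a serving layer could combine a precomputed batch view with a speed view built from events that arrived after the last batch run. The function name `merge_views`, the route keys, and the counts are hypothetical and are not taken from the project.

```python
from collections import Counter


def merge_views(batch_view: dict[str, int], speed_view: dict[str, int]) -> dict[str, int]:
    """Combine batch-layer counts with speed-layer counts, key by key.

    Assumption: both views hold additive counts (e.g. position updates
    per route), and the speed view only covers events the batch run
    has not processed yet, so nothing is double counted.
    """
    merged = Counter(batch_view)
    merged.update(speed_view)  # Counter.update adds counts instead of overwriting
    return dict(merged)


# Hypothetical example data
batch_view = {"route_10": 1200, "route_22": 860}   # from the last batch run
speed_view = {"route_10": 35, "route_7": 4}        # from streaming since then
combined = merge_views(batch_view, speed_view)
```

Non-additive metrics such as averages or distinct counts would need a different merge rule, so this only covers the simplest case.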
Implementing Apache Kafka for real-time event streaming to capture live vehicle positions and availability updates every 30 seconds. The pipeline processes geospatial data to calculate optimal routes, predict delays using machine learning models, and provide real-time recommendations. Apache Airflow orchestrates daily batch jobs for historical pattern analysis, ML model retraining, and data warehouse updates.
The system features a multi-layered storage architecture: PostgreSQL for operational data (current vehicle states), an S3 data lake for raw event storage, and Snowflake for analytical queries on historical patterns. It also includes monitoring, data quality validation, and an interactive Streamlit dashboard that displays live transportation status with predictive delay alerts and multimodal route recommendations.
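A data quality check on incoming position events might look something like the following sketch. The field names (`vehicle_id`, `lat`, `lon`, `ts`) and the specific rules are hypothetical; they stand in for whatever schema and validation rules the pipeline ends up using.

```python
REQUIRED_FIELDS = ("vehicle_id", "lat", "lon", "ts")  # hypothetical schema


def validate_position(event: dict) -> list[str]:
    """Return a list of problems found in one position event (empty if valid)."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in event]
    if errors:
        return errors  # range checks below need every field to be present

    for name in ("lat", "lon", "ts"):
        # bool is a subclass of int, so reject it explicitly
        if isinstance(event[name], bool) or not isinstance(event[name], (int, float)):
            errors.append(f"{name} is not numeric")
    if errors:
        return errors

    if not -90 <= event["lat"] <= 90:
        errors.append("lat out of range")
    if not -180 <= event["lon"] <= 180:
        errors.append("lon out of range")
    if event["ts"] <= 0:
        errors.append("ts must be a positive Unix timestamp")
    return errors
```

Returning every problem rather than stopping at the first makes it easier to count failures per rule in monitoring.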
Database Performance Optimization Study
Planning a performance comparison of row-based (PostgreSQL) and columnar (ClickHouse) database architectures for analytics workloads using the industry-standard TPC-H benchmark dataset. The study will systematically evaluate query performance, storage efficiency, compression ratios, and memory utilization at data scales from 1 GB to 100 GB.
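A measurement harness for such a study could be sketched as below. It times any callable that runs a query, discards warm-up runs so that cold caches do not skew the numbers, and reports the spread across repeats. The names `time_query` and `compression_ratio`, and the default warm-up and repeat counts, are assumptions for illustration rather than the study's actual methodology.

```python
import statistics
import time
from typing import Callable


def time_query(run_query: Callable[[], object], warmups: int = 1, repeats: int = 5) -> dict[str, float]:
    """Time a query callable and summarize wall-clock latency in seconds."""
    for _ in range(warmups):
        run_query()  # warm caches; these runs are not measured

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)

    return {
        "min_s": min(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }


def compression_ratio(raw_bytes: int, stored_bytes: int) -> float:
    """Ratio of uncompressed size to on-disk size (higher means better compression)."""
    return raw_bytes / stored_bytes
```

In practice `run_query` would wrap a call to the database driver for PostgreSQL or ClickHouse. Reporting the median alongside the minimum and maximum makes outliers from background activity visible.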
The study will test multiple optimization techniques, including indexing strategies (B-tree, hash, bloom filters), partitioning schemes (range, list, hash), compression algorithms, and query tuning approaches. It will measure performance across analytical query patterns such as aggregations, joins, time-series analysis, and complex multi-table queries. Results will be documented with detailed execution plans, performance metrics, and recommendations for when to use each database architecture.
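To make the three partitioning schemes concrete, the sketch below routes a row to a partition by range, by list, and by hash. It illustrates the concepts only; it is not how PostgreSQL or ClickHouse implement partitioning internally, and the boundary dates, region mapping, and function names are hypothetical.

```python
import bisect
import zlib
from datetime import date


def range_partition(value: date, upper_bounds: list[date]) -> int:
    """Index of the partition for `value`, given sorted exclusive upper bounds.

    Values at or beyond the last bound fall into a final overflow partition.
    """
    return bisect.bisect_right(upper_bounds, value)


def list_partition(value: str, mapping: dict[str, str], default: str = "p_default") -> str:
    """Partition name for an explicitly listed value, or a default partition."""
    return mapping.get(value, default)


def hash_partition(key: str, partitions: int) -> int:
    """Partition index from a stable hash of the key (CRC32 here, as an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % partitions


# Hypothetical configuration
bounds = [date(1995, 1, 1), date(1996, 1, 1)]
regions = {"EUROPE": "p_emea", "AFRICA": "p_emea", "ASIA": "p_apac"}
```

Range partitioning suits date filters because whole partitions can be skipped, list partitioning suits low-cardinality categories, and hash partitioning spreads rows evenly when no natural range or category exists.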