Projects

Data engineering projects demonstrating end-to-end pipelines, cloud warehousing, and database optimization

Projects In Development

Urban Mobility & Transportation Optimizer

Status: In Progress
Technologies: Python, Apache Kafka, Apache Airflow, PostgreSQL, AWS S3, Snowflake, Docker, Pandas, SQL, Streamlit

Building an end-to-end real-time data pipeline that processes live transportation data to help optimize commute decisions. The system integrates multiple data sources including public transit GPS locations, bike-share availability APIs, and weather data through a Lambda Architecture combining both streaming and batch processing layers.
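As a rough sketch of the Lambda Architecture idea described above (all names and values here are illustrative, not from the project): a precomputed batch view is merged with a speed-layer view built from events received since the last batch run, with the fresher speed-layer values winning on overlap.

```python
def merge_views(batch_view: dict, speed_view: dict) -> dict:
    """Combine a precomputed batch view with recent speed-layer updates.

    Both views map a route id to an estimated travel time in minutes.
    The speed layer, built from live events since the last batch run,
    overrides stale batch values where it has fresher data.
    """
    merged = dict(batch_view)   # start from the batch view
    merged.update(speed_view)   # speed layer wins on overlap
    return merged

# Hypothetical example: nightly batch estimates vs. a live adjustment.
batch_view = {"bus_42": 18.0, "bus_7": 25.5, "bike_A": 12.0}
speed_view = {"bus_42": 24.0}   # live delay detected on route 42
print(merge_views(batch_view, speed_view))
# {'bus_42': 24.0, 'bus_7': 25.5, 'bike_A': 12.0}
```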

Implementing Apache Kafka for real-time event streaming to capture live vehicle positions and availability updates every 30 seconds. The pipeline processes geospatial data to calculate optimal routes, predict delays using machine learning models, and provide real-time recommendations. Apache Airflow orchestrates daily batch jobs for historical pattern analysis, ML model retraining, and data warehouse updates.
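A minimal sketch of the per-event geospatial step, assuming each consumed message carries a GPS fix as (lat, lon, unix_ts); consecutive fixes for a vehicle yield a speed estimate that a delay model could consume. Function names are illustrative.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two GPS fixes."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def speed_kmh(prev_fix, curr_fix):
    """Estimate vehicle speed from two (lat, lon, unix_ts) fixes.

    With updates arriving every 30 seconds, an unusually low speed on a
    known route segment is one simple signal of a developing delay.
    """
    dist = haversine_km(prev_fix[0], prev_fix[1], curr_fix[0], curr_fix[1])
    hours = (curr_fix[2] - prev_fix[2]) / 3600.0
    return dist / hours if hours > 0 else 0.0
```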

The system features a multi-layered storage architecture: PostgreSQL for operational data (current vehicle states), an S3 data lake for raw event storage, and Snowflake for analytical queries on historical patterns. The design also includes comprehensive monitoring, data quality validation, and an interactive Streamlit dashboard displaying live transportation status with predictive delay alerts and multimodal route recommendations.
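The raw-event layer of the data lake could be laid out along these lines (the key scheme below is a hypothetical sketch, not the project's actual layout): partitioning objects by source, date, and hour keeps backfills and downstream rebuilds cheap to scan.

```python
from datetime import datetime, timezone

def raw_event_key(source: str, event_time: datetime) -> str:
    """Build a date-partitioned S3 object key for a raw event batch.

    Hive-style dt=/hour= partitions let batch jobs prune whole
    prefixes instead of listing the entire bucket.
    """
    return (
        f"raw/{source}/"
        f"dt={event_time:%Y-%m-%d}/hour={event_time:%H}/"
        f"events-{int(event_time.timestamp())}.jsonl"
    )

ts = datetime(2025, 3, 1, 14, 30, tzinfo=timezone.utc)
print(raw_event_key("vehicle_positions", ts))
# raw/vehicle_positions/dt=2025-03-01/hour=14/events-1740839400.jsonl
```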

Current Status: In the early development phase, setting up Kafka infrastructure and integrating the initial data sources.

Database Performance Optimization Study

Status: Planned
Technologies: PostgreSQL, ClickHouse, Python, Docker, Jupyter Notebooks, TPC-H Benchmark

Planning a comprehensive performance comparison of row-based (PostgreSQL) and columnar (ClickHouse) database architectures for analytics workloads using the industry-standard TPC-H benchmark. The study will systematically evaluate query performance, storage efficiency, compression ratios, and memory utilization across dataset sizes ranging from 1 GB to 100 GB.
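The core of such a study is a repeatable timing harness. The sketch below uses SQLite purely as a stand-in so it runs anywhere; in the actual study the connection would point at PostgreSQL or ClickHouse, and the table and query are toy placeholders.

```python
import sqlite3
import statistics
import time

def time_query(conn, sql, runs=5):
    """Run a query several times and report the median wall-clock seconds.

    The first execution is discarded as a cache warm-up, and the median
    is reported because it is less noisy than the mean across runs.
    """
    timings = []
    for i in range(runs + 1):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        if i > 0:  # drop the warm-up run
            timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Tiny stand-in workload: SQLite here, PostgreSQL/ClickHouse in the study.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (l_quantity REAL, l_price REAL)")
conn.executemany("INSERT INTO lineitem VALUES (?, ?)",
                 [(i % 50, i * 0.01) for i in range(10_000)])
median_s = time_query(conn, "SELECT l_quantity, SUM(l_price) "
                            "FROM lineitem GROUP BY l_quantity")
print(f"median runtime: {median_s:.6f} s")
```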

The study will test multiple optimization techniques including indexing strategies (B-tree, hash, Bloom filters), partitioning schemes (range, list, hash), compression algorithms, and query-tuning approaches. It will measure performance across various analytical query patterns such as aggregations, joins, time-series analysis, and complex multi-table queries. Results will be documented with detailed execution plans, performance metrics, and recommendations for when to use each database architecture.
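Once per-query timings are collected, comparing the two systems reduces to a small summary step. A sketch of that step, using hypothetical placeholder timings (not measured results):

```python
def speedups(pg_times: dict, ch_times: dict) -> dict:
    """Per-query speedup of ClickHouse over PostgreSQL (>1 favours ClickHouse)."""
    return {q: round(pg_times[q] / ch_times[q], 1)
            for q in pg_times if q in ch_times}

# Hypothetical timings in seconds; placeholders, not measured results.
pg = {"Q1": 42.0, "Q3": 18.5, "Q6": 30.0}
ch = {"Q1": 1.4, "Q3": 2.1, "Q6": 0.9}
print(speedups(pg, ch))
# {'Q1': 30.0, 'Q3': 8.8, 'Q6': 33.3}
```

Reporting a per-query ratio rather than a single aggregate number matters here: columnar engines typically shine on scan-heavy aggregations while row stores can hold their own on selective lookups, so the recommendation depends on the query pattern.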

Current Status: Research design and methodology in development. Setting up Docker environments for both databases and preparing TPC-H data generation scripts. Benchmarking framework and detailed results will be published upon completion.