Decision Trees and Boosting for Predictive Modeling

1. What This Project Does

This project is an end-to-end machine learning pipeline that predicts passenger survival on the Titanic dataset. It builds ensemble-based classifiers, implements the core algorithms from scratch, applies feature engineering, and validates results with standard evaluation metrics and experiment tracking tools.

2. Project Workflow

  • Preprocessed the Titanic dataset with NumPy and Pandas: handled missing values, encoded categorical features, normalized features, and engineered new features.
  • Implemented a custom Decision Tree classifier from scratch supporting Gini impurity, entropy, and misclassification error as splitting criteria.
  • Built a Random Forest ensemble using bootstrap sampling and randomized feature selection to improve model robustness.
  • Developed a custom AdaBoost algorithm that combines weak learners through exponential sample reweighting to improve predictive performance.
  • Integrated XGBoost models to compare the custom implementations against gradient boosting.
  • Performed systematic train–test splits and evaluated models using scikit-learn metrics including accuracy, precision, recall, and F1-score.
  • Tuned hyperparameters to improve ensemble performance and address class imbalance.
  • Tracked experiments and logged 12+ performance metrics using MLflow to enable reproducible analysis.
  • Created interactive Power BI dashboards to visualize training outcomes and compare model performance across algorithms.
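
The from-scratch components in the list above can be illustrated with a minimal sketch. It is not the project's actual code: the function names (impurity, split_gain, adaboost_reweight) are hypothetical, the split is assumed to be a single numeric threshold, and the reweighting step assumes the standard discrete AdaBoost update for binary labels.

```python
import numpy as np


def impurity(labels, criterion="gini"):
    """Impurity of a set of class labels under one of three criteria."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    # Class proportions p_k; every entry is > 0 because only observed
    # classes are counted, so log2(p) is always finite.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    if criterion == "gini":
        return float(1.0 - np.sum(p ** 2))
    if criterion == "entropy":
        return float(-np.sum(p * np.log2(p)) + 0.0)
    if criterion == "misclassification":
        return float(1.0 - p.max())
    raise ValueError(f"unknown criterion: {criterion}")


def split_gain(feature, labels, threshold, criterion="gini"):
    """Impurity reduction from splitting on feature <= threshold."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    # Child impurity is the size-weighted average of the two sides.
    child = (left.size * impurity(left, criterion)
             + right.size * impurity(right, criterion)) / labels.size
    return impurity(labels, criterion) - child


def adaboost_reweight(weights, y_true, y_pred, eps=1e-10):
    """One discrete AdaBoost round: returns (new_weights, alpha).

    Assumes binary labels; only equality of y_true and y_pred matters.
    """
    weights = np.asarray(weights, dtype=float)
    miss = np.asarray(y_true) != np.asarray(y_pred)
    # Weighted error of the weak learner, clipped to avoid log(0).
    err = np.clip(weights[miss].sum() / weights.sum(), eps, 1.0 - eps)
    alpha = 0.5 * np.log((1.0 - err) / err)
    # Exponential reweighting: up-weight mistakes, down-weight hits.
    new_weights = weights * np.exp(np.where(miss, alpha, -alpha))
    return new_weights / new_weights.sum(), float(alpha)


if __name__ == "__main__":
    y = np.array([0, 0, 1, 1])
    x = np.array([1.0, 2.0, 3.0, 4.0])
    for name in ("gini", "entropy", "misclassification"):
        print(name, impurity(y, name), split_gain(x, y, 2.0, name))

    w, alpha = adaboost_reweight(np.full(4, 0.25), y, np.array([0, 0, 1, 0]))
    print(w, alpha)
```

A tree builder would call split_gain for each candidate feature and threshold and keep the best one. In the boosting sketch, the misclassified samples carry half of the total weight after each update, which forces the next weak learner to focus on them.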

3. Results and Impact

Model optimization and ensemble boosting raised prediction accuracy from 76% to 82% and the F1-score from 0.72 to 0.78. Metric tracking in MLflow and the Power BI dashboards made performance transparent and directly comparable across algorithms, showing the value of ensemble learning and custom algorithm implementation for this classification task.

4. Summary

This project combines hands-on algorithm development with applied machine learning engineering to deliver a tuned classification pipeline. It builds models from scratch, applies ensemble techniques, logs experiments with MLflow, and visualizes results in Power BI dashboards, covering the machine learning workflow from data preparation through model evaluation and reporting.

5. Tech Stack

  • Python
  • NumPy
  • Pandas
  • scikit-learn
  • MLflow
  • Power BI
  • Custom Decision Tree Implementation
  • Random Forest
  • AdaBoost
  • XGBoost