Pipeline Overview

The Titanic ML pipeline is a sequential workflow that automates the full machine learning lifecycle, from raw data to predictions. It can be run end-to-end through run_pipeline.sh, or reproduced stage by stage with DVC using dvc.yaml.


DVC Integration

The repository includes a DVC pipeline with these stages:

  1. download
  2. preprocess
  3. validate
  4. train
  5. evaluate
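
A dvc.yaml wiring these stages together might look like the sketch below. The script paths (src/...) and exact dependency/output lists are assumptions for illustration; the authoritative definitions are in the repository's own dvc.yaml:

```yaml
stages:
  download:
    cmd: python src/download_data.py
    outs:
      - data/raw/train.csv
      - data/raw/test.csv
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - data/raw/train.csv
      - data/raw/test.csv
    outs:
      - data/processed
      - models/preprocessing_pipeline.pkl
  validate:
    cmd: pytest tests/test_data.py
    deps:
      - data/processed
  train:
    cmd: python src/train.py
    deps:
      - data/processed
      - models/preprocessing_pipeline.pkl
    outs:
      - models/best_model.pkl
    metrics:
      - reports/train_metrics.json
  evaluate:
    cmd: python src/test_model.py
    deps:
      - models/best_model.pkl
    metrics:
      - reports/metrics.json
```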

Before using dvc pull or dvc push, configure the existing origin remote locally (the --local flag writes the credentials to .dvc/config.local, which DVC keeps out of git):

dvc remote modify origin --local access_key_id c355...2ffc
dvc remote modify origin --local secret_access_key c355...2ffc

Run the pipeline with:

dvc repro

Pipeline Flow

┌──────────────────────────────────────────────────────────────┐
│                    run_pipeline.sh                           │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Step 1: download_data.py                                    │
│  ├─ Check if data/raw/train.csv exists                       │
│  ├─ If not → download from Kaggle via kagglehub              │
│  └─ Output: data/raw/train.csv, data/raw/test.csv            │
│                          ↓                                   │
│  Step 2: preprocess.py                                       │
│  ├─ Load raw CSVs                                            │
│  ├─ Drop columns (Cabin, PassengerId, Ticket, Name)          │
│  ├─ Build two sklearn ColumnTransformers:                    │
│  │   ├─ onehot_preprocessor (for RF, ET, GB, XGB)            │
│  │   └─ hist_preprocessor (for HistGBT — uses OrdinalEncoder)│
│  ├─ Train/valid split (80/20, stratified)                    │
│  ├─ Save CSVs → data/processed/                              │
│  └─ Save pipelines → models/preprocessing_pipeline.pkl       │
│                          ↓                                   │
│  Step 3: pytest tests/test_data.py                           │
│  ├─ Validate: no missing values after transform              │
│  ├─ Validate: correct column count                           │
│  ├─ Validate: binary target (0 or 1)                         │
│  ├─ Validate: split ratio matches config                     │
│  └─ ⛔ Pipeline STOPS if any test fails                      │
│                          ↓                                   │
│  Step 4: train.py                                            │
│  ├─ Load processed data + preprocessing pipelines            │
│  ├─ Run Optuna study (n_trials from config)                  │
│  ├─ Search across 6 model families:                          │
│  │   RF, ExtraTrees, GBT, HistGBT, XGBoost, CatBoost         │
│  ├─ 5-fold StratifiedKFold cross-validation                  │
│  ├─ Optional: log to Weights & Biases and MLflow             │
│  ├─ Retrain best model on full train+valid set               │
│  ├─ Save model → models/best_model.pkl                       │
│  ├─ Save metrics → reports/train_metrics.json                │
│  └─ Save chart → reports/figures/optuna_top10_accuracy.png   │
│                          ↓                                   │
│  Step 5: test_model.py                                       │
│  ├─ Load best model from pickle                              │
│  ├─ Predict on X_valid                                       │
│  ├─ Report accuracy + ROC-AUC                                │
│  └─ Update reports/metrics.json                              │
│                                                              │
└──────────────────────────────────────────────────────────────┘

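The Step 3 checks can be sketched as small assertion helpers. These are illustrative stand-ins for the real assertions in tests/test_data.py, where the split ratio is read from config rather than hard-coded:

```python
import numpy as np
import pandas as pd

def check_no_missing(X: np.ndarray) -> None:
    # After the preprocessing transform, no NaNs should remain
    assert not np.isnan(X).any(), "transform left missing values"

def check_binary_target(y: pd.Series) -> None:
    # Survived must be strictly 0/1
    assert set(y.unique()) <= {0, 1}, "target must be binary (0 or 1)"

def check_split_ratio(n_train: int, n_valid: int, ratio: float = 0.2) -> None:
    # Validation share should match the configured 80/20 split
    observed = n_valid / (n_train + n_valid)
    assert abs(observed - ratio) < 0.01, f"split ratio {observed:.3f} != {ratio}"
```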
Model Families

The Optuna study searches across these 6 model families:

Model                    Library        Preprocessor Used
Random Forest            scikit-learn   OneHotEncoder
Extra Trees              scikit-learn   OneHotEncoder
Gradient Boosting        scikit-learn   OneHotEncoder
Hist Gradient Boosting   scikit-learn   OrdinalEncoder
XGBoost                  xgboost        OneHotEncoder
CatBoost                 catboost       Native categorical handling

Generated Outputs

File                                         Description
models/best_model.pkl                        Serialized best model (includes preprocessing pipeline)
models/preprocessing_pipeline.pkl            Fitted sklearn ColumnTransformer objects
reports/train_metrics.json                   Training accuracy, ROC-AUC, best CV accuracy, best params
reports/metrics.json                         Validation accuracy and ROC-AUC
reports/figures/optuna_top10_accuracy.png    Bar chart of top 10 Optuna trials by accuracy
data/processed/*.csv                         Processed train/valid splits
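
A downstream consumer reads these artifacts with plain pickle and json. The sketch below uses a temporary directory and a dummy estimator in place of the real best_model.pkl and metrics values, which are produced by the pipeline:

```python
import json
import pickle
import tempfile
from pathlib import Path

from sklearn.dummy import DummyClassifier

outdir = Path(tempfile.mkdtemp())
(outdir / "models").mkdir()
(outdir / "reports").mkdir()

# Stand-ins for what train.py writes: a pickled estimator and JSON metrics
model = DummyClassifier(strategy="most_frequent").fit([[0], [1]], [0, 1])
(outdir / "models" / "best_model.pkl").write_bytes(pickle.dumps(model))
(outdir / "reports" / "metrics.json").write_text(
    json.dumps({"accuracy": 0.81, "roc_auc": 0.85}))

# test_model.py-style consumer: load the model and the metrics back
loaded = pickle.loads((outdir / "models" / "best_model.pkl").read_bytes())
metrics = json.loads((outdir / "reports" / "metrics.json").read_text())
```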

Logging

All pipeline modules use a centralized colored logger (src/logger.py).

Log output format:

2026-04-22 19:00:03 | INFO | preprocess | Loaded train: (891, 12), test: (418, 11)

Color coding:

Level      Color
DEBUG      Cyan
INFO       Green
WARNING    Yellow
ERROR      Red
CRITICAL   Bold Red
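
A formatter matching the table above can be sketched with the standard logging module. This is an illustrative stand-in; the actual implementation lives in src/logger.py and may differ:

```python
import logging

# ANSI escape codes matching the color table above
COLORS = {
    "DEBUG": "\033[36m",       # cyan
    "INFO": "\033[32m",        # green
    "WARNING": "\033[33m",     # yellow
    "ERROR": "\033[31m",       # red
    "CRITICAL": "\033[1;31m",  # bold red
}
RESET = "\033[0m"

class ColorFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        color = COLORS.get(record.levelname, "")
        record.levelname = f"{color}{record.levelname}{RESET}"
        return super().format(record)

def get_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(ColorFormatter(
            "%(asctime)s | %(levelname)s | %(name)s | %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S"))
        logger.addHandler(handler)
        logger.setLevel(logging.DEBUG)
    return logger
```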