Configuration

All pipeline settings are managed via Hydra — a powerful configuration framework by Meta Research. The project uses Hydra's Compose API to load config.yaml at the project root, with full support for CLI overrides so you can tweak any setting without editing files.


How It Works

The src/config.py module uses Hydra's Compose API (initialize_config_dir + compose) instead of the @hydra.main decorator. This allows:

  • Module-level config — CONFIG is loaded once at import time and shared across all pipeline modules
  • CLI overrides — any key=value argument passed on the command line automatically overrides the corresponding config value
  • No Hydra side-effects — no outputs/ directory, no .hydra/ folder, no working directory changes

Typical usage:

from src.config import CONFIG, resolve_path

# Access any config value
n_trials = CONFIG["training"]["n_trials"]
train_csv = resolve_path(CONFIG["data"]["train_csv"])
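For illustration, get_project_root and resolve_path (both referenced above) could be implemented along these lines. This is a minimal sketch that assumes the project root is identified by the presence of config.yaml; the actual src/config.py may differ in details such as error handling and signatures:

```python
from pathlib import Path
from typing import Optional


def get_project_root(start: Optional[Path] = None) -> Path:
    """Walk upward from `start` (default: cwd) until a directory containing config.yaml is found."""
    current = (start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if (candidate / "config.yaml").exists():
            return candidate
    raise FileNotFoundError(f"config.yaml not found above {current}")


def resolve_path(relative: str, root: Optional[Path] = None) -> Path:
    """Resolve a config-relative path such as data/raw/train.csv against the project root."""
    return (root or get_project_root()) / relative
```

Resolving against the project root (rather than the current working directory) is what lets the pipeline run correctly from any directory.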

Full Configuration Reference

data:
  raw_dir: "data/raw"                  # Directory for raw Kaggle CSVs
  processed_dir: "data/processed"      # Directory for processed train/valid splits
  train_csv: "data/raw/train.csv"      # Path to raw training CSV
  test_csv: "data/raw/test.csv"        # Path to raw test CSV

preprocessing:
  dropped_columns:                     # Columns to drop from features
    - "Cabin"
    - "PassengerId"
    - "Ticket"
    - "Name"
    - "Survived"
  target_column: "Survived"            # Target column name
  numeric_features:                    # Numeric feature columns
    - "Age"
    - "SibSp"
    - "Parch"
    - "Fare"
  categorical_features:                # Categorical feature columns
    - "Pclass"
    - "Sex"
    - "Embarked"
  test_size: 0.2                       # Validation split ratio
  random_state: 42                     # Random seed for reproducibility
  pipeline_path: "models/preprocessing_pipeline.pkl"  # Where to save fitted preprocessors

training:
  n_trials: 30                         # Number of Optuna trials
  cv_folds: 5                          # Number of cross-validation folds
  random_state: 42                     # Random seed
  model_path: "models/best_model.pkl"  # Where to save the best model
  study_name: "titanic_tree_search"    # Optuna study name

reports:
  dir: "reports"                       # Reports output directory
  metrics_file: "reports/metrics.json" # Metrics JSON output path
  optuna_top10_chart: "reports/figures/optuna_top10_accuracy.png"  # Top-10 chart path

models:
  dir: "models"                        # Models output directory

CLI Overrides

You can override any config value from the command line using Hydra's key=value syntax — no need to edit config.yaml:

Basic Overrides

# Change number of Optuna trials
python -m src.train training.n_trials=5

# Change validation split ratio
python -m src.preprocess preprocessing.test_size=0.3

# Override multiple values at once
python -m src.train training.n_trials=50 training.cv_folds=10

# Change model output path
python -m src.train training.model_path="models/experiment_v2.pkl"

Nested Key Syntax

Use dot notation to reach nested config values:

# Override the raw data directory
python -m src.download_data data.raw_dir="data/custom_raw"

# Override the study name
python -m src.train training.study_name="my_custom_study"
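Conceptually, each dotted override walks the nested config and replaces one leaf value. Hydra (via OmegaConf) implements this with full type checking, validation, and interpolation; the apply_override helper below is only a hypothetical toy that mimics the effect on a plain dict:

```python
import ast
from typing import Any, Dict


def apply_override(cfg: Dict[str, Any], override: str) -> None:
    """Apply a single "a.b.c=value" override to a nested dict in place."""
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for key in parents:          # walk down to the parent of the leaf
        node = node[key]
    try:
        # Interpret numbers, booleans, lists, and quoted strings literally.
        node[leaf] = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        node[leaf] = raw_value   # keep bare words as plain strings
```

This also explains the list-override quoting in the next section: the value after `=` must parse as a literal, so the whole argument is wrapped in single quotes to protect the brackets from the shell.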

Override Lists

# Override numeric features
python -m src.preprocess 'preprocessing.numeric_features=["Age", "Fare"]'

Override Examples by Pipeline Step

Step        Example Command
----------  ----------------------------------------------------------------
Download    python -m src.download_data data.raw_dir="data/custom"
Preprocess  python -m src.preprocess preprocessing.test_size=0.3
Train       python -m src.train training.n_trials=50 training.cv_folds=10
Test        python -m src.test_model training.model_path="models/experiment.pkl"

Environment Variables

Environment variables are loaded from the .env file at the project root. These are separate from Hydra config and control external service credentials.

Variable         Required            Description
---------------  ------------------  -------------------------------------------------------
KAGGLE_USERNAME  Yes (for download)  Your Kaggle username
KAGGLE_KEY       Yes (for download)  Your Kaggle API key
WANDB_API_KEY    No                  Weights & Biases API key — enables experiment tracking
WANDB_PROJECT    No                  wandb project name (default: mlops-lab0)

Tip

Copy .env.example to .env and fill in your values:

cp .env.example .env
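
Under the hood, loading a .env file amounts to parsing KEY=VALUE lines and exporting them into the process environment; libraries such as python-dotenv do exactly this, plus richer quoting and interpolation rules. A minimal illustration, with a hypothetical load_env_file helper:

```python
import os
from pathlib import Path


def load_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines from a .env file and export them.

    Blank lines and comments are skipped; surrounding quotes are stripped.
    Real loaders (e.g. python-dotenv) handle many more edge cases.
    """
    loaded = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip().strip('"').strip("'")
    # Export so that e.g. the Kaggle client can read KAGGLE_USERNAME/KAGGLE_KEY.
    os.environ.update(loaded)
    return loaded
```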


Customizing the Pipeline

Quick experiments via CLI (no file changes needed)

# Fast trial run — 5 trials, 3 folds
python -m src.train training.n_trials=5 training.cv_folds=3

# Full production run — 100 trials, 10 folds
python -m src.train training.n_trials=100 training.cv_folds=10

Permanent changes via config.yaml

Edit config.yaml directly for changes you want to persist:

training:
  n_trials: 50  # Increase for more thorough search

Add or remove features

preprocessing:
  numeric_features: ["Age", "SibSp", "Parch", "Fare"]
  categorical_features: ["Pclass", "Sex", "Embarked"]

Warning

If you modify feature lists, make sure the corresponding columns exist in the raw CSV, and re-run preprocessing before training.
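
A pre-flight check along these lines can catch the mismatch early (missing_feature_columns is a hypothetical helper, not part of the project):

```python
from typing import Dict, Iterable, List


def missing_feature_columns(csv_header: Iterable[str], cfg: Dict) -> List[str]:
    """Return configured feature columns that do not appear in the CSV header."""
    available = set(csv_header)
    configured = list(cfg["preprocessing"]["numeric_features"]) + list(
        cfg["preprocessing"]["categorical_features"]
    )
    return [column for column in configured if column not in available]
```

Run it against the header row of data/raw/train.csv before invoking python -m src.preprocess; a non-empty result means the config references columns the data does not have.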


How Config Loading Works Internally

┌──────────────────────────────────────────────────┐
│           python -m src.train                    │
│               training.n_trials=5                │
├──────────────────────────────────────────────────┤
│                                                  │
│  1. src/config.py is imported                    │
│  2. get_project_root() finds config.yaml         │
│  3. _parse_cli_overrides() extracts              │
│     ["training.n_trials=5"] from sys.argv        │
│  4. Hydra compose() loads config.yaml            │
│     and applies the overrides                    │
│  5. CONFIG dict is available globally            │
│                                                  │
│  Result: CONFIG["training"]["n_trials"] == 5     │
│          (instead of the default 30)             │
│                                                  │
└──────────────────────────────────────────────────┘
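
Step 3 in the diagram is straightforward to approximate: pick out the key=value tokens from sys.argv and leave everything else alone. A hedged sketch (the project's actual _parse_cli_overrides may filter differently):

```python
import sys
from typing import List, Optional


def parse_cli_overrides(argv: Optional[List[str]] = None) -> List[str]:
    """Collect Hydra-style key=value tokens from the command line.

    Flags such as --help and bare positional arguments are ignored; only
    tokens containing '=' and not starting with '-' are treated as overrides.
    """
    args = sys.argv[1:] if argv is None else argv
    return [arg for arg in args if "=" in arg and not arg.startswith("-")]
```

The resulting list is exactly what gets handed to Hydra's compose() as its overrides argument in step 4.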

Note

Config is loaded once at import time (CONFIG = load_config() at module level). All modules that import CONFIG share the same overridden values for that process run.