Configuration

All pipeline settings are managed via Hydra — a powerful configuration framework by Meta Research. The project uses Hydra's Compose API to load config.yaml at the project root, with full support for CLI overrides so you can tweak any setting without editing files.


How It Works

The src/config.py module uses Hydra's Compose API (initialize_config_dir + compose) instead of the @hydra.main decorator. This allows:

  • Module-level config — CONFIG is loaded once at import time and shared across all pipeline modules
  • CLI overrides — any key=value argument passed on the command line automatically overrides the corresponding config value
  • No Hydra side-effects — no outputs/ directory, no .hydra/ folder, no working directory changes

Typical usage:

from src.config import CONFIG, resolve_path

# Access any config value
n_trials = CONFIG["training"]["n_trials"]
train_csv = resolve_path(CONFIG["data"]["train_csv"])
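For illustration, get_project_root and resolve_path (both referenced above) could be implemented along these lines. This is a minimal sketch that assumes the project root is identified by the presence of config.yaml; the actual src/config.py may differ in details such as error handling and signatures:

```python
from pathlib import Path
from typing import Optional


def get_project_root(start: Optional[Path] = None) -> Path:
    """Walk upward from `start` (default: cwd) until a directory containing config.yaml is found."""
    current = (start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if (candidate / "config.yaml").exists():
            return candidate
    raise FileNotFoundError(f"config.yaml not found above {current}")


def resolve_path(relative: str, root: Optional[Path] = None) -> Path:
    """Resolve a config-relative path such as data/raw/train.csv against the project root."""
    return (root or get_project_root()) / relative
```

Resolving against the project root (rather than the current working directory) is what lets the pipeline run correctly from any directory.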

Full Configuration Reference

data:
  raw_dir: "data/raw"                  # Directory for raw Kaggle CSVs
  processed_dir: "data/processed"      # Directory for processed train/valid splits
  train_csv: "data/raw/train.csv"      # Path to raw training CSV
  test_csv: "data/raw/test.csv"        # Path to raw test CSV

preprocessing:
  dropped_columns:                     # Columns to drop from features
    - "Cabin"
    - "PassengerId"
    - "Ticket"
    - "Name"
    - "Survived"
  target_column: "Survived"            # Target column name
  numeric_features:                    # Numeric feature columns
    - "Age"
    - "SibSp"
    - "Parch"
    - "Fare"
  categorical_features:                # Categorical feature columns
    - "Pclass"
    - "Sex"
    - "Embarked"
  test_size: 0.2                       # Validation split ratio
  random_state: 42                     # Random seed for reproducibility
  pipeline_path: "models/preprocessing_pipeline.pkl"  # Where to save fitted preprocessors

training:
  n_trials: 30                         # Number of Optuna trials
  cv_folds: 5                          # Number of cross-validation folds
  random_state: 42                     # Random seed
  model_path: "models/best_model.pkl"  # Where to save the best model
  study_name: "titanic_tree_search"    # Optuna study name

reports:
  dir: "reports"                       # Reports output directory
  metrics_file: "reports/metrics.json" # Metrics JSON output path
  optuna_top10_chart: "reports/figures/optuna_top10_accuracy.png"  # Top-10 chart path

models:
  dir: "models"                        # Models output directory

CLI Overrides

You can override any config value from the command line using Hydra's key=value syntax — no need to edit config.yaml:

Basic Overrides

# Change number of Optuna trials
python -m src.train training.n_trials=5

# Change validation split ratio
python -m src.preprocess preprocessing.test_size=0.3

# Override multiple values at once
python -m src.train training.n_trials=50 training.cv_folds=10

# Change model output path
python -m src.train training.model_path="models/experiment_v2.pkl"

Nested Key Syntax

Use dot notation to reach nested config values:

# Override the raw data directory
python -m src.download_data data.raw_dir="data/custom_raw"

# Override the study name
python -m src.train training.study_name="my_custom_study"
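Conceptually, each dotted override walks the nested config and replaces one leaf value. Hydra (via OmegaConf) implements this with full type checking, validation, and interpolation; the apply_override helper below is only a hypothetical toy that mimics the effect on a plain dict:

```python
import ast
from typing import Any, Dict


def apply_override(cfg: Dict[str, Any], override: str) -> None:
    """Apply a single "a.b.c=value" override to a nested dict in place."""
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for key in parents:          # walk down to the parent of the leaf
        node = node[key]
    try:
        # Interpret numbers, booleans, lists, and quoted strings literally.
        node[leaf] = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        node[leaf] = raw_value   # keep bare words as plain strings
```

This also explains the list-override quoting in the next section: the value after `=` must parse as a literal, so the whole argument is wrapped in single quotes to protect the brackets from the shell.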

Override Lists

# Override numeric features
python -m src.preprocess 'preprocessing.numeric_features=["Age", "Fare"]'

Override Examples by Pipeline Step

Step        Example Command
----------  ----------------------------------------------------------------
Download    python -m src.download_data data.raw_dir="data/custom"
Preprocess  python -m src.preprocess preprocessing.test_size=0.3
Train       python -m src.train training.n_trials=50 training.cv_folds=10
Test        python -m src.test_model training.model_path="models/experiment.pkl"

Environment Variables

Environment variables are loaded from the .env file at the project root. These are separate from Hydra config and control external service credentials.

Variable         Required            Description
---------------  ------------------  -------------------------------------------------------
KAGGLE_USERNAME  Yes (for download)  Your Kaggle username
KAGGLE_KEY       Yes (for download)  Your Kaggle API key
WANDB_API_KEY    No                  Weights & Biases API key — enables experiment tracking
WANDB_PROJECT    No                  wandb project name (default: mlops-lab0)

Tip

Copy .env.example to .env and fill in your values:

cp .env.example .env
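
Under the hood, loading a .env file amounts to parsing KEY=VALUE lines and exporting them into the process environment; libraries such as python-dotenv do exactly this, plus richer quoting and interpolation rules. A minimal illustration, with a hypothetical load_env_file helper:

```python
import os
from pathlib import Path


def load_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines from a .env file and export them.

    Blank lines and comments are skipped; surrounding quotes are stripped.
    Real loaders (e.g. python-dotenv) handle many more edge cases.
    """
    loaded = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip().strip('"').strip("'")
    # Export so that e.g. the Kaggle client can read KAGGLE_USERNAME/KAGGLE_KEY.
    os.environ.update(loaded)
    return loaded
```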


Customizing the Pipeline

Quick experiments via CLI (no file changes needed)

# Fast trial run — 5 trials, 3 folds
python -m src.train training.n_trials=5 training.cv_folds=3

# Full production run — 100 trials, 10 folds
python -m src.train training.n_trials=100 training.cv_folds=10

Permanent changes via config.yaml

Edit config.yaml directly for changes you want to persist:

training:
  n_trials: 50  # Increase for more thorough search

Add or remove features

preprocessing:
  numeric_features: ["Age", "SibSp", "Parch", "Fare"]
  categorical_features: ["Pclass", "Sex", "Embarked"]

Warning

If you modify feature lists, make sure the corresponding columns exist in the raw CSV, and re-run preprocessing before training.
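
A pre-flight check along these lines can catch the mismatch early (missing_feature_columns is a hypothetical helper, not part of the project):

```python
from typing import Dict, Iterable, List


def missing_feature_columns(csv_header: Iterable[str], cfg: Dict) -> List[str]:
    """Return configured feature columns that do not appear in the CSV header."""
    available = set(csv_header)
    configured = list(cfg["preprocessing"]["numeric_features"]) + list(
        cfg["preprocessing"]["categorical_features"]
    )
    return [column for column in configured if column not in available]
```

Run it against the header row of data/raw/train.csv before invoking python -m src.preprocess; a non-empty result means the config references columns the data does not have.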


How Config Loading Works Internally

┌──────────────────────────────────────────────────┐
│           python -m src.train                    │
│               training.n_trials=5                │
├──────────────────────────────────────────────────┤
│                                                  │
│  1. src/config.py is imported                    │
│  2. get_project_root() finds config.yaml         │
│  3. _parse_cli_overrides() extracts              │
│     ["training.n_trials=5"] from sys.argv        │
│  4. Hydra compose() loads config.yaml            │
│     and applies the overrides                    │
│  5. CONFIG dict is available globally            │
│                                                  │
│  Result: CONFIG["training"]["n_trials"] == 5     │
│          (instead of the default 30)             │
│                                                  │
└──────────────────────────────────────────────────┘
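
Step 3 in the diagram is straightforward to approximate: pick out the key=value tokens from sys.argv and leave everything else alone. A hedged sketch (the project's actual _parse_cli_overrides may filter differently):

```python
import sys
from typing import List, Optional


def parse_cli_overrides(argv: Optional[List[str]] = None) -> List[str]:
    """Collect Hydra-style key=value tokens from the command line.

    Flags such as --help and bare positional arguments are ignored; only
    tokens containing '=' and not starting with '-' are treated as overrides.
    """
    args = sys.argv[1:] if argv is None else argv
    return [arg for arg in args if "=" in arg and not arg.startswith("-")]
```

The resulting list is exactly what gets handed to Hydra's compose() as its overrides argument in step 4.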

Note

Config is loaded once at import time (CONFIG = load_config() at module level). All modules that import CONFIG share the same overridden values for that process run.