Getting Started
This guide walks you through setting up and running the Titanic ML pipeline from scratch.
Prerequisites
- Python 3.12+
- A Kaggle account with an API key (create one from your Kaggle account settings)
- (Optional) A Weights & Biases account for experiment tracking
- (Optional) An MLflow tracking server for Optuna trial logging
- DVC available in your environment for pipeline reproduction and remote artifact sync
1. Clone the Repository
git clone https://github.com/saadaltohamy/mlops-lab0.git
cd mlops-lab0
2. Install uv (recommended)
uv is a fast Python package manager written in Rust. Install it with:
curl -LsSf https://astral.sh/uv/install.sh | sh
3. Install Dependencies
Option A: Using uv (recommended)
uv sync
This reads pyproject.toml + uv.lock, creates a .venv, and installs all dependencies in one step.
To activate the environment afterwards:
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
Option B: Using pip
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
pip install -e .
This installs all required packages defined in pyproject.toml, including:
- scikit-learn, XGBoost, CatBoost
- Optuna + optuna-integration[wandb]
- Hydra (configuration management)
- pandas, matplotlib, seaborn
- pytest, python-dotenv
4. Set Up Environment Variables
cp .env.example .env
Edit .env with your credentials:
KAGGLE_USERNAME=your_kaggle_username
KAGGLE_KEY=your_kaggle_api_key
# Optional — wandb logging auto-enables when this key is set
WANDB_API_KEY=your_wandb_api_key
WANDB_PROJECT=mlops-lab0
# Optional — Optuna trials are logged to MLflow when this URI is set
MLFLOW_TRACKING_URI=http://localhost:5000
# Optional — set these when your MLflow server requires basic auth
MLFLOW_TRACKING_USERNAME=your_mlflow_username
MLFLOW_TRACKING_PASSWORD=your_mlflow_password
Note
If WANDB_API_KEY or MLFLOW_TRACKING_URI is not set, the pipeline runs normally and that tracking integration is skipped with a warning.
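This optional-integration behavior can be sketched as follows; the helper name below is hypothetical and only illustrates the pattern, not the pipeline's actual code:

```python
import os
import warnings

def tracking_enabled(env_var: str) -> bool:
    """Hypothetical helper: True if the backend's env var is set; otherwise warn and skip."""
    if os.environ.get(env_var):
        return True
    warnings.warn(f"{env_var} is not set; skipping that tracking integration")
    return False

# With neither variable set, both integrations are skipped and the run proceeds.
use_wandb = tracking_enabled("WANDB_API_KEY")
use_mlflow = tracking_enabled("MLFLOW_TRACKING_URI")
```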
5. Configure DVC Remote Credentials
The repository already defines the origin DVC remote. Configure credentials locally before using dvc pull or dvc push:
dvc remote modify origin --local access_key_id c355...2ffc
dvc remote modify origin --local secret_access_key c355...2ffc
These credentials are written to .dvc/config.local and should not be committed.
6. Run the Full Pipeline
bash run_pipeline.sh
This runs all steps in sequence:
- Download — fetches Titanic data from Kaggle (skips if data already exists)
- Preprocess — builds sklearn pipelines, splits data, saves artifacts
- Validate — runs pytest data quality checks (stops pipeline on failure)
- Train — runs Optuna HPO (30 trials), saves best model + metrics + chart
- Test — evaluates best model on validation set
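The sequence above can also be sketched in Python, as a simplified equivalent of run_pipeline.sh (the real script may handle environment setup and logging that this sketch omits):

```python
import subprocess
import sys

# The five pipeline steps, in order. check=True aborts on the first
# non-zero exit, which is how a failed validation step stops the pipeline.
STEPS = [
    [sys.executable, "-m", "src.download_data"],                    # Download
    [sys.executable, "-m", "src.preprocess"],                       # Preprocess
    [sys.executable, "-m", "pytest", "tests/test_data.py", "-v"],   # Validate
    [sys.executable, "-m", "src.train"],                            # Train
    [sys.executable, "-m", "src.test_model"],                       # Test
]

def run_pipeline() -> None:
    for cmd in STEPS:
        subprocess.run(cmd, check=True)

# Calling run_pipeline() from the repository root runs the same
# sequence as `bash run_pipeline.sh`.
```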
7. Reproduce with DVC
The same workflow is also available through dvc.yaml:
dvc repro
Common DVC commands:
dvc status   # show which pipeline stages are out of date
dvc push     # upload tracked artifacts to the remote
dvc pull     # download tracked artifacts from the remote
8. Run Inference
After training, predict on new data:
python src/inference.py --input data/raw/test.csv --output reports/predictions.csv
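The flags above imply an argparse-style interface roughly like the following sketch (the real src/inference.py may accept additional options):

```python
import argparse

# A sketch of the CLI implied by the command above; names of the two
# flags come from the documented invocation, the help text is assumed.
parser = argparse.ArgumentParser(
    description="Score new passenger data with the trained model"
)
parser.add_argument("--input", required=True,
                    help="path to a CSV with the raw feature columns")
parser.add_argument("--output", required=True,
                    help="where to write the predictions CSV")

args = parser.parse_args(["--input", "data/raw/test.csv",
                          "--output", "reports/predictions.csv"])
```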
9. Serve the Registered MLflow Model
Build the serving image:
docker build -t titanic-mlflow-serve .
Run the container with your MLflow credentials from .env:
docker run -p 5000:5000 --env-file .env titanic-mlflow-serve
Invoke the served model:
curl -X POST http://127.0.0.1:5000/invocations \
-H "Content-Type: application/json" \
-d '{
"dataframe_records": [
{
"Pclass": 3,
"Sex": "male",
"Age": 22,
"SibSp": 1,
"Parch": 0,
"Fare": 7.25,
"Embarked": "S"
}
]
}'
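The same request can be made from Python using only the standard library; the URL and payload below match the curl example, and assume the serving container is listening on http://127.0.0.1:5000:

```python
import json
import urllib.request

# Payload identical to the curl example above.
PAYLOAD = {
    "dataframe_records": [
        {"Pclass": 3, "Sex": "male", "Age": 22,
         "SibSp": 1, "Parch": 0, "Fare": 7.25, "Embarked": "S"}
    ]
}

def predict(url: str = "http://127.0.0.1:5000/invocations") -> dict:
    """POST the records to the model server and return its JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(PAYLOAD).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Call predict() once the container from the previous step is running.
```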
Running Individual Steps
Each module can be run independently:
python -m src.download_data # Step 1: Download data
python -m src.preprocess # Step 2: Preprocess
python -m pytest tests/test_data.py -v # Step 3: Validate
python -m src.train # Step 4: Train
python -m src.test_model # Step 5: Test
Using CLI Overrides (Hydra)
Override any config value without editing config.yaml:
# Quick run with 5 trials instead of 30
python -m src.train training.n_trials=5
# Override multiple values
python -m src.train training.n_trials=50 training.cv_folds=10
# Change split ratio
python -m src.preprocess preprocessing.test_size=0.3
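Conceptually, each dotted override rewrites one key in the nested config; a minimal pure-Python illustration of these semantics (Hydra's real override grammar is far more capable):

```python
def apply_override(config: dict, override: str) -> dict:
    """Apply one Hydra-style dotted override, e.g. 'training.n_trials=5', to a nested dict."""
    path, _, raw = override.partition("=")
    keys = path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # Best-effort literal parsing: try int, then float, else keep the string.
    for cast in (int, float):
        try:
            node[keys[-1]] = cast(raw)
            break
        except ValueError:
            continue
    else:
        node[keys[-1]] = raw
    return config

cfg = {"training": {"n_trials": 30, "cv_folds": 5}}
apply_override(cfg, "training.n_trials=5")
# cfg["training"]["n_trials"] is now 5
```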
See Configuration for full override syntax and examples.