src.download_data
Data download module. Downloads the Titanic dataset from Kaggle via kagglehub, or skips if the data already exists locally.
Usage
python src/download_data.py
Functions
download_data
def download_data() -> None
Downloads the Titanic competition data from Kaggle if train.csv and test.csv are not already present in the configured data.raw_dir.
Behavior:
- Checks if `data/raw/train.csv` and `data/raw/test.csv` exist
- If both exist → logs a message and returns (skips download)
- If not → reads `KAGGLE_KEY` from the environment, sets the API token, and calls `kagglehub.competition_download("titanic", ...)`
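The behavior above could be sketched roughly as follows. This is a minimal illustration, not the module's actual code: `ensure_raw_data` and the injected `fetch` callable are hypothetical names, and `fetch` stands in for the call to `kagglehub.competition_download("titanic", ...)` so that the sketch runs without network access or credentials.

```python
import os
from pathlib import Path
from typing import Callable


def ensure_raw_data(raw_dir: Path, fetch: Callable[[Path], None]) -> bool:
    """Hypothetical helper: return True if a download ran, False if skipped."""
    train_csv = raw_dir / "train.csv"
    test_csv = raw_dir / "test.csv"

    # Skip the download when both expected files are already present.
    if train_csv.exists() and test_csv.exists():
        return False

    # The key is only needed when a download is actually required.
    if not os.environ.get("KAGGLE_KEY"):
        raise EnvironmentError("KAGGLE_KEY is not found in the environment")

    raw_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: in the real module this step is the kagglehub call.
    fetch(raw_dir)
    return True
```

Passing the download step in as a callable is a choice made for this sketch only; it keeps the skip logic testable with a stub that writes two placeholder files.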
Environment Variables Required:
| Variable | Description |
|---|---|
| KAGGLE_KEY | Kaggle API key (loaded from .env) |
Config Keys Used:
| Key | Description |
|---|---|
| data.raw_dir | Directory to save downloaded files |
| data.train_csv | Expected path of train.csv |
| data.test_csv | Expected path of test.csv |
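To make the dotted key names concrete, here is a small sketch of how such keys might resolve against a nested configuration. The dictionary layout and the `get_key` helper are assumptions for illustration; the project's real config format and loader are not shown in this page. The path values are taken from the behavior description above.

```python
from functools import reduce
from typing import Any, Mapping

# Assumed layout: dotted keys map onto nested sections.
config = {
    "data": {
        "raw_dir": "data/raw",
        "train_csv": "data/raw/train.csv",
        "test_csv": "data/raw/test.csv",
    }
}


def get_key(cfg: Mapping[str, Any], dotted: str) -> Any:
    """Hypothetical helper: walk a nested mapping using a dotted key."""
    return reduce(lambda node, part: node[part], dotted.split("."), cfg)


raw_dir = get_key(config, "data.raw_dir")
train_csv = get_key(config, "data.train_csv")
```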
Raises:
| Exception | Condition |
|---|---|
| EnvironmentError | If KAGGLE_KEY is not found in the environment |
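For callers that want to handle this case, note that in Python 3 `EnvironmentError` is an alias of `OSError`, so either name works in an `except` clause. The sketch below demonstrates this with a hypothetical `require_env` helper and a made-up variable name, rather than the module's own check.

```python
import os


def require_env(name: str) -> str:
    """Hypothetical helper: return the variable's value or raise if unset."""
    value = os.environ.get(name)
    if not value:
        raise EnvironmentError(f"{name} is not found in the environment")
    return value


# Use a demo variable name and make sure it is unset for the illustration.
os.environ.pop("KAGGLE_KEY_DEMO_UNSET", None)

try:
    require_env("KAGGLE_KEY_DEMO_UNSET")
except OSError as exc:  # also catches EnvironmentError, which is the same class
    message = str(exc)
```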
Example:
from src.download_data import download_data
download_data()
# Logs: "Data already exists at data/raw. Skipping download."
# — or —
# Logs: "Downloading Titanic dataset to data/raw ..."