# src.preprocess

Preprocessing module. Loads raw data, builds sklearn preprocessing pipelines, splits data into train/validation sets, and saves all artifacts.


## Usage

```bash
python src/preprocess.py
```

## Functions

### build_preprocessors

```python
def build_preprocessors(
    numeric_features: list[str],
    categorical_features: list[str],
) -> tuple[ColumnTransformer, ColumnTransformer]
```

Builds two sklearn ColumnTransformer preprocessing pipelines:

  1. `onehot_preprocessor` — `SimpleImputer(strategy="median")` for numeric features plus `OneHotEncoder` for categoricals. Used by: Random Forest, Extra Trees, Gradient Boosting, XGBoost.
  2. `hist_preprocessor` — `SimpleImputer(strategy="median")` for numeric features plus `OrdinalEncoder` for categoricals. Used by: Hist Gradient Boosting.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| `numeric_features` | `list[str]` | Column names of numeric features (e.g. `["Age", "Fare"]`) |
| `categorical_features` | `list[str]` | Column names of categorical features (e.g. `["Sex", "Embarked"]`) |

Returns:

| Type | Description |
| --- | --- |
| `tuple[ColumnTransformer, ColumnTransformer]` | `(onehot_preprocessor, hist_preprocessor)` |

Pipeline Details:

```text
onehot_preprocessor:
  ├─ num: SimpleImputer(strategy="median")
  └─ cat: SimpleImputer(strategy="most_frequent") → OneHotEncoder(handle_unknown="ignore")

hist_preprocessor:
  ├─ num: SimpleImputer(strategy="median")
  └─ cat: SimpleImputer(strategy="most_frequent") → OrdinalEncoder(handle_unknown="use_encoded_value")
```
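The pipeline details above can be sketched as follows. This is a minimal sketch assuming standard sklearn components, not the exact implementation; note that `OrdinalEncoder(handle_unknown="use_encoded_value")` requires an explicit `unknown_value`, assumed here to be `-1`.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder


def build_preprocessors(
    numeric_features: list[str],
    categorical_features: list[str],
) -> tuple[ColumnTransformer, ColumnTransformer]:
    # One-hot variant: dense dummies for the tree-ensemble models.
    onehot_cat = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    onehot_preprocessor = ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), numeric_features),
        ("cat", onehot_cat, categorical_features),
    ])

    # Ordinal variant: integer codes for HistGradientBoosting;
    # categories unseen at fit time map to -1 (assumed unknown_value).
    hist_cat = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value",
                                  unknown_value=-1)),
    ])
    hist_preprocessor = ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), numeric_features),
        ("cat", hist_cat, categorical_features),
    ])
    return onehot_preprocessor, hist_preprocessor
```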

### run_preprocessing

```python
def run_preprocessing() -> None
```

Main preprocessing pipeline entry point.

Steps:

  1. Load raw train.csv and test.csv from paths in config
  2. Drop columns specified in preprocessing.dropped_columns
  3. Extract target column (Survived)
  4. Split into train/validation using train_test_split with stratification
  5. Build and fit both preprocessors on X_train
  6. Compute helper values: scale_pos_weight, hist_categorical_feature_idx
  7. Save processed CSVs to data/processed/
  8. Save preprocessing pipeline bundle to pickle
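The helper values from step 6 can be computed as in the sketch below. The function name `compute_helper_values` is illustrative; the assumptions are that `scale_pos_weight` is the usual negative-to-positive class ratio and that the hist `ColumnTransformer` orders output columns as `[numeric..., categorical...]`, so the categorical indices come last.

```python
import pandas as pd


def compute_helper_values(
    y_train: pd.Series,
    numeric_features: list[str],
    categorical_features: list[str],
) -> tuple[float, list[int]]:
    # Ratio of negative to positive samples, used to counter class imbalance.
    n_neg = int((y_train == 0).sum())
    n_pos = int((y_train == 1).sum())
    scale_pos_weight = n_neg / n_pos

    # Numeric columns come first in the transformed output, so the
    # categorical columns occupy the trailing index positions.
    n_num = len(numeric_features)
    hist_categorical_feature_idx = list(
        range(n_num, n_num + len(categorical_features))
    )
    return scale_pos_weight, hist_categorical_feature_idx
```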

Config Keys Used:

| Key | Description |
| --- | --- |
| `data.train_csv` | Path to raw training CSV |
| `data.test_csv` | Path to raw test CSV |
| `data.processed_dir` | Output directory for processed data |
| `preprocessing.dropped_columns` | Columns to exclude from features |
| `preprocessing.target_column` | Target column name |
| `preprocessing.numeric_features` | Numeric feature column names |
| `preprocessing.categorical_features` | Categorical feature column names |
| `preprocessing.test_size` | Validation split ratio |
| `preprocessing.random_state` | Random seed |
| `preprocessing.pipeline_path` | Path to save the pipeline pickle |

Output Files:

| File | Content |
| --- | --- |
| `data/processed/X_train.csv` | Training features |
| `data/processed/X_valid.csv` | Validation features |
| `data/processed/y_train.csv` | Training labels |
| `data/processed/y_valid.csv` | Validation labels |
| `data/processed/X_test_competition.csv` | Competition test features |
| `data/processed/test_passenger_ids.csv` | PassengerIds for inference |
| `models/preprocessing_pipeline.pkl` | Fitted preprocessor bundle |

Pickle Bundle Contents:

The saved pickle file contains a dictionary with:

```python
{
    "onehot_preprocessor": ColumnTransformer,    # Fitted one-hot pipeline
    "hist_preprocessor": ColumnTransformer,      # Fitted ordinal pipeline
    "scale_pos_weight": float,                   # Class weight ratio
    "hist_categorical_feature_idx": list[int],   # Categorical column indices
    "numeric_features": list[str],               # Feature names
    "categorical_features": list[str],           # Feature names
}
```
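Downstream code restores everything in a single read. The round trip below uses a temporary path and stand-in values so the sketch is self-contained; in practice the path comes from `preprocessing.pipeline_path` and the preprocessors are the fitted `ColumnTransformer` objects.

```python
import pickle
import tempfile
from pathlib import Path

# Illustrative bundle with the same keys as the saved dictionary.
bundle = {
    "onehot_preprocessor": None,   # stand-in for a fitted ColumnTransformer
    "hist_preprocessor": None,     # stand-in for a fitted ColumnTransformer
    "scale_pos_weight": 1.5,
    "hist_categorical_feature_idx": [2, 3],
    "numeric_features": ["Age", "Fare"],
    "categorical_features": ["Sex", "Embarked"],
}

path = Path(tempfile.mkdtemp()) / "preprocessing_pipeline.pkl"
with open(path, "wb") as f:
    pickle.dump(bundle, f)

# A consumer (e.g. the training script) loads the whole bundle at once.
with open(path, "rb") as f:
    restored = pickle.load(f)
```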

Raises:

| Exception | Condition |
| --- | --- |
| `FileNotFoundError` | If `train.csv` is not found at the configured path |