Data Validation Tests¶

Data validation tests run automatically after preprocessing (Step 3 in the pipeline). If any test fails, the pipeline stops, so training never proceeds on invalid data.

Running Tests¶

python -m pytest tests/test_data.py -v

Test Descriptions¶

`test_data_files_exist`¶

Verifies that all expected processed data files exist on disk:

data/processed/X_train.csv
data/processed/X_valid.csv
data/processed/y_train.csv
data/processed/y_valid.csv
data/processed/X_test_competition.csv
data/processed/test_passenger_ids.csv
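
As a rough illustration of this check, the sketch below collects the expected file names that are absent from a directory. The helper name `missing_processed_files` is hypothetical and not taken from `tests/test_data.py`; only the file names come from the list above.

```python
from pathlib import Path

# File names copied from the list above.
EXPECTED_FILES = [
    "X_train.csv",
    "X_valid.csv",
    "y_train.csv",
    "y_valid.csv",
    "X_test_competition.csv",
    "test_passenger_ids.csv",
]


def missing_processed_files(processed_dir):
    """Return the expected file names that are not present in processed_dir."""
    root = Path(processed_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

A test built on this helper would assert that the returned list is empty and include the list in the failure message, so a single run reports every missing file at once.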

`test_preprocessing_pipeline_exists`¶

Verifies that the fitted preprocessing pipeline pickle file exists at the path specified in config.yaml (default: models/preprocessing_pipeline.pkl).

`test_no_missing_values_after_transform`¶

Applies the fitted onehot_preprocessor to both X_train and X_valid, then asserts that the transformed arrays contain no NaN values.

This validates that the SimpleImputer steps in the preprocessing pipeline are working correctly.
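
The NaN check itself might look like the minimal sketch below. The helper `count_nans` is a hypothetical name, and how `onehot_preprocessor` is retrieved from the pipeline pickle is not shown here because the source does not specify it.

```python
import numpy as np


def count_nans(transformed):
    """Count NaN entries in the output of a fitted transformer."""
    # One-hot encoders may return a SciPy sparse matrix; densify it first.
    if hasattr(transformed, "toarray"):
        transformed = transformed.toarray()
    return int(np.isnan(np.asarray(transformed, dtype=float)).sum())


# Assumed usage inside the test (onehot_preprocessor is the fitted transformer):
#     assert count_nans(onehot_preprocessor.transform(X_train)) == 0
#     assert count_nans(onehot_preprocessor.transform(X_valid)) == 0
```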

`test_correct_shape`¶

Asserts that:

X_train and X_valid have the same number of columns
The column count matches the total number of configured features (numeric_features + categorical_features)
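
One possible way to express both assertions is sketched below. The function name `shape_problems` is hypothetical; `numeric_features` and `categorical_features` are assumed to be plain lists of column names read from the configuration.

```python
def shape_problems(train_columns, valid_columns, numeric_features, categorical_features):
    """Return human-readable descriptions of any column-count mismatches."""
    problems = []
    expected = len(numeric_features) + len(categorical_features)
    if len(train_columns) != len(valid_columns):
        problems.append(
            f"X_train has {len(train_columns)} columns, X_valid has {len(valid_columns)}"
        )
    if len(train_columns) != expected:
        problems.append(
            f"X_train has {len(train_columns)} columns, config lists {expected} features"
        )
    return problems


# Assumed usage inside the test:
#     assert not shape_problems(X_train.columns, X_valid.columns,
#                               numeric_features, categorical_features)
```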

`test_target_is_binary`¶

Asserts that y_train and y_valid only contain values 0 and 1.
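
A minimal sketch of the binary check, using a hypothetical helper `is_binary` that accepts any iterable of labels (a pandas Series works as well as a list):

```python
def is_binary(values):
    """Return True if every label is 0 or 1."""
    return set(values) <= {0, 1}


# Assumed usage inside the test:
#     assert is_binary(y_train)
#     assert is_binary(y_valid)
```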

`test_no_duplicate_rows`¶

Checks for duplicate rows in X_train. Because we drop identifier columns (Name, Ticket, Cabin, PassengerId), some passengers may naturally have identical feature values.

Note

This test warns rather than fails, since feature-level duplicates are expected for the Titanic dataset after dropping identifiers.
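
The warn-instead-of-fail behaviour could be sketched as follows. The helper name `warn_on_duplicates` and the warning text are illustrative assumptions; the actual test may report duplicates differently.

```python
import warnings

import pandas as pd


def warn_on_duplicates(frame: pd.DataFrame) -> int:
    """Emit a UserWarning if any row repeats an earlier row; return the count."""
    # duplicated() flags every repeat of an earlier row (the first copy is not flagged).
    n_duplicates = int(frame.duplicated().sum())
    if n_duplicates:
        warnings.warn(f"{n_duplicates} duplicate feature rows found", UserWarning)
    return n_duplicates
```

Because the helper only warns, pytest lists the message in its warnings summary while the test still passes.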

`test_train_valid_split_ratio`¶

Asserts that the actual validation split ratio approximately matches the configured test_size (within ±5% tolerance).
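
The ratio comparison can be sketched as below. This assumes the ±5% tolerance is an absolute difference of 0.05 between the observed and configured ratios; the source does not say whether it is absolute or relative. The helper name `split_ratio_ok` is hypothetical.

```python
def split_ratio_ok(n_train, n_valid, test_size, tolerance=0.05):
    """Return True if the validation share is within tolerance of test_size."""
    actual = n_valid / (n_train + n_valid)
    # Assumption: tolerance is absolute (0.05), not relative to test_size.
    return abs(actual - test_size) <= tolerance


# Assumed usage inside the test:
#     assert split_ratio_ok(len(X_train), len(X_valid), test_size)
```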

Fixtures¶

The tests use pytest module-scoped fixtures to load data efficiently:

| Fixture | Description |
| --- | --- |
| `processed_dir` | Resolved path to `data/processed/` |
| `X_train` | Loaded `X_train.csv` as a DataFrame |
| `X_valid` | Loaded `X_valid.csv` as a DataFrame |
| `y_train` | Loaded `y_train.csv` as a Series |
| `y_valid` | Loaded `y_valid.csv` as a Series |
| `pipeline_bundle` | Loaded preprocessing pipeline pickle dict |
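
For orientation, a module-scoped fixture setup along these lines might look like the sketch below. The loader helpers `load_features` and `load_target` are hypothetical, and the relative path assumes tests are run from the project root; only three of the fixtures are shown.

```python
from pathlib import Path

import pandas as pd
import pytest


def load_features(path):
    """Read a feature CSV into a DataFrame."""
    return pd.read_csv(path)


def load_target(path):
    """Read a single-column CSV and return it as a Series."""
    return pd.read_csv(path).squeeze("columns")


@pytest.fixture(scope="module")
def processed_dir():
    # Assumption: pytest is invoked from the project root.
    return Path("data/processed").resolve()


@pytest.fixture(scope="module")
def X_train(processed_dir):
    return load_features(processed_dir / "X_train.csv")


@pytest.fixture(scope="module")
def y_train(processed_dir):
    return load_target(processed_dir / "y_train.csv")
```

With `scope="module"`, each CSV is read once per test module and the same object is shared by every test that requests the fixture, so tests should treat the loaded data as read-only.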
