Data Validation Tests¶
Data validation tests run automatically after preprocessing (Step 3 in the pipeline). If any test fails, the pipeline stops and training does not proceed with invalid data.
Running Tests¶
python -m pytest tests/test_data.py -v
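One way to wire this gate into a pipeline script is to run the same pytest command in a subprocess and abort on a non-zero exit code. The sketch below is an assumption about how such a step could look, not the project's actual orchestration code; run_validation_gate and validate_then_train are hypothetical helper names.

```python
import subprocess
import sys


def run_validation_gate(command):
    """Run a command and return True only if it exits with status 0."""
    completed = subprocess.run(command)
    return completed.returncode == 0


def validate_then_train(train_step):
    """Hypothetical pipeline step: train only if the data tests pass."""
    pytest_command = [sys.executable, "-m", "pytest", "tests/test_data.py", "-v"]
    if not run_validation_gate(pytest_command):
        # pytest exits non-zero when any test fails or errors.
        raise SystemExit("Data validation failed; training skipped.")
    train_step()
```

Passing the training step in as a callable keeps the gate independent of how training is launched.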
Test Descriptions¶
test_data_files_exist¶
Verifies that all expected processed data files exist on disk:
- data/processed/X_train.csv
- data/processed/X_valid.csv
- data/processed/y_train.csv
- data/processed/y_valid.csv
- data/processed/X_test_competition.csv
- data/processed/test_passenger_ids.csv
test_preprocessing_pipeline_exists¶
Verifies that the fitted preprocessing pipeline pickle file exists at the path specified in config.yaml (default: models/preprocessing_pipeline.pkl).
test_no_missing_values_after_transform¶
Applies the fitted onehot_preprocessor to both X_train and X_valid, then asserts that the transformed arrays contain no NaN values.
This validates that the SimpleImputer steps in the preprocessing pipeline are working correctly.
test_correct_shape¶
Asserts that:
- X_train and X_valid have the same number of columns
- The column count matches the total number of configured features (numeric_features + categorical_features)
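The two shape assertions could be written roughly as follows. check_feature_shape is a hypothetical helper; it assumes numeric_features and categorical_features are lists of column names read from the configuration, and it works with any object that has a two-dimensional shape attribute, such as a DataFrame.

```python
def check_feature_shape(X_train, X_valid, numeric_features, categorical_features):
    """Assert that both splits have the configured number of feature columns."""
    expected = len(numeric_features) + len(categorical_features)
    n_train_cols = X_train.shape[1]
    n_valid_cols = X_valid.shape[1]
    assert n_train_cols == n_valid_cols, (
        f"X_train has {n_train_cols} columns but X_valid has {n_valid_cols}"
    )
    assert n_train_cols == expected, (
        f"Expected {expected} configured features, found {n_train_cols} columns"
    )
    return expected
```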
test_target_is_binary¶
Asserts that y_train and y_valid only contain values 0 and 1.
test_no_duplicate_rows¶
Checks for duplicate rows in X_train. Because we drop identifier columns (Name, Ticket, Cabin, PassengerId), some passengers may naturally have identical feature values.
Note: This test warns rather than fails, since feature-level duplicates are expected for the Titanic dataset after dropping identifiers.
test_train_valid_split_ratio¶
Asserts that the actual validation split ratio approximately matches the configured test_size (within ±5% tolerance).
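The ratio check amounts to simple arithmetic, sketched below. It assumes the tolerance is an absolute difference of 0.05 between the observed and configured ratio; the source does not say whether the ±5% is absolute or relative, and check_split_ratio is a hypothetical helper.

```python
def check_split_ratio(n_train, n_valid, test_size, tolerance=0.05):
    """Assert the observed validation fraction is close to the configured one."""
    actual = n_valid / (n_train + n_valid)
    # Assumption: tolerance is an absolute difference, not a relative one.
    assert abs(actual - test_size) <= tolerance, (
        f"Validation ratio {actual:.3f} differs from test_size {test_size} "
        f"by more than {tolerance}"
    )
    return actual
```

For example, with hypothetical row counts of 712 training rows and 179 validation rows, the observed ratio is 179 / 891, which is within 0.05 of a configured test_size of 0.2.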
Fixtures¶
The tests use pytest module-scoped fixtures to load data efficiently:
| Fixture | Description |
|---|---|
| processed_dir | Resolved path to data/processed/ |
| X_train | Loaded X_train.csv as a DataFrame |
| X_valid | Loaded X_valid.csv as a DataFrame |
| y_train | Loaded y_train.csv as a Series |
| y_valid | Loaded y_valid.csv as a Series |
| pipeline_bundle | Loaded preprocessing pipeline pickle dict |
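A module-scoped fixture is created once per test module and shared by every test in it, so each CSV is read a single time. The sketch below shows three of the fixtures under assumptions: the test file is assumed to live one directory below the project root, the target CSV is assumed to hold a single column, and load_features and load_target are hypothetical helpers.

```python
from pathlib import Path

import pandas as pd
import pytest


def load_features(path):
    """Read a feature CSV into a DataFrame."""
    return pd.read_csv(path)


def load_target(path):
    """Read a single-column CSV and return it as a Series."""
    # squeeze("columns") converts a one-column DataFrame into a Series.
    return pd.read_csv(path).squeeze("columns")


@pytest.fixture(scope="module")
def processed_dir():
    # Assumption: this file sits in tests/, one level below the project root.
    return Path(__file__).resolve().parents[1] / "data" / "processed"


@pytest.fixture(scope="module")
def X_train(processed_dir):
    return load_features(processed_dir / "X_train.csv")


@pytest.fixture(scope="module")
def y_train(processed_dir):
    return load_target(processed_dir / "y_train.csv")
```

The remaining fixtures would follow the same pattern, each depending on processed_dir.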