31. Preprocess API
This page documents the preprocess API. For workflows, see Preprocess how-to.
31.1 What it is for
The preprocess brick runs deterministic, cacheable transformation pipelines to produce feature matrices. [1][2]
31.2 Examples
Build and run a plan:
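from modssc.data_loader import load_dataset
from modssc.preprocess import PreprocessPlan, StepConfig, preprocess

# Load the toy dataset (download=True permits fetching it if needed).
ds = load_dataset("toy", download=True)

# A plan is an ordered tuple of step configurations.
plan = PreprocessPlan(
    steps=(StepConfig(step_id="core.ensure_2d"), StepConfig(step_id="core.to_numpy"))
)

# Pass the indices of all training samples as the fit scope.
res = preprocess(ds, plan, seed=0, fit_indices=list(range(len(ds.train.y))))
print(res.preprocess_fingerprint)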
Inspect available steps:
from modssc.preprocess import available_steps, step_info
print(available_steps())
print(step_info("core.ensure_2d"))
The step catalog is defined in src/modssc/preprocess/catalog.py. The package-level API remains modssc.preprocess, while internal orchestration is split between plan/catalog modules and internal service modules. [3]
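One way a step catalog can stay cheap to import is to map step IDs to import paths and resolve them only on first use. The sketch below illustrates that idea with the standard library only; it is not ModSSC's actual catalog. `CATALOG`, `available_demo_steps`, `load_step`, and the `demo.*` step IDs are hypothetical names, and stdlib callables stand in for real step implementations.

```python
import importlib
from functools import lru_cache

# Hypothetical catalog: step ID -> "module:attribute" import path.
# Stdlib callables stand in for real step implementations.
CATALOG = {
    "demo.sqrt": "math:sqrt",
    "demo.dumps": "json:dumps",
}


def available_demo_steps():
    """Return the sorted step IDs without importing any implementation."""
    return sorted(CATALOG)


@lru_cache(maxsize=None)
def load_step(step_id):
    """Import and return the callable registered under step_id."""
    try:
        target = CATALOG[step_id]
    except KeyError:
        raise KeyError(f"unknown step: {step_id!r}") from None
    module_name, _, attr = target.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


print(available_demo_steps())
print(load_step("demo.sqrt")(9.0))
```

Listing step IDs touches only the dictionary, so a caller can browse the catalog without pulling in any optional dependency; the import cost is paid once, when a step is first loaded.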
Warning
Preprocess cache artifacts are intended for trusted local reuse only. Do not reuse manifests or cached tensor files from untrusted sources.
31.3 API reference
Preprocessing and representation utilities for ModSSC.
This module provides a deterministic, cacheable, and extensible preprocessing pipeline that
consumes datasets from modssc.data_loader and (optionally) split information from
modssc.sampling.
Design goals:
- minimal imports (optional dependencies are loaded lazily)
- reproducibility via plan + dataset fingerprint + fit scope fingerprint
- step registry for easy contribution
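The reproducibility goal can be pictured as a cache key derived from the three ingredients named above. The following is a minimal sketch under stated assumptions, not the library's implementation: `_digest`, `fit_scope_fingerprint`, and `run_fingerprint` are hypothetical helpers, SHA-256 over canonical JSON is an assumed hashing scheme, and treating the fit scope as an order-insensitive set of indices is also an assumption.

```python
import hashlib
import json


def _digest(payload):
    """Hash a JSON-serializable payload in a canonical form."""
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def fit_scope_fingerprint(fit_indices):
    """Fingerprint the sample indices used for fitting (order-insensitive)."""
    return _digest(sorted(int(i) for i in fit_indices))


def run_fingerprint(plan_steps, dataset_fingerprint, fit_indices):
    """Combine plan, dataset fingerprint, and fit scope into one key."""
    return _digest({
        "plan": list(plan_steps),  # step order matters, so keep it as given
        "dataset": dataset_fingerprint,
        "fit_scope": fit_scope_fingerprint(fit_indices),
    })


steps = [{"step_id": "core.ensure_2d"}, {"step_id": "core.to_numpy"}]
a = run_fingerprint(steps, "dataset-abc", range(10))
b = run_fingerprint(steps, "dataset-abc", range(10))
c = run_fingerprint(steps, "dataset-abc", range(5))
print(a == b, a == c)
```

Identical inputs yield the same key, so a cached result can be reused; changing the fit scope (or the plan, or the dataset) yields a different key, so stale results are not picked up by mistake.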
31.4 PreprocessPlan (dataclass)
A preprocessing plan.
A plan is independent from a dataset. Conditional logic is applied during resolution.
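To make the separation between a plan and its resolution concrete, here is a small illustrative sketch. It does not reproduce the real `PreprocessPlan`: `DemoStep`, `DemoPlan`, the `modalities` condition, the `resolve` function, and the `demo.tokenize` step ID are all hypothetical, chosen only to show how the same plan might resolve differently depending on the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoStep:
    step_id: str
    # Hypothetical condition: run only for these modalities (empty = always).
    modalities: tuple = ()


@dataclass(frozen=True)
class DemoPlan:
    steps: tuple


def resolve(plan, dataset_modality):
    """Return the step IDs that apply to a dataset of the given modality."""
    return [
        s.step_id
        for s in plan.steps
        if not s.modalities or dataset_modality in s.modalities
    ]


plan = DemoPlan(steps=(
    DemoStep("core.ensure_2d"),
    DemoStep("demo.tokenize", modalities=("text",)),
    DemoStep("core.to_numpy"),
))
print(resolve(plan, "tabular"))
print(resolve(plan, "text"))
```

The plan object never references a dataset; only `resolve` does, which is why one plan can be shared across datasets and still produce a dataset-specific sequence of steps.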
Source code in src/modssc/preprocess/plan.py
31.5 StepConfig (dataclass)
A single step configuration in a preprocessing plan.
Source code in src/modssc/preprocess/plan.py