31. Preprocess API

This page documents the preprocess API. For workflows, see Preprocess how-to.

31.1 What it is for

The preprocess brick runs deterministic, cacheable transformation pipelines to produce feature matrices. ^[1][2]

31.2 Examples

Build and run a plan:

from modssc.data_loader import load_dataset
from modssc.preprocess import PreprocessPlan, StepConfig, preprocess

ds = load_dataset("toy", download=True)
plan = PreprocessPlan(
    steps=(
        StepConfig(step_id="core.ensure_2d"),
        StepConfig(step_id="core.to_numpy"),
    )
)
res = preprocess(ds, plan, seed=0, fit_indices=list(range(len(ds.train.y))))
print(res.preprocess_fingerprint)

Inspect available steps:

from modssc.preprocess import available_steps, step_info

print(available_steps())
print(step_info("core.ensure_2d"))

The step catalog is defined in src/modssc/preprocess/catalog.py. The package-level API remains modssc.preprocess, while internal orchestration is split between plan/catalog modules and internal service modules. ^[3]

Warning

Preprocess cache artifacts are intended for trusted local reuse only. Do not reuse manifests or cached tensor files from untrusted sources.

31.3 API reference

Preprocessing and representation utilities for ModSSC.

This module provides a deterministic, cacheable, and extensible preprocessing pipeline that consumes datasets from `modssc.data_loader` and (optionally) split information from `modssc.sampling`.

Design goals:

- minimal imports (optional dependencies are loaded lazily)
- reproducibility via plan + dataset fingerprint + fit scope fingerprint
- step registry for easy contribution
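The reproducibility goal names three inputs: the plan, the dataset fingerprint, and the fit scope fingerprint. The sketch below shows one way such fingerprints could be combined into a single cache key. Everything in it is an illustrative assumption rather than ModSSC's actual implementation: the function names `fit_scope_fingerprint` and `cache_key`, the placeholder fingerprint strings, the use of SHA-256, and the choice to sort indices so that their order does not affect the key.

```python
import hashlib


def fit_scope_fingerprint(fit_indices: list[int]) -> str:
    """Hypothetical helper: hash the set of indices used for fitting."""
    # Assumption: sorting makes the fingerprint independent of index order.
    joined = ",".join(str(i) for i in sorted(fit_indices))
    return "fit:" + hashlib.sha256(joined.encode("utf-8")).hexdigest()


def cache_key(plan_fp: str, dataset_fp: str, fit_fp: str) -> str:
    """Hypothetical helper: derive one key from the three fingerprints."""
    combined = "\n".join((plan_fp, dataset_fp, fit_fp))
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


# Placeholder plan and dataset fingerprints, for illustration only.
key_a = cache_key("plan:abc", "dataset:123", fit_scope_fingerprint([0, 1, 2]))
key_b = cache_key("plan:abc", "dataset:123", fit_scope_fingerprint([2, 1, 0]))
key_c = cache_key("plan:abc", "dataset:123", fit_scope_fingerprint([0, 1]))
```

Under these assumptions, `key_a` and `key_b` match because they describe the same fit scope, while `key_c` differs because a cached result fitted on fewer samples should not be reused.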

31.4 `PreprocessPlan` `dataclass`

A preprocessing plan.

A plan is independent from a dataset. Conditional logic is applied during resolution.

Source code in src/modssc/preprocess/plan.py

@dataclass(frozen=True)
class PreprocessPlan:
    """A preprocessing plan.

    A plan is independent from a dataset. Conditional logic is applied during resolution.
    """

    steps: tuple[StepConfig, ...]
    output_key: str = "features.X"

    def to_dict(self) -> dict[str, Any]:
        return {
            "output_key": self.output_key,
            "steps": [
                {
                    "id": s.step_id,
                    "params": dict(s.params),
                    "modalities": list(s.modalities),
                    "requires_fields": list(s.requires_fields),
                    "enabled": bool(s.enabled),
                }
                for s in self.steps
            ],
        }

    def fingerprint(self) -> str:
        return fingerprint(self.to_dict(), prefix="plan:")
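The `fingerprint` helper called by `PreprocessPlan.fingerprint()` is not shown on this page. As a minimal sketch of how such a helper might behave, the example below hashes a canonical JSON encoding of a dictionary shaped like the output of `to_dict()`. The name `sketch_fingerprint`, the JSON canonicalization, and the SHA-256 digest are assumptions made for illustration; the library's helper may differ.

```python
import hashlib
import json
from typing import Any


def sketch_fingerprint(payload: dict[str, Any], prefix: str = "") -> str:
    """Hypothetical stand-in for the library's fingerprint helper."""
    # sort_keys and fixed separators give a canonical encoding, so equal
    # dictionaries hash identically regardless of key insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}{digest}"


# Dictionary shaped like the output of PreprocessPlan.to_dict().
plan_dict = {
    "output_key": "features.X",
    "steps": [
        {
            "id": "core.ensure_2d",
            "params": {},
            "modalities": [],
            "requires_fields": [],
            "enabled": True,
        },
    ],
}

# Same content with keys in a different order.
reordered = {"steps": plan_dict["steps"], "output_key": plan_dict["output_key"]}

# Same plan with its only step disabled.
disabled = {**plan_dict, "steps": [{**plan_dict["steps"][0], "enabled": False}]}

fp = sketch_fingerprint(plan_dict, prefix="plan:")
fp_reordered = sketch_fingerprint(reordered, prefix="plan:")
fp_disabled = sketch_fingerprint(disabled, prefix="plan:")
```

In this sketch, `fp` equals `fp_reordered` but differs from `fp_disabled`, which is the property a cache key needs: any change to a step's configuration, including its `enabled` flag, yields a new fingerprint.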

31.5 `StepConfig` `dataclass`

A single step configuration in a preprocessing plan.

Source code in src/modssc/preprocess/plan.py

@dataclass(frozen=True)
class StepConfig:
    """A single step configuration in a preprocessing plan."""

    step_id: str
    params: Mapping[str, Any] = field(default_factory=dict)
    modalities: tuple[str, ...] = ()
    requires_fields: tuple[str, ...] = ()
    enabled: bool = True
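Because `StepConfig` is declared with `frozen=True`, its fields cannot be reassigned after construction. The sketch below illustrates this using a local copy of the definition shown above, so it runs without ModSSC installed. It relies only on the standard `dataclasses` module; the variable names are illustrative.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace
from typing import Any, Mapping


# Local copy of the StepConfig definition shown above.
@dataclass(frozen=True)
class StepConfig:
    step_id: str
    params: Mapping[str, Any] = field(default_factory=dict)
    modalities: tuple[str, ...] = ()
    requires_fields: tuple[str, ...] = ()
    enabled: bool = True


step = StepConfig(step_id="core.ensure_2d")

# Frozen dataclasses reject attribute assignment.
try:
    step.enabled = False
    mutated = True
except FrozenInstanceError:
    mutated = False

# dataclasses.replace builds a modified copy and leaves the original intact.
disabled = replace(step, enabled=False)
```

Note that `frozen=True` only prevents rebinding attributes. A plain `dict` passed as `params` can still be mutated in place, so treating it as read-only is a convention the caller has to follow.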

Sources