22. Preprocess API

This page documents the preprocess API. For workflows, see Preprocess how-to.

22.1 What it is for

The preprocess brick runs deterministic, cacheable transformation pipelines to produce feature matrices. ^[1][2]

22.2 Examples

Build and run a plan:

from modssc.data_loader import load_dataset
from modssc.preprocess import PreprocessPlan, StepConfig, preprocess

ds = load_dataset("toy", download=True)
plan = PreprocessPlan(
    steps=(
        StepConfig(step_id="core.ensure_2d"),
        StepConfig(step_id="core.to_numpy"),
    )
)
res = preprocess(ds, plan, seed=0, fit_indices=list(range(len(ds.train.y))))
print(res.preprocess_fingerprint)

Inspect available steps:

from modssc.preprocess import available_steps, step_info

print(available_steps())
print(step_info("core.ensure_2d"))

The step catalog is defined in src/modssc/preprocess/catalog.py. ^[3]

22.3 API reference

Preprocessing and representation utilities for ModSSC.

This module provides a deterministic, cacheable, and extensible preprocessing pipeline that consumes datasets from `modssc.data_loader` and (optionally) split information from `modssc.sampling`.

Design goals:

- minimal imports (optional dependencies are loaded lazily)
- reproducibility via plan + dataset fingerprint + fit scope fingerprint
- step registry for easy contribution
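To make the step-registry goal more concrete, here is a minimal sketch of what a contributed fittable step might look like. It mirrors only the calls visible in the `preprocess()` source further down this page: `fit(store, fit_indices=..., rng=...)` and `transform(store, rng=...)`, with `transform` returning a dict of produced artifacts. `ToyStore` and `CenterColumns` are hypothetical stand-ins, not ModSSC classes, and plain lists plus `random.Random` replace NumPy arrays and generators so that the snippet runs without dependencies.

```python
from __future__ import annotations

import random
from typing import Any


class ToyStore:
    """Hypothetical stand-in for the artifact store used by preprocess()."""

    def __init__(self, initial: dict[str, Any]) -> None:
        self._data = dict(initial)

    def require(self, key: str) -> Any:
        if key not in self._data:
            raise KeyError(f"missing artifact: {key}")
        return self._data[key]

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value


class CenterColumns:
    """Hypothetical fittable step: subtract column means learned on fit_indices."""

    def __init__(self) -> None:
        self.means: list[float] | None = None

    def fit(self, store: ToyStore, *, fit_indices: list[int], rng: random.Random) -> None:
        # Statistics are computed only from the rows selected by fit_indices.
        rows = [store.require("raw.X")[i] for i in fit_indices]
        n_cols = len(rows[0])
        self.means = [sum(r[j] for r in rows) / len(rows) for j in range(n_cols)]

    def transform(self, store: ToyStore, *, rng: random.Random) -> dict[str, Any]:
        if self.means is None:
            raise RuntimeError("fit() must be called before transform()")
        X = store.require("raw.X")
        centered = [[v - m for v, m in zip(row, self.means)] for row in X]
        # Steps return a dict mapping artifact keys to produced values.
        return {"features.X": centered}


train = ToyStore({"raw.X": [[1.0, 10.0], [3.0, 30.0], [100.0, 1000.0]]})
step = CenterColumns()
rng = random.Random(0)
step.fit(train, fit_indices=[0, 1], rng=rng)
produced = step.transform(train, rng=rng)
for key, value in produced.items():
    train.set(key, value)
```

The third row is excluded from fitting, so its outlying values do not influence the learned means, which is the reason fittable steps take `fit_indices` in the first place.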

22.4 `PreprocessPlan` `dataclass`

A preprocessing plan.

A plan is independent from a dataset. Conditional logic is applied during resolution.

Source code in src/modssc/preprocess/plan.py

@dataclass(frozen=True)
class PreprocessPlan:
    """A preprocessing plan.

    A plan is independent from a dataset. Conditional logic is applied during resolution.
    """

    steps: tuple[StepConfig, ...]
    output_key: str = "features.X"

    def to_dict(self) -> dict[str, Any]:
        return {
            "output_key": self.output_key,
            "steps": [
                {
                    "id": s.step_id,
                    "params": dict(s.params),
                    "modalities": list(s.modalities),
                    "requires_fields": list(s.requires_fields),
                    "enabled": bool(s.enabled),
                }
                for s in self.steps
            ],
        }

    def fingerprint(self) -> str:
        return fingerprint(self.to_dict(), prefix="plan:")
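As a rough illustration of why a plan fingerprint can serve as a reproducibility key, the sketch below rebuilds the two dataclasses with the standard library and hashes the output of `to_dict()`. The `toy_fingerprint` helper (canonical JSON plus SHA-256) is an assumption made for this example; the real `fingerprint()` helper is not shown on this page and may serialize differently.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Mapping


def toy_fingerprint(obj: Any, *, prefix: str = "") -> str:
    """Hypothetical stand-in for fingerprint(): canonical JSON + SHA-256."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return prefix + hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class StepConfig:
    step_id: str
    params: Mapping[str, Any] = field(default_factory=dict)
    modalities: tuple[str, ...] = ()
    requires_fields: tuple[str, ...] = ()
    enabled: bool = True


@dataclass(frozen=True)
class PreprocessPlan:
    steps: tuple[StepConfig, ...]
    output_key: str = "features.X"

    def to_dict(self) -> dict[str, Any]:
        return {
            "output_key": self.output_key,
            "steps": [
                {
                    "id": s.step_id,
                    "params": dict(s.params),
                    "modalities": list(s.modalities),
                    "requires_fields": list(s.requires_fields),
                    "enabled": bool(s.enabled),
                }
                for s in self.steps
            ],
        }

    def fingerprint(self) -> str:
        return toy_fingerprint(self.to_dict(), prefix="plan:")


plan_a = PreprocessPlan(steps=(StepConfig("core.ensure_2d"), StepConfig("core.to_numpy")))
plan_b = PreprocessPlan(steps=(StepConfig("core.ensure_2d"), StepConfig("core.to_numpy")))
plan_c = PreprocessPlan(steps=(StepConfig("core.to_numpy"), StepConfig("core.ensure_2d")))

same = plan_a.fingerprint() == plan_b.fingerprint()
reordered_differs = plan_a.fingerprint() != plan_c.fingerprint()
```

Two plans built from the same steps and parameters hash to the same value, while reordering the steps changes the fingerprint, because `to_dict()` serializes steps as an ordered list.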

22.5 `StepConfig` `dataclass`

A single step configuration in a preprocessing plan.

Source code in src/modssc/preprocess/plan.py

@dataclass(frozen=True)
class StepConfig:
    """A single step configuration in a preprocessing plan."""

    step_id: str
    params: Mapping[str, Any] = field(default_factory=dict)
    modalities: tuple[str, ...] = ()
    requires_fields: tuple[str, ...] = ()
    enabled: bool = True

22.6 `fit_transform(*args, **kwargs)`

Alias for preprocess().

Source code in src/modssc/preprocess/api.py

def fit_transform(*args: Any, **kwargs: Any) -> PreprocessResult:
    """Alias for preprocess()."""
    return preprocess(*args, **kwargs)

22.7 `preprocess(dataset, plan, *, seed=0, fit_indices=None, cache=True, cache_dir=None, purge_unused_artifacts=False, registry=None)`

Run preprocessing.

Parameters:

- dataset: LoadedDataset from modssc.data_loader
- plan: PreprocessPlan
- seed: master seed for deterministic steps
- fit_indices: optional indices (relative to train split) used by fittable steps; required if the plan contains a fittable step
- cache: enable step-level cache
- cache_dir: optional override of preprocessing cache directory
- purge_unused_artifacts: drop artifacts not needed by downstream steps
- registry: optional StepRegistry used to resolve and instantiate steps; defaults to the registry returned by default_step_registry()
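Because `fit_indices` feeds the fit-scope fingerprint, both the selection and the ordering of indices affect the cache keys of fittable steps. The sketch below follows the hashing shown in the `preprocess()` source (SHA-256 over the int64 bytes of the flattened index array), but uses the standard-library `array` module instead of NumPy so that it runs without dependencies. The name `fit_scope_fingerprint` is hypothetical, and the byte-for-byte match with `np.int64` is an assumption that should hold when both use the platform's native byte order.

```python
from __future__ import annotations

import hashlib
from array import array
from typing import Iterable, Optional


def fit_scope_fingerprint(fit_indices: Optional[Iterable[int]]) -> str:
    """Hypothetical helper mirroring the fit_fp logic in preprocess()."""
    if fit_indices is None:
        return "fit:none"
    # 'q' is a signed long long (8 bytes on common platforms), native byte order.
    raw = array("q", fit_indices).tobytes()
    return "fit:" + hashlib.sha256(raw).hexdigest()


fp_none = fit_scope_fingerprint(None)
fp_a = fit_scope_fingerprint([0, 1, 2])
fp_b = fit_scope_fingerprint([0, 1, 2])
fp_reordered = fit_scope_fingerprint([2, 1, 0])
```

The same indices in a different order produce a different fingerprint, so keeping a stable ordering (for example, sorting the indices when order carries no meaning) may help cache reuse across runs.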

Source code in src/modssc/preprocess/api.py

def preprocess(
    dataset: LoadedDataset,
    plan: PreprocessPlan,
    *,
    seed: int = 0,
    fit_indices: np.ndarray | None = None,
    cache: bool = True,
    cache_dir: str | None = None,
    purge_unused_artifacts: bool = False,
    registry: StepRegistry | None = None,
) -> PreprocessResult:
    """Run preprocessing.

    Parameters
    - dataset: LoadedDataset from modssc.data_loader
    - plan: PreprocessPlan
    - seed: master seed for deterministic steps
    - fit_indices: optional indices (relative to train split) used by fittable steps
    - cache: enable step-level cache
    - cache_dir: optional override of preprocessing cache directory
    - purge_unused_artifacts: drop artifacts not needed by downstream steps
    """
    start = perf_counter()
    reg = registry or default_step_registry()
    resolved = resolve_plan(dataset, plan, registry=reg)
    dataset_fp = _dataset_fingerprint(dataset)

    fit_fp = "fit:none"
    if fit_indices is not None:
        fit_arr = np.asarray(fit_indices, dtype=np.int64).reshape(-1)
        fit_hash = hashlib.sha256(fit_arr.tobytes()).hexdigest()
        fit_fp = f"fit:{fit_hash}"

    preprocess_fp = fingerprint(
        {
            "dataset_fp": dataset_fp,
            "resolved_plan_fp": resolved.fingerprint,
            "fit_fp": fit_fp,
            "seed": int(seed),
        },
        prefix="preprocess:",
    )

    cm = None
    if cache:
        cm = CacheManager.for_dataset(dataset_fp)
        if cache_dir is not None:
            cm.root = Path(cache_dir).expanduser().resolve()

    train_store = _initial_store(dataset.train)
    test_store = _initial_store(dataset.test) if dataset.test is not None else None
    purge_keep_sets = None
    if purge_unused_artifacts:
        purge_keep_sets = _build_purge_keep_sets(
            resolved.steps, output_key=plan.output_key, initial_keys=set(train_store)
        )
        logger.info(
            "Preprocess purge enabled: retaining minimal artifacts per step (steps=%s)",
            len(purge_keep_sets),
        )
        if purge_keep_sets and logger.isEnabledFor(logging.DEBUG):
            logger.debug("Preprocess purge keep keys (step 0): %s", sorted(purge_keep_sets[0]))

    logger.info(
        "Preprocess start: dataset_fp=%s steps=%s output_key=%s seed=%s cache=%s",
        dataset_fp,
        [s.step_id for s in resolved.steps],
        plan.output_key,
        seed,
        bool(cache),
    )
    logger.debug(
        "Preprocess input shapes: train_X=%s train_y=%s test_X=%s test_y=%s",
        _shape_of(dataset.train.X),
        _shape_of(dataset.train.y),
        _shape_of(dataset.test.X) if dataset.test is not None else None,
        _shape_of(dataset.test.y) if dataset.test is not None else None,
    )
    logger.info(
        "Preprocess input size: train_data=%s test_data=%s",
        _format_split_size(train_store, output_key=plan.output_key),
        _format_split_size(test_store, output_key=plan.output_key),
    )

    prov_train = {k: f"{dataset_fp}:{k}" for k in train_store}
    prov_test = {k: f"{dataset_fp}:{k}" for k in (test_store if test_store else [])}

    gpu_logged = False
    for _step_num, step in enumerate(resolved.steps):
        step_id = step.step_id
        spec = step.spec
        derived = derive_seed(seed, step_id=step_id, step_index=step.index)
        rng = np.random.default_rng(derived)
        step_start = perf_counter()
        logger.debug(
            "Preprocess step start: id=%s index=%s kind=%s params=%s",
            step_id,
            step.index,
            spec.kind,
            dict(step.params),
        )

        # Compute step fingerprint with input provenance.
        inputs_train = {k: prov_train.get(k) for k in spec.consumes if k in prov_train}
        inputs_test = (
            {k: prov_test.get(k) for k in spec.consumes if k in prov_test} if test_store else {}
        )
        step_fp = fingerprint(
            {
                "dataset_fp": dataset_fp,
                "step_id": step_id,
                "index": step.index,
                "params": dict(step.params),
                "kind": spec.kind,
                "seed": int(derived),
                "fit_fp": fit_fp if spec.kind == "fittable" else None,
                "inputs_train": inputs_train,
                "inputs_test": inputs_test,
            },
            prefix="step:",
        )

        step_obj = reg.instantiate(step_id, params=dict(step.params))

        # Fit if needed.
        if spec.kind == "fittable":
            if fit_indices is None:
                raise PreprocessValidationError(
                    f"Step {step_id!r} is fittable but fit_indices is None."
                )
            if not hasattr(step_obj, "fit"):
                raise PreprocessValidationError(
                    f"Step {step_id!r} declared fittable but has no fit()."
                )
            step_obj.fit(train_store, fit_indices=np.asarray(fit_indices, dtype=np.int64), rng=rng)

        # Load from cache if available, otherwise compute and save.
        produced_train: dict[str, Any] | None = None
        train_from_cache = False
        if cm is not None and cm.has_step_outputs(step_fp, split="train"):
            try:
                produced_train = cm.load_step_outputs(step_fingerprint=step_fp, split="train")
                if not _cache_outputs_complete(produced_train, spec.produces):
                    raise PreprocessCacheError(
                        f"Incomplete cached outputs for step {step_id!r} (train)"
                    )
                train_from_cache = True
            except PreprocessCacheError as e:
                logger.warning("Preprocess cache miss for %s (train): %s", step_id, e)
                produced_train = None
        if produced_train is None:
            produced_train = step_obj.transform(train_store, rng=rng)
            if not isinstance(produced_train, dict):
                raise PreprocessValidationError(
                    f"Step {step_id!r} must return a dict of produced artifacts."
                )
            if cm is not None:
                cm.save_step_outputs(
                    step_fingerprint=step_fp,
                    split="train",
                    produced=produced_train,
                    manifest={
                        "step_id": step_id,
                        "index": step.index,
                        "params": dict(step.params),
                        "kind": spec.kind,
                        "required_extra": spec.required_extra,
                        "consumes": list(spec.consumes),
                        "produces": list(spec.produces),
                        "inputs_train": inputs_train,
                        "fit_fp": fit_fp if spec.kind == "fittable" else None,
                        "seed": int(derived),
                    },
                )

        for k, v in produced_train.items():
            train_store.set(k, v)
            prov_train[k] = step_fp
        logger.debug(
            "Preprocess step train outputs: id=%s keys=%s duration_s=%.3f",
            step_id,
            sorted(produced_train.keys()),
            perf_counter() - step_start,
        )

        produced_test: dict[str, Any] | None = None
        test_from_cache = False
        if test_store is not None:
            if cm is not None and cm.has_step_outputs(step_fp, split="test"):
                try:
                    produced_test = cm.load_step_outputs(step_fingerprint=step_fp, split="test")
                    if not _cache_outputs_complete(produced_test, spec.produces):
                        raise PreprocessCacheError(
                            f"Incomplete cached outputs for step {step_id!r} (test)"
                        )
                    test_from_cache = True
                except PreprocessCacheError as e:
                    logger.warning("Preprocess cache miss for %s (test): %s", step_id, e)
                    produced_test = None
            if produced_test is None:
                produced_test = step_obj.transform(test_store, rng=rng)
                if not isinstance(produced_test, dict):
                    raise PreprocessValidationError(
                        f"Step {step_id!r} must return a dict of produced artifacts."
                    )
                if cm is not None:
                    cm.save_step_outputs(
                        step_fingerprint=step_fp,
                        split="test",
                        produced=produced_test,
                        manifest={
                            "step_id": step_id,
                            "index": step.index,
                            "params": dict(step.params),
                            "kind": spec.kind,
                            "required_extra": spec.required_extra,
                            "consumes": list(spec.consumes),
                            "produces": list(spec.produces),
                            "inputs_test": inputs_test,
                            "fit_fp": fit_fp if spec.kind == "fittable" else None,
                            "seed": int(derived),
                        },
                    )

            for k, v in produced_test.items():
                test_store.set(k, v)
                prov_test[k] = step_fp
            logger.debug(
                "Preprocess step test outputs: id=%s keys=%s",
                step_id,
                sorted(produced_test.keys()),
            )

        if purge_keep_sets is not None:
            keep = purge_keep_sets[_step_num]
            _purge_store(train_store, keep=keep)
            if test_store is not None:
                _purge_store(test_store, keep=keep)

        if not gpu_logged:
            computed = not train_from_cache
            if test_store is not None:
                computed = computed or not test_from_cache
            gpu_logged = _maybe_log_gpu_info(
                step.params,
                step_obj,
                produced_train=produced_train,
                produced_test=produced_test,
                use_device_hint=computed,
            )

        step_duration = perf_counter() - step_start
        logger.info(
            "Preprocess step done: id=%s index=%s duration_s=%.3f train_data=%s test_data=%s",
            step_id,
            step.index,
            step_duration,
            _format_split_size(train_store, output_key=plan.output_key),
            _format_split_size(test_store, output_key=plan.output_key),
        )

    # Choose final X for downstream training.
    out_key = plan.output_key
    X_train = train_store.get(out_key, train_store.require("raw.X"))
    y_train = train_store.get("labels.y", train_store.require("raw.y"))

    edges_train = train_store.get("graph.edge_index", dataset.train.edges)
    if train_store.has("graph.edge_weight"):
        edges_train = {
            "edge_index": train_store.get("graph.edge_index"),
            "edge_weight": train_store.get("graph.edge_weight"),
        }

    train_out = Split(X=X_train, y=y_train, edges=edges_train, masks=dataset.train.masks)

    test_out = None
    if dataset.test is not None and test_store is not None:
        X_test = test_store.get(out_key, test_store.require("raw.X"))
        y_test = test_store.get("labels.y", test_store.require("raw.y"))
        edges_test = test_store.get("graph.edge_index", dataset.test.edges)
        if test_store.has("graph.edge_weight"):
            edges_test = {
                "edge_index": test_store.get("graph.edge_index"),
                "edge_weight": test_store.get("graph.edge_weight"),
            }
        test_out = Split(X=X_test, y=y_test, edges=edges_test, masks=dataset.test.masks)

    meta = dict(dataset.meta)
    meta.update(
        {
            "preprocess_fingerprint": preprocess_fp,
            "preprocess_plan_fingerprint": resolved.fingerprint,
            "preprocess_fit_fingerprint": fit_fp,
        }
    )
    if cm is not None:
        meta["preprocess_cache_dir"] = str(cm.dataset_dir())

    out_dataset = LoadedDataset(train=train_out, test=test_out, meta=meta)
    _maybe_warn_nonfinite("train.X", train_out.X)
    if test_out is not None:
        _maybe_warn_nonfinite("test.X", test_out.X)

    logger.info(
        "Preprocess done: dataset_fp=%s duration_s=%.3f train_X=%s test_X=%s",
        dataset_fp,
        perf_counter() - start,
        _shape_of(train_out.X),
        _shape_of(test_out.X) if test_out is not None else None,
    )
    return PreprocessResult(
        dataset=out_dataset,
        plan=resolved,
        preprocess_fingerprint=preprocess_fp,
        train_artifacts=train_store,
        test_artifacts=test_store,
        cache_dir=str(cm.dataset_dir()) if cm is not None else None,
    )
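The per-step caching in the loop above follows a common pattern: build a fingerprint from the step identity, its parameters, and the provenance of the inputs it consumes, then load cached outputs if they exist or compute and save them otherwise. The following is a simplified in-memory sketch of that pattern. `ToyCache`, `step_key`, and `run_step` are hypothetical names, the JSON plus SHA-256 key is an assumption, and the real `CacheManager` persists outputs on disk and also checks that cached outputs are complete.

```python
from __future__ import annotations

import hashlib
import json
from typing import Any, Callable


def step_key(step_id: str, index: int, params: dict[str, Any], inputs: dict[str, str]) -> str:
    """Hypothetical step fingerprint: hash of identity, params, and input provenance."""
    payload = json.dumps(
        {"step_id": step_id, "index": index, "params": params, "inputs": inputs},
        sort_keys=True,
    )
    return "step:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyCache:
    """In-memory stand-in for a step-output cache."""

    def __init__(self) -> None:
        self._entries: dict[str, dict[str, Any]] = {}

    def has(self, key: str) -> bool:
        return key in self._entries

    def load(self, key: str) -> dict[str, Any]:
        return self._entries[key]

    def save(self, key: str, produced: dict[str, Any]) -> None:
        self._entries[key] = produced


def run_step(
    cache: ToyCache, key: str, compute: Callable[[], dict[str, Any]]
) -> tuple[dict[str, Any], bool]:
    """Return (outputs, from_cache): load on a hit, otherwise compute and save."""
    if cache.has(key):
        return cache.load(key), True
    produced = compute()
    cache.save(key, produced)
    return produced, False


cache = ToyCache()
calls: list[str] = []


def double() -> dict[str, Any]:
    calls.append("double")  # record each real computation
    return {"features.X": [2 * v for v in [1, 2, 3]]}


key = step_key("toy.double", 0, {}, {"raw.X": "dataset:abc:raw.X"})
first, first_cached = run_step(cache, key, double)
second, second_cached = run_step(cache, key, double)

# A different input provenance yields a new key, so the step is recomputed.
other_key = step_key("toy.double", 0, {}, {"raw.X": "dataset:xyz:raw.X"})
third, third_cached = run_step(cache, other_key, double)
```

Including input provenance in the key is what lets a change in an upstream step invalidate every downstream step that consumes its outputs, without any explicit invalidation logic.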
