8. How to run preprocessing plans
This recipe shows how to build preprocessing plans and run them from the CLI or from Python. The steps mirror the plan structure used in configs, and the examples show the same flow in code. For step and model IDs, see Catalogs and registries.
8.1 Problem statement
You want to turn raw datasets into feature matrices using a deterministic, cacheable preprocessing plan. [1][2] Plans make it easy to reuse the same feature pipeline across experiments.
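To make "deterministic, cacheable" concrete, here is a minimal sketch of how a plan and seed could be reduced to a stable cache key. It is an illustration only: the `plan_fingerprint` helper is hypothetical, and how ModSSC actually computes `preprocess_fingerprint` is not described here.

```python
import hashlib
import json


def plan_fingerprint(plan: dict, seed: int) -> str:
    """Hash a plan and seed into a stable hex digest (illustrative helper)."""
    # sort_keys and fixed separators make the JSON text independent of
    # dict insertion order, so equal plans always produce the same digest.
    payload = json.dumps(
        {"plan": plan, "seed": seed}, sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


plan = {
    "output_key": "features.X",
    "steps": [{"id": "core.ensure_2d"}, {"id": "core.to_numpy"}],
}
# Same content as `plan`, but with the top-level keys in reverse order.
reordered = dict(reversed(list(plan.items())))

fp_a = plan_fingerprint(plan, seed=0)
fp_b = plan_fingerprint(reordered, seed=0)
fp_c = plan_fingerprint(plan, seed=1)

print(fp_a == fp_b)  # True: same plan and seed, key order does not matter
print(fp_a == fp_c)  # False: a different seed gives a different key
```

A key like this can name a cache directory, so a rerun with an unchanged plan and seed can reuse earlier outputs instead of recomputing them.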
8.2 When to use
Use preprocessing to standardize inputs, build embeddings, or convert raw data to numeric features before modeling. [3][4]
8.3 Steps
1) Inspect available steps and their metadata. [5][6]
2) Write a preprocessing plan (YAML). [2]
3) Run the plan with the CLI or the Python API. [5][1]
For the full list of step IDs and pretrained model IDs, see Catalogs and registries.
8.4 Copy-paste example
Use the CLI when you want to run plan files from the terminal (modssc preprocess in src/modssc/cli/preprocess.py), and use Python when you want to build and run plans in code (API in src/modssc/preprocess/api.py). [5][1]
Plan YAML (from bench/configs/experiments/toy_inductive.yaml): [7]
output_key: "features.X"
steps:
- id: "core.ensure_2d"
- id: "core.to_numpy"
CLI:
modssc preprocess steps list
modssc preprocess run --plan preprocess_plan.yaml --dataset toy --seed 0
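If you want to run the same plan for several seeds, one option is to build the command line in Python and hand it to `subprocess`. The sketch below only assembles and prints the commands, using the flags shown above; the `preprocess_command` helper is hypothetical and not part of ModSSC.

```python
import shlex


def preprocess_command(plan_path, dataset, seed):
    """Build the argv list for one `modssc preprocess run` call."""
    return [
        "modssc", "preprocess", "run",
        "--plan", str(plan_path),
        "--dataset", dataset,
        "--seed", str(seed),
    ]


# One command per seed, all using the same plan file and dataset.
commands = [preprocess_command("preprocess_plan.yaml", "toy", s) for s in range(3)]

for argv in commands:
    print(shlex.join(argv))
    # To execute for real (requires modssc on PATH), uncomment:
    # import subprocess; subprocess.run(argv, check=True)
```

Passing a list of arguments rather than a single shell string avoids quoting problems when a plan path contains spaces.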
Python:
from modssc.data_loader import load_dataset
from modssc.preprocess import PreprocessPlan, StepConfig, preprocess

# Same two steps as the YAML plan, built in code.
plan = PreprocessPlan(
    steps=(StepConfig(step_id="core.ensure_2d"), StepConfig(step_id="core.to_numpy"))
)
ds = load_dataset("toy", download=True)
# Fit on every training row; the fixed seed keeps the run reproducible.
result = preprocess(ds, plan, seed=0, fit_indices=list(range(len(ds.train.y))))
print(result.preprocess_fingerprint)
8.5 Pitfalls
Warning
Some steps require optional extras (for example, text.tfidf, vision.openclip). Use modssc preprocess steps info <id> or step_info() to check required_extra. [3][6][5][8]
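As a complement to checking `required_extra`, you can test whether a module is importable in the current environment before running a plan. This is a generic standard-library sketch: the `missing_modules` helper is hypothetical, and the module names below are placeholders, since the modules installed by each ModSSC extra are not listed here.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported here."""
    # find_spec returns None when no importable module has that name.
    return [name for name in module_names if find_spec(name) is None]


# Placeholder names: one stdlib module and one that should not exist.
candidates = ["json", "surely_not_installed_module_xyz"]
print(missing_modules(candidates))
```

An empty result means every listed module can be imported; anything returned points to a package that still needs installing.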
Tip
If your plan includes fittable steps (PCA, TF-IDF, ZCA), pass fit_indices in Python or set fit_on in bench configs to control the fit subset. [1][9]
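The reason the fit subset matters can be shown with a toy standardizer written in plain Python. This is a sketch of the general idea, not ModSSC's implementation: `fit_scaler` and `transform` are hypothetical helpers, and the data is made up for illustration.

```python
from statistics import fmean, pstdev


def fit_scaler(values, fit_indices):
    """Estimate mean and std from the rows named in fit_indices only."""
    subset = [values[i] for i in fit_indices]
    mean = fmean(subset)
    std = pstdev(subset) or 1.0  # fall back to 1.0 on constant data
    return mean, std


def transform(values, mean, std):
    """Apply the fitted statistics to every row, including unseen ones."""
    return [(v - mean) / std for v in values]


values = [1.0, 2.0, 3.0, 100.0]  # the last row plays the role of held-out data
fit_indices = [0, 1, 2]          # fit on the first three rows only

mean, std = fit_scaler(values, fit_indices)
scaled = transform(values, mean, std)

print(mean)       # 2.0: the held-out row did not influence the statistics
print(scaled[1])  # 0.0: the middle training row sits exactly at the mean
```

Fitting on all four rows would pull the mean toward the held-out value, so information from data the model should not have seen would reach the features.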