7. How to create and reuse sampling splits
Here you will build reproducible SSL splits that you can reuse across methods and experiments. The steps map to the sampling plan fields and the examples show both CLI and Python forms. For end-to-end usage, see the inductive tutorial or transductive tutorial.
7.1 Problem statement
You need a deterministic train/val/test split with labeled vs unlabeled partitions, and you want to reuse it across runs. [1][2][3] Saving the split keeps experiments comparable when you switch methods or rerun trials.
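To make the idea concrete, here is a minimal standard-library sketch of a seeded holdout split with a labeled/unlabeled partition. It is not how modssc implements sampling: the function name `holdout_split`, the simple rounding, and the assumption that the labeled fraction is taken from the training partition are all illustrative, and stratification is left out.

```python
import random


def holdout_split(n_samples, val_fraction, labeled_fraction, seed):
    """Illustrative only: seeded holdout split with a labeled/unlabeled partition."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    indices = list(range(n_samples))
    rng.shuffle(indices)

    # Assumption: counts come from simple rounding of fraction * size.
    n_val = round(n_samples * val_fraction)
    val, train = indices[:n_val], indices[n_val:]

    # Assumption: the labeled fraction is relative to the training partition.
    n_labeled = max(1, round(len(train) * labeled_fraction))
    return {
        "val": sorted(val),
        "train_labeled": sorted(train[:n_labeled]),
        "train_unlabeled": sorted(train[n_labeled:]),
    }


first = holdout_split(100, val_fraction=0.2, labeled_fraction=0.2, seed=42)
second = holdout_split(100, val_fraction=0.2, labeled_fraction=0.2, seed=42)

print(first == second)  # the same seed yields the same partitions
print({name: len(idx) for name, idx in first.items()})
```

Because the split depends only on the inputs and the seed, two methods evaluated on it see exactly the same labeled, unlabeled, and validation indices.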
7.2 When to use
Use this when you need reproducible SSL splits across methods or want to share splits with collaborators. [4][3]
7.3 Steps
1) Define a sampling plan (holdout or k-fold, plus labeling strategy). [2]
2) Create the split from a dataset (CLI or Python). [5][1]
3) Save and load the split for reuse. [3]
If you are running the bench runner, embed the plan in the experiment config described in the configuration reference.
7.4 Copy-paste example
Use the CLI when you want terminal-driven split artifacts (modssc sampling in src/modssc/cli/sampling.py), and use Python when you want to generate splits inside a pipeline (API in src/modssc/sampling/api.py). [5][1]
Sample plan (from bench/configs/experiments/toy_inductive.yaml): [6]
split:
  kind: "holdout"
  test_fraction: 0.0
  val_fraction: 0.2
  stratify: true
  shuffle: true
labeling:
  mode: "fraction"
  value: 0.2
  strategy: "balanced"
  min_per_class: 1
imbalance:
  kind: "none"
policy:
  respect_official_test: true
  allow_override_official: false
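Before handing a plan to the CLI, you may want a quick sanity check on its numeric fields. The sketch below mirrors part of the plan above as a plain dict and applies a few range checks. `check_plan` is a hypothetical helper, not part of modssc, and the rules it enforces are assumptions about what a sensible holdout plan looks like rather than the library's own validation.

```python
# A subset of the sample plan above, written as a plain dict for illustration.
plan = {
    "split": {"kind": "holdout", "test_fraction": 0.0, "val_fraction": 0.2,
              "stratify": True, "shuffle": True},
    "labeling": {"mode": "fraction", "value": 0.2, "strategy": "balanced",
                 "min_per_class": 1},
}


def check_plan(plan):
    """Hypothetical pre-flight check; returns a list of problems (empty if none)."""
    problems = []

    split = plan["split"]
    for key in ("test_fraction", "val_fraction"):
        if not 0.0 <= split[key] < 1.0:
            problems.append(f"split.{key} must be in [0, 1)")
    # Held-out fractions must leave some data for training.
    if split["test_fraction"] + split["val_fraction"] >= 1.0:
        problems.append("test_fraction + val_fraction leaves no training data")

    labeling = plan["labeling"]
    if labeling["mode"] == "fraction" and not 0.0 < labeling["value"] <= 1.0:
        problems.append("labeling.value must be in (0, 1] when mode is 'fraction'")

    return problems


print(check_plan(plan))
```

Returning a list of messages instead of raising on the first problem lets you report every issue in a plan at once.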
CLI:
modssc sampling create --dataset toy --plan sampling_plan.yaml --seed 42 --out splits/toy
modssc sampling show splits/toy
modssc sampling validate splits/toy --dataset toy
Python:
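from modssc.data_loader import load_dataset
from modssc.sampling import HoldoutSplitSpec, LabelingSpec, SamplingPlan, sample, save_split

# Load through the data loader so ds.meta carries the dataset fingerprint.
ds = load_dataset("toy", download=True)

# Same settings as the YAML plan above: 20% validation, 20% labeled.
plan = SamplingPlan(
    split=HoldoutSplitSpec(test_fraction=0.0, val_fraction=0.2, stratify=True),
    labeling=LabelingSpec(mode="fraction", value=0.2, strategy="balanced"),
)

# A fixed seed keeps the split reproducible across runs.
res, _ = sample(ds, plan=plan, seed=42, dataset_fingerprint=str(ds.meta["dataset_fingerprint"]))
# Write the split artifacts to splits/toy, replacing any existing ones.
_ = save_split(res, out_dir="splits/toy", overwrite=True)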
7.5 Pitfalls
Warning
modssc sampling create expects the dataset fingerprint in dataset.meta["dataset_fingerprint"]. This is injected by the data loader when datasets are cached. [7][1]
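If you build datasets by hand or bypass the cache, a small guard can surface a missing fingerprint early. The sketch below is an assumption-laden illustration: `require_fingerprint` is a hypothetical helper, and a plain dict stands in for `dataset.meta`, which is assumed to behave like a mapping.

```python
def require_fingerprint(meta):
    """Return the dataset fingerprint as a string, or fail with a clear message."""
    fingerprint = meta.get("dataset_fingerprint")
    if not fingerprint:
        raise KeyError(
            "dataset.meta has no 'dataset_fingerprint'; load the dataset "
            "through the data loader so the fingerprint is injected"
        )
    return str(fingerprint)


# Stand-in for ds.meta; the value here is made up for the example.
meta = {"dataset_fingerprint": "abc123"}
print(require_fingerprint(meta))
```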
Tip
Split artifacts include split.json and arrays.npz in the output directory, which is what load_split expects. [3]
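When you receive a split directory from a collaborator, it can help to confirm both files are present before loading it. This is a minimal sketch using only the standard library. `missing_artifacts` is a hypothetical helper, not a modssc function, and it checks file presence only, not file contents.

```python
from pathlib import Path

# File names taken from the tip above.
EXPECTED_FILES = ("split.json", "arrays.npz")


def missing_artifacts(split_dir):
    """Return the expected artifact names that are absent from split_dir."""
    root = Path(split_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_artifacts("splits/toy")
if missing:
    print(f"splits/toy is incomplete, missing: {missing}")
else:
    print("splits/toy has both split artifacts")
```

For a check of the split's contents against a dataset, use modssc sampling validate as shown in the CLI example.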