7. How to create and reuse sampling splits

This guide shows how to build reproducible semi-supervised learning (SSL) splits that you can reuse across methods and experiments. The steps map to the sampling plan fields, and the examples show both the CLI and Python forms. For end-to-end usage, see the inductive tutorial or the transductive tutorial.

7.1 Problem statement

You need a deterministic train/val/test split with labeled and unlabeled partitions, and you want to reuse it across runs. [1][2][3] Saving the split keeps experiments comparable when you switch methods or rerun trials.
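
To make "deterministic" concrete, here is a minimal standard-library sketch of a seeded, stratified holdout split followed by a labeled/unlabeled partition of the training indices. It is not how modssc implements sampling; holdout_split, its parameters, and the returned keys are hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict


def holdout_split(labels, val_fraction=0.2, labeled_fraction=0.2, seed=42):
    """Hypothetical helper: stratified train/val split, then a labeled/unlabeled train partition."""
    rng = random.Random(seed)  # private RNG, so global random state cannot change the split

    # Group sample indices by class so every class is split proportionally.
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    val, labeled, unlabeled = [], [], []
    for y in sorted(by_class):  # fixed class order keeps the RNG stream reproducible
        idxs = by_class[y][:]
        rng.shuffle(idxs)
        n_val = round(len(idxs) * val_fraction)
        train = idxs[n_val:]
        # Keep at least one labeled sample per class (similar in spirit to min_per_class: 1).
        n_lab = max(1, round(len(train) * labeled_fraction))
        val += idxs[:n_val]
        labeled += train[:n_lab]
        unlabeled += train[n_lab:]

    return {
        "val": sorted(val),
        "train_labeled": sorted(labeled),
        "train_unlabeled": sorted(unlabeled),
    }


labels = [0] * 50 + [1] * 50
split_a = holdout_split(labels, seed=42)
split_b = holdout_split(labels, seed=42)
print(split_a == split_b)  # identical seed and inputs give an identical split
print({name: len(idxs) for name, idxs in split_a.items()})
```

Using a private random.Random(seed) instance rather than the module-level functions is what keeps the result independent of any other code that draws random numbers in the same process.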

7.2 When to use

Use this when you want reproducible SSL splits across methods or when sharing splits with collaborators. [4][3]

7.3 Steps

1) Define a sampling plan (holdout or k-fold, plus labeling strategy). [2]

2) Create the split from a dataset (CLI or Python). [5][1]

3) Save and load the split for reuse. [3]

If you are running the bench runner, embed the plan in the experiment config described in the configuration reference.

7.4 Copy-paste example

Use the CLI (modssc sampling, implemented in src/modssc/cli/sampling.py) when you want to create split artifacts from the terminal. Use Python (API in src/modssc/sampling/api.py) when you want to generate splits inside a pipeline. [5][1]

Sample plan (from bench/configs/experiments/toy_inductive.yaml): [6]

split:
  kind: "holdout"
  test_fraction: 0.0
  val_fraction: 0.2
  stratify: true
  shuffle: true
labeling:
  mode: "fraction"
  value: 0.2
  strategy: "balanced"
  min_per_class: 1
imbalance:
  kind: "none"
policy:
  respect_official_test: true
  allow_override_official: false
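
Before creating a split, it can help to sanity-check the plan values. The sketch below expresses the split and labeling fields of the sample plan as a plain Python dict and applies a few range checks. check_plan is a hypothetical helper, not part of modssc, and its rules are assumptions about what a sensible plan looks like rather than the library's own validation.

```python
# Split and labeling values copied from the sample plan, as a plain dict.
plan = {
    "split": {"kind": "holdout", "test_fraction": 0.0, "val_fraction": 0.2,
              "stratify": True, "shuffle": True},
    "labeling": {"mode": "fraction", "value": 0.2, "strategy": "balanced",
                 "min_per_class": 1},
}


def check_plan(plan):
    """Hypothetical pre-flight check: return a list of problems (empty if none)."""
    problems = []
    split = plan["split"]
    test_f, val_f = split["test_fraction"], split["val_fraction"]

    # Each holdout fraction must leave room for the other partitions.
    for name, frac in (("test_fraction", test_f), ("val_fraction", val_f)):
        if not 0.0 <= frac < 1.0:
            problems.append(f"{name} must be in [0, 1), got {frac}")
    if test_f + val_f >= 1.0:
        problems.append("test_fraction + val_fraction leaves no training data")

    # A fractional labeling budget must select a non-empty share of the data.
    labeling = plan["labeling"]
    if labeling["mode"] == "fraction" and not 0.0 < labeling["value"] <= 1.0:
        problems.append(f"labeling value must be in (0, 1], got {labeling['value']}")
    return problems


print(check_plan(plan))  # an empty list means no problems were found
```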

CLI:

modssc sampling create --dataset toy --plan sampling_plan.yaml --seed 42 --out splits/toy
modssc sampling show splits/toy
modssc sampling validate splits/toy --dataset toy

Python:

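from modssc.data_loader import load_dataset
from modssc.sampling import HoldoutSplitSpec, LabelingSpec, SamplingPlan, sample, save_split

# Load through the data loader so ds.meta carries the dataset fingerprint.
ds = load_dataset("toy", download=True)

# Holdout split with a 0.2 validation fraction; label a 0.2 fraction with the balanced strategy.
plan = SamplingPlan(
    split=HoldoutSplitSpec(test_fraction=0.0, val_fraction=0.2, stratify=True),
    labeling=LabelingSpec(mode="fraction", value=0.2, strategy="balanced"),
)

# Sample with a fixed seed; the second return value is not needed here.
res, _ = sample(ds, plan=plan, seed=42, dataset_fingerprint=str(ds.meta["dataset_fingerprint"]))

# Write the split to splits/toy, replacing any split already stored there.
_ = save_split(res, out_dir="splits/toy", overwrite=True)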

7.5 Pitfalls

Warning

modssc sampling create expects the dataset fingerprint in dataset.meta["dataset_fingerprint"]. The data loader injects this value when it caches a dataset. [7][1]
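
To illustrate why a fingerprint matters for reuse, the sketch below hashes a dataset and refuses to pair a stored split with different data. This is an assumption-laden stand-in: fingerprint and check_split_matches are hypothetical names, and the SHA-256-over-JSON scheme is not necessarily what src/modssc/sampling/fingerprint.py does.

```python
import hashlib
import json


def fingerprint(records):
    """Hypothetical stand-in: SHA-256 over a canonical JSON encoding of the data."""
    payload = json.dumps(records, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def check_split_matches(split_meta, records):
    """Raise ValueError if the split was created from different data than `records`."""
    expected = split_meta["dataset_fingerprint"]
    actual = fingerprint(records)
    if expected != actual:
        raise ValueError(
            f"fingerprint mismatch: split has {expected[:12]}, dataset has {actual[:12]}"
        )


data = [{"x": [0.1, 0.2], "y": 0}, {"x": [0.3, 0.4], "y": 1}]
split_meta = {"dataset_fingerprint": fingerprint(data)}  # recorded when the split is created
check_split_matches(split_meta, data)  # passes silently because the data is unchanged
```

Index-based splits only make sense against the exact data they were drawn from, so a mismatch is better surfaced as an error than silently ignored.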

Tip

The output directory contains the split artifacts split.json and arrays.npz, which are the files load_split expects. [3]
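
When sharing splits, a quick check that both files are present can catch an incomplete copy before loading. The sketch below uses only pathlib; missing_artifacts is a hypothetical helper and does not inspect file contents or replace modssc sampling validate.

```python
from pathlib import Path

REQUIRED_FILES = ("split.json", "arrays.npz")  # file names taken from the tip above


def missing_artifacts(split_dir):
    """Hypothetical helper: list the required split files that are absent from split_dir."""
    root = Path(split_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


missing = missing_artifacts("splits/toy")
if missing:
    print(f"splits/toy is incomplete, missing: {', '.join(missing)}")
else:
    print("splits/toy contains both split artifacts")
```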

Sources
  1. src/modssc/sampling/api.py
  2. src/modssc/sampling/plan.py
  3. src/modssc/sampling/storage.py
  4. src/modssc/sampling/fingerprint.py
  5. src/modssc/cli/sampling.py
  6. bench/configs/experiments/toy_inductive.yaml
  7. src/modssc/data_loader/api.py