4. Inductive tutorial: toy pseudo-label run
This is an end-to-end inductive walkthrough using the toy benchmark config. If you prefer step-by-step brick workflows, use the datasets, sampling, preprocess, and evaluation guides.
4.1 Goal
Run a full inductive SSL experiment on the built-in toy dataset using the benchmark runner and a YAML config. [1][2][3]
4.2 Why this tutorial
Use this tutorial when your method consumes feature matrices and labeled/unlabeled splits (InductiveDataset) and you do not need an explicit graph. If your method expects a graph and node masks, use the transductive tutorial instead. [19][20]
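For intuition, here is a rough sketch of the shape of an inductive input. The real InductiveDataset lives in src/modssc/inductive/types.py [19]; the field names below are illustrative assumptions, not its actual API.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class InductiveDatasetSketch:
    """Illustrative stand-in for an inductive SSL dataset (field names assumed)."""

    X: np.ndarray            # feature matrix, shape (n_samples, n_features)
    y: np.ndarray            # labels; unknown entries masked (e.g. -1)
    labeled_idx: np.ndarray  # indices of labeled training samples
    unlabeled_idx: np.ndarray  # indices of unlabeled training samples
```

The key point is that the method sees features plus labeled/unlabeled index sets, with no graph structure.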
This walkthrough uses the bench runner because it validates a single YAML config and orchestrates the full pipeline (dataset, sampling, preprocess, method, evaluation). For individual bricks, start with the dataset, sampling, preprocess, and evaluation how-to guides instead. [1][9]
4.3 Prerequisites
- Python 3.11+ with ModSSC installed from source (the bench runner is in the repo). [4][5]
- No extra dependencies are required for the toy dataset and the numpy backends used here. [2][6]
4.4 Files used
- Benchmark entry point: bench/main.py
- Experiment config: bench/configs/experiments/toy_inductive.yaml
- Toy dataset definition: src/modssc/data_loader/catalog/toy.py
4.5 Step by step commands
1) Install the repo in editable mode:

```bash
python -m pip install -e "."
```

2) Run the inductive toy experiment:

```bash
python -m bench.main --config bench/configs/experiments/toy_inductive.yaml
```

The bench runner and the example config live in bench/main.py and bench/configs/experiments/toy_inductive.yaml. [1][2]
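If you prefer to launch the run from Python (for example, inside a notebook), the same CLI invocation can be wrapped with subprocess; this is just a thin convenience around the entry point above:

```python
import subprocess

# Same invocation as the shell command above, driven from Python.
subprocess.run(
    ["python", "-m", "bench.main",
     "--config", "bench/configs/experiments/toy_inductive.yaml"],
    check=True,  # raise CalledProcessError if the run exits non-zero
)
```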
4.6 Full YAML config used
This is the full config file from bench/configs/experiments/toy_inductive.yaml:

```yaml
run:
  name: "toy_pseudo_label_numpy"
  seed: 42
  output_dir: "runs"
  fail_fast: true
dataset:
  id: "toy"
sampling:
  seed: 42
  plan:
    split:
      kind: "holdout"
      test_fraction: 0.0
      val_fraction: 0.2
      stratify: true
      shuffle: true
    labeling:
      mode: "fraction"
      value: 0.2
      strategy: "balanced"
      min_per_class: 1
    imbalance:
      kind: "none"
    policy:
      respect_official_test: true
      allow_override_official: false
preprocess:
  seed: 42
  fit_on: "train_labeled"
  cache: true
  plan:
    output_key: "features.X"
    steps:
      - id: "core.ensure_2d"
      - id: "core.to_numpy"
method:
  kind: "inductive"
  id: "pseudo_label"
  device:
    device: "auto"
    dtype: "float32"
  params:
    classifier_id: "knn"
    classifier_backend: "numpy"
    max_iter: 5
    confidence_threshold: 0.8
evaluation:
  split_for_model_selection: "val"
  report_splits: ["val", "test"]
  metrics: ["accuracy", "macro_f1"]
```
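The runner validates this file against the benchmark schema for you [1][9]. For a quick local sanity check before a long run, here is a minimal sketch using PyYAML; the key paths follow the config shown above:

```python
import yaml  # pip install pyyaml

with open("bench/configs/experiments/toy_inductive.yaml") as f:
    cfg = yaml.safe_load(f)

# Spot-check a few values against what this tutorial expects.
assert cfg["method"]["kind"] == "inductive"
assert cfg["method"]["id"] == "pseudo_label"
assert cfg["method"]["params"]["confidence_threshold"] == 0.8
print("run name:", cfg["run"]["name"])
```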
4.7 Expected outputs and where they appear
A new run directory is created under runs/ with:
- config.yaml (config snapshot)
- run.json (metrics and metadata)
- error.txt (if the run fails)
These outputs are written by the bench context and reporting orchestrator. [7][8]
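To inspect the results programmatically, a small sketch that assumes only that run.json is valid JSON; its exact schema is whatever the reporting orchestrator writes [8]:

```python
import json
from pathlib import Path

# Pick the most recently modified run directory under runs/.
latest = max(Path("runs").iterdir(), key=lambda p: p.stat().st_mtime)
report = json.loads((latest / "run.json").read_text())
print(json.dumps(report, indent=2))  # metrics and metadata for the run
```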
4.8 How it works
- bench/main.py loads the YAML, validates it against the schema, and orchestrates each stage. [1][9]
- The toy dataset is loaded via the data loader and cached. [10][3]
- Sampling produces labeled/unlabeled splits from the sampling plan. [11][12]
- Preprocess steps convert raw features into 2D numpy arrays. [13][14][15]
- The pseudo-label method runs with a numpy kNN classifier (see the sketch below). [16][6]
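For intuition, here is a minimal sketch of generic pseudo-labeling (self-training) with a tiny numpy kNN, wired to the same max_iter and confidence_threshold knobs as the config. It illustrates the technique only; the actual implementation lives in src/modssc/inductive/methods/pseudo_label.py with the numpy kNN backend. [16][6]

```python
import numpy as np


def knn_proba(X_tr, y_tr, X, k=5, n_classes=2):
    # Squared Euclidean distance from every query row to every training row.
    d2 = ((X[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=-1)
    nn = np.argsort(d2, axis=1)[:, :k]   # indices of the k nearest neighbours
    votes = y_tr[nn]                     # neighbour labels, shape (n_queries, k)
    # Class probabilities as vote fractions among the k neighbours.
    return np.stack([(votes == c).mean(axis=1) for c in range(n_classes)], axis=1)


def pseudo_label(X_lab, y_lab, X_unl, max_iter=5, confidence_threshold=0.8, k=5):
    n_classes = int(y_lab.max()) + 1
    X_l, y_l, X_u = X_lab.copy(), y_lab.copy(), X_unl.copy()
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        proba = knn_proba(X_l, y_l, X_u, k=k, n_classes=n_classes)
        conf = proba.max(axis=1)
        keep = conf >= confidence_threshold  # only trust confident predictions
        if not keep.any():
            break                            # nothing confident left to absorb
        X_l = np.concatenate([X_l, X_u[keep]])
        y_l = np.concatenate([y_l, proba[keep].argmax(axis=1)])
        X_u = X_u[~keep]
    return X_l, y_l
```

Each iteration fits on the current labeled pool, predicts on the unlabeled pool, and absorbs the predictions whose confidence clears the threshold; a final classifier is then trained on the augmented labeled set.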
4.9 Common pitfalls and troubleshooting
Warning
If the run fails because runs/ already contains a folder with the same name (a timestamp collision), delete the old folder and rerun. The run directory is created with exist_ok=False. [7]
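A quick way to clear a stale directory from Python; the folder name below is an example, so adjust it to the run that actually collides:

```python
import shutil
from pathlib import Path

stale = Path("runs") / "toy_pseudo_label_numpy"  # example name: adjust to yours
if stale.exists():
    shutil.rmtree(stale)  # let the runner recreate it with exist_ok=False
```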
4.10 Related links
Sources
[1] bench/main.py
[2] bench/configs/experiments/toy_inductive.yaml
[3] src/modssc/data_loader/catalog/toy.py
[4] pyproject.toml
[5] bench/README.md
[6] src/modssc/supervised/backends/numpy/knn.py
[7] bench/context.py
[8] bench/orchestrators/reporting.py
[9] bench/schema.py
[10] src/modssc/data_loader/api.py
[11] src/modssc/sampling/api.py
[12] src/modssc/sampling/plan.py
[13] src/modssc/preprocess/plan.py
[14] src/modssc/preprocess/steps/core/ensure_2d.py
[15] src/modssc/preprocess/steps/core/to_numpy.py
[16] src/modssc/inductive/methods/pseudo_label.py
[17] src/modssc/logging.py
[18] src/modssc/cli/app.py
[19] src/modssc/inductive/types.py
[20] src/modssc/transductive/base.py