4. Inductive tutorial: toy pseudo-label run
This is an end-to-end inductive walkthrough using the toy benchmark config. If you prefer step-by-step brick workflows, use the datasets, sampling, preprocess, and evaluation guides.
4.1 Goal
Run a full inductive SSL experiment on the built-in toy dataset using the benchmark runner and a YAML config. [1][2][3]
4.2 Why this tutorial
Use this tutorial when your method consumes feature matrices and labeled/unlabeled splits (InductiveDataset) and you do not need an explicit graph. If your method expects a graph and node masks, use the transductive tutorial instead. [19][20]
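For intuition, here is a rough sketch of the shape of an inductive input. The real InductiveDataset lives in src/modssc/inductive/types.py [19]; the field names below are illustrative assumptions, not its actual API.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class InductiveDatasetSketch:
    """Illustrative stand-in for an inductive SSL dataset (field names assumed)."""

    X: np.ndarray            # feature matrix, shape (n_samples, n_features)
    y: np.ndarray            # labels; unknown entries masked (e.g. -1)
    labeled_idx: np.ndarray  # indices of labeled training samples
    unlabeled_idx: np.ndarray  # indices of unlabeled training samples
```

The key point is that the method sees features plus labeled/unlabeled index sets, with no graph structure.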
This walkthrough uses the bench runner because it validates a single YAML config and orchestrates the full pipeline (dataset, sampling, preprocess, method, evaluation). For individual bricks, start with the dataset, sampling, preprocess, and evaluation how-to guides instead. [1][9]
4.3 Prerequisites
- Python 3.11+ with ModSSC installed from source (the bench runner is in the repo). [4][5]
- No extra dependencies are required for the toy dataset and the numpy backends used here. [2][6]
4.4 Files used
- Benchmark entry point: bench/main.py
- Experiment config: bench/configs/experiments/toy_inductive.yaml
- Toy dataset definition: src/modssc/data_loader/catalog/toy.py
4.5 Step by step commands
1) Install the repo in editable mode:

```bash
python -m pip install -e "."
```

2) Run the inductive toy experiment:

```bash
python -m bench.main --config bench/configs/experiments/toy_inductive.yaml
```

The bench runner and the example config live in bench/main.py and bench/configs/experiments/toy_inductive.yaml. [1][2]
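If you prefer to launch the run from Python (for example, inside a notebook), the same CLI invocation can be wrapped with subprocess; this is just a thin convenience around the entry point above:

```python
import subprocess

# Same invocation as the shell command above, driven from Python.
subprocess.run(
    ["python", "-m", "bench.main",
     "--config", "bench/configs/experiments/toy_inductive.yaml"],
    check=True,  # raise CalledProcessError if the run exits non-zero
)
```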
4.6 Full YAML config used
This is the full config file from bench/configs/experiments/toy_inductive.yaml:

```yaml
run:
  name: "toy_pseudo_label_numpy"
  seed: 42
  output_dir: "runs"
  fail_fast: true
dataset:
  id: "toy"
sampling:
  seed: 42
  plan:
    split:
      kind: "holdout"
      test_fraction: 0.0
      val_fraction: 0.2
      stratify: true
      shuffle: true
    labeling:
      mode: "fraction"
      value: 0.2
      strategy: "balanced"
      min_per_class: 1
    imbalance:
      kind: "none"
    policy:
      respect_official_test: true
      allow_override_official: false
preprocess:
  seed: 42
  fit_on: "train_labeled"
  cache: true
  plan:
    output_key: "features.X"
    steps:
      - id: "core.ensure_2d"
      - id: "core.to_numpy"
method:
  kind: "inductive"
  id: "pseudo_label"
  device:
    device: "auto"
    dtype: "float32"
  params:
    classifier_id: "knn"
    classifier_backend: "numpy"
    max_iter: 5
    confidence_threshold: 0.8
evaluation:
  split_for_model_selection: "val"
  report_splits: ["val", "test"]
  metrics: ["accuracy", "macro_f1"]
```
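The runner validates this file against the benchmark schema for you [1][9]. For a quick local sanity check before a long run, here is a minimal sketch using PyYAML; the key paths follow the config shown above:

```python
import yaml  # pip install pyyaml

with open("bench/configs/experiments/toy_inductive.yaml") as f:
    cfg = yaml.safe_load(f)

# Spot-check a few values against what this tutorial expects.
assert cfg["method"]["kind"] == "inductive"
assert cfg["method"]["id"] == "pseudo_label"
assert cfg["method"]["params"]["confidence_threshold"] == 0.8
print("run name:", cfg["run"]["name"])
```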
4.7 Expected outputs and where they appear
A new run directory is created under runs/ with:
- config.yaml (config snapshot)
- run.json (metrics and metadata)
- error.txt (if the run fails)
These outputs are written by the bench context and reporting orchestrator. [7][8]
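To inspect the results programmatically, a small sketch that assumes only that run.json is valid JSON; its exact schema is whatever the reporting orchestrator writes [8]:

```python
import json
from pathlib import Path

# Pick the most recently modified run directory under runs/.
latest = max(Path("runs").iterdir(), key=lambda p: p.stat().st_mtime)
report = json.loads((latest / "run.json").read_text())
print(json.dumps(report, indent=2))  # metrics and metadata for the run
```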
4.8 How it works
- bench/main.py loads the YAML, validates it against the schema, and orchestrates each stage. [1][9]
- The toy dataset is loaded via the data loader and cached. [10][3]
- Sampling produces labeled/unlabeled splits from the sampling plan. [11][12]
- Preprocess steps convert raw features into 2D numpy arrays. [13][14][15]
- The pseudo-label method runs with a numpy kNN classifier (see the sketch below). [16][6]
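For intuition, here is a minimal sketch of generic pseudo-labeling (self-training) with a tiny numpy kNN, wired to the same max_iter and confidence_threshold knobs as the config. It illustrates the technique only; the actual implementation lives in src/modssc/inductive/methods/pseudo_label.py with the numpy kNN backend. [16][6]

```python
import numpy as np


def knn_proba(X_tr, y_tr, X, k=5, n_classes=2):
    # Squared Euclidean distance from every query row to every training row.
    d2 = ((X[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=-1)
    nn = np.argsort(d2, axis=1)[:, :k]   # indices of the k nearest neighbours
    votes = y_tr[nn]                     # neighbour labels, shape (n_queries, k)
    # Class probabilities as vote fractions among the k neighbours.
    return np.stack([(votes == c).mean(axis=1) for c in range(n_classes)], axis=1)


def pseudo_label(X_lab, y_lab, X_unl, max_iter=5, confidence_threshold=0.8, k=5):
    n_classes = int(y_lab.max()) + 1
    X_l, y_l, X_u = X_lab.copy(), y_lab.copy(), X_unl.copy()
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        proba = knn_proba(X_l, y_l, X_u, k=k, n_classes=n_classes)
        conf = proba.max(axis=1)
        keep = conf >= confidence_threshold  # only trust confident predictions
        if not keep.any():
            break                            # nothing confident left to absorb
        X_l = np.concatenate([X_l, X_u[keep]])
        y_l = np.concatenate([y_l, proba[keep].argmax(axis=1)])
        X_u = X_u[~keep]
    return X_l, y_l
```

Each iteration fits on the current labeled pool, predicts on the unlabeled pool, and absorbs the predictions whose confidence clears the threshold; a final classifier is then trained on the augmented labeled set.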
4.9 Common pitfalls and troubleshooting
Warning
If the run fails because runs/ already contains a folder with the same name (a timestamp collision), delete the old folder and rerun. The run directory is created with exist_ok=False. [7]
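A quick way to clear a stale directory from Python; the folder name below is an example, so adjust it to the run that actually collides:

```python
import shutil
from pathlib import Path

stale = Path("runs") / "toy_pseudo_label_numpy"  # example name: adjust to yours
if stale.exists():
    shutil.rmtree(stale)  # let the runner recreate it with exist_ok=False
```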
4.10 Related links
Sources
[1] bench/main.py
[2] bench/configs/experiments/toy_inductive.yaml
[3] src/modssc/data_loader/catalog/toy.py
[4] pyproject.toml
[5] bench/README.md
[6] src/modssc/supervised/backends/numpy/knn.py
[7] bench/context.py
[8] bench/orchestrators/reporting.py
[9] bench/schema.py
[10] src/modssc/data_loader/api.py
[11] src/modssc/sampling/api.py
[12] src/modssc/sampling/plan.py
[13] src/modssc/preprocess/plan.py
[14] src/modssc/preprocess/steps/core/ensure_2d.py
[15] src/modssc/preprocess/steps/core/to_numpy.py
[16] src/modssc/inductive/methods/pseudo_label.py
[17] src/modssc/logging.py
[18] src/modssc/cli/app.py
[19] src/modssc/inductive/types.py
[20] src/modssc/transductive/base.py