25. Configuration reference

Use this reference when you need the structure of benchmark YAML files or brick-level plans. For runnable examples and execution advice, continue with Benchmarks and the Bench config cookbook.

25.1 What it is for

ModSSC uses YAML configs for the benchmark runner and structured plan/spec files for sampling, preprocess, views, graph construction, augmentation, and HPO. These files are validated before execution, so this page is the place to confirm field names, top-level blocks, and the behavior of advanced switches. [2][3][5][6][7]

25.2 When to use

  • Use this page when editing a benchmark YAML and you need to confirm the allowed shape of a block.
  • Use this page when debugging config-loading failures, unresolved environment variables, or schema errors.
  • Use the how-to guides instead when you want workflow advice rather than field-level reference.

25.3 Minimal examples

Run a benchmark config:

python -m bench.main --config bench/configs/experiments/toy_inductive.yaml

Run a config that uses environment placeholders:

MODSSC_OUTPUT_DIR=/tmp/modssc_runs \
MODSSC_DATASET_CACHE_DIR=/tmp/modssc_cache/datasets \
MODSSC_PREPROCESS_CACHE_DIR=/tmp/modssc_cache/preprocess \
python -m bench.main --config bench/configs/experiments/toy_inductive.yaml

Load and inspect a config in Python:

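from bench.schema import ExperimentConfig
from bench.utils.io import load_yaml

# Read the YAML file, then validate it against the experiment schema.
raw = load_yaml("bench/configs/experiments/toy_inductive.yaml")
cfg = ExperimentConfig.from_dict(raw)

# Inspect the validated method block.
print(cfg.method.kind, cfg.method.method_id)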

The bench entry point, schema, and YAML loader are implemented in bench/main.py, bench/schema.py, and bench/utils/io.py. [8][2][3]

25.4 Where configs live

  • Authored benchmark examples and templates: bench/configs/experiments/. [1]
  • Curated benchmark suites and command listings: bench/configs/best/.
  • Cluster launchers and deployment helpers may also exist under local bench/ subdirectories such as bench/slurm/, but those operational assets are not part of the stable public documentation contract.
  • Brick-level plan files can also live outside bench/, but they must still satisfy the same schema objects described below.

25.5 How configs are loaded

  • Bench configs are read from YAML with bench.utils.io.load_yaml and validated via ExperimentConfig.from_dict. [3][2]
  • Environment variables are expanded inside string values, so placeholders like ${MODSSC_OUTPUT_DIR} or $MODSSC_OUTPUT_DIR are valid in YAML. Unresolved placeholders fail fast with an explicit error instead of being left as literal strings. [3]
  • method.model.factory is an advanced extension hook for trusted configs only. It is rejected unless run.allow_custom_factories: true is set explicitly. [2]
  • Cache locations are also controlled at runtime, independently of YAML placeholders, by the environment variables MODSSC_CACHE_ROOT, MODSSC_CACHE_DIR, MODSSC_PREPROCESS_CACHE_DIR, MODSSC_SPLIT_CACHE_DIR, MODSSC_GRAPH_CACHE_DIR, and MODSSC_GRAPH_VIEWS_CACHE_DIR.
  • CLI plan/spec files use load_yaml_or_json and their own from_dict validators. [4][5][6]
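
The fail-fast placeholder behavior described above can be sketched as follows. This is a minimal illustration and not the code in bench/utils/io.py: the function name expand_env_strict, the regular expression, and the choice of KeyError are all assumptions made for the example.

```python
import os
import re

# Matches ${NAME} or $NAME. The pattern actually used by the loader is an assumption.
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}|\$([A-Za-z_]\w*)")


def expand_env_strict(value, env=None):
    """Recursively expand placeholders in string values; raise on unresolved names."""
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {key: expand_env_strict(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env_strict(item, env) for item in value]
    if not isinstance(value, str):
        return value  # numbers, booleans, and None pass through unchanged

    def _substitute(match):
        name = match.group(1) or match.group(2)
        if name not in env:
            # Fail fast instead of leaving the literal placeholder in the config.
            raise KeyError(f"unresolved environment variable: {name}")
        return env[name]

    return _PLACEHOLDER.sub(_substitute, value)


config = {"run": {"output_dir": "${MODSSC_OUTPUT_DIR}/toy", "seed": 42}}
expanded = expand_env_strict(config, env={"MODSSC_OUTPUT_DIR": "/tmp/modssc_runs"})
print(expanded["run"]["output_dir"])
```

Passing an explicit env mapping keeps the sketch easy to test; with env=None it reads from the process environment.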

25.6 Config blocks at a glance

| Block | What it controls | Required core fields |
| --- | --- | --- |
| run | run identity, output, logging, sweep behavior | name, seed, output_dir, fail_fast |
| dataset | dataset selection and provider options | id |
| sampling | split and labeling plan | seed, plan |
| preprocess | feature pipeline before training | seed, fit_on, cache, plan |
| method | inductive or transductive method selection | kind, id, device, params |
| evaluation | metrics and report splits | report_splits, metrics |
| graph | graph construction for transductive workflows | optional block |
| views | multi-view feature generation | optional block |
| augmentation | weak/strong training-time augmentation | optional block |
| search | HPO search procedure and space | optional block |
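
As a quick pre-flight check before handing a config to the runner, the required core fields listed above can be compared against a loaded mapping. The sketch below is a stand-alone helper, not part of bench/schema.py; the name missing_core_fields is hypothetical, and the real validator may report errors differently.

```python
# Required core fields per block, copied from the overview above.
REQUIRED_CORE_FIELDS = {
    "run": ("name", "seed", "output_dir", "fail_fast"),
    "dataset": ("id",),
    "sampling": ("seed", "plan"),
    "preprocess": ("seed", "fit_on", "cache", "plan"),
    "method": ("kind", "id", "device", "params"),
    "evaluation": ("report_splits", "metrics"),
}


def missing_core_fields(config):
    """Return dotted names of required blocks or fields absent from a config mapping."""
    problems = []
    for block, fields in REQUIRED_CORE_FIELDS.items():
        section = config.get(block)
        if not isinstance(section, dict):
            problems.append(block)  # the whole block is missing or malformed
            continue
        problems.extend(f"{block}.{name}" for name in fields if name not in section)
    return problems


partial = {"run": {"name": "demo", "seed": 1}, "dataset": {"id": "toy"}}
print(missing_core_fields(partial))
```

An empty list means every required block and core field is present; it says nothing about value types, which remain the job of the real schema.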

25.7 Experiment config schema (bench)

Top-level keys and their required fields are defined in bench/schema.py:

  • run: name, seed, output_dir, fail_fast, optional log_level, optional seeds, optional benchmark_mode, optional allow_custom_factories. [2]
  • dataset: id, optional options, download, cache_dir. [2]
  • sampling: seed, plan. [2]
  • preprocess: seed, fit_on, cache, plan. [2]
  • method: kind (inductive or transductive), id, device, params, optional model. model accepts classifier settings or an advanced factory + params pair when run.allow_custom_factories=true. [2]
  • evaluation: report_splits, metrics, optional split_for_model_selection. [2]
  • Optional blocks: graph, views, augmentation, search. [2]
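
The gating of method.model.factory behind run.allow_custom_factories can be pictured as a small guard. This is a sketch under assumptions: the function name check_custom_factory and the ValueError are illustrative, and the example factory path is a made-up placeholder rather than a real ModSSC import path.

```python
def check_custom_factory(config):
    """Reject method.model.factory unless run.allow_custom_factories is exactly True."""
    run = config.get("run") or {}
    model = (config.get("method") or {}).get("model") or {}
    if "factory" in model and run.get("allow_custom_factories") is not True:
        raise ValueError(
            "method.model.factory requires run.allow_custom_factories: true"
        )


trusted = {
    "run": {"allow_custom_factories": True},
    # "my_pkg.models:build" is a hypothetical factory reference for illustration only.
    "method": {"model": {"factory": "my_pkg.models:build", "params": {}}},
}
check_custom_factory(trusted)  # passes silently because the flag is set
```

Comparing with `is not True` means truthy strings such as "yes" do not unlock the hook in this sketch; whether the real schema is that strict is not confirmed here.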

25.8 Sampling plan schema

Sampling plans are validated by SamplingPlan.from_dict and include:

  • split: kind (holdout or kfold) plus its parameters. [9][10][11][12][13][14][5]
  • labeling: mode (fraction, count, per_class), value, strategy, min_per_class, optional fixed_indices. [5]
  • imbalance: kind (none, subsample_max_per_class, long_tail) and its parameters. [5]
  • policy: respect_official_test, use_official_graph_masks, allow_override_official. [5]

When allow_override_official=true on inductive datasets, provider test partitions are ignored and the user split is applied to dataset.train. See the glossary if these policy names are still unfamiliar.
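
To make the labeling fields more concrete, here is one possible reading of mode: fraction combined with min_per_class. It is a sketch only: it draws the same fraction from every class and then enforces the per-class floor, which may differ from what the real strategy values (such as balanced) do in src/modssc/sampling/plan.py. The function name pick_labeled_indices is hypothetical.

```python
import random
from collections import defaultdict


def pick_labeled_indices(labels, fraction, min_per_class=1, seed=0):
    """Choose labeled sample indices class by class (assumed semantics, see lead-in)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        # Take the requested fraction, but never fewer than min_per_class
        # and never more than the class actually contains.
        target = max(min_per_class, round(fraction * len(members)))
        target = min(target, len(members))
        chosen.extend(rng.sample(members, target))
    return sorted(chosen)


labels = [0] * 10 + [1] * 5
print(pick_labeled_indices(labels, fraction=0.2, min_per_class=1, seed=42))
```

Seeding a local random.Random keeps the selection reproducible for a given seed, which mirrors the role of the seed field in the sampling block.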

25.9 Preprocess plan schema

A preprocess plan YAML contains:

  • output_key. [15][7]
  • steps: list of step mappings with id, params, optional modalities, requires_fields, enabled. [7]

Available step IDs and their metadata are listed in the built-in catalog. [16]

25.10 Views plan schema

Views plans define multiple feature views:

  • views: list of view definitions with name, optional preprocess, columns, and meta. [17]
  • columns supports all, indices, random, and complement. [17]
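
The four columns modes can be illustrated with a small resolver. Only the mode names come from this page; the parameter names (indices, fraction, exclude, seed) and the function resolve_columns are hypothetical and are not taken from src/modssc/views/plan.py.

```python
import random


def resolve_columns(mode, n_features, *, indices=None, fraction=0.5, exclude=None, seed=0):
    """Turn a column selector into a sorted list of feature indices (illustrative)."""
    all_columns = list(range(n_features))
    if mode == "all":
        return all_columns
    if mode == "indices":
        return sorted(indices)
    if mode == "random":
        rng = random.Random(seed)
        count = max(1, round(fraction * n_features))
        return sorted(rng.sample(all_columns, count))
    if mode == "complement":
        taken = set(exclude)  # columns already used by another view
        return [column for column in all_columns if column not in taken]
    raise ValueError(f"unknown columns mode: {mode}")


view_a = resolve_columns("random", 10, fraction=0.3, seed=1)
view_b = resolve_columns("complement", 10, exclude=view_a)
print(view_a, view_b)
```

Pairing random with complement, as in the usage lines, splits the feature space into two disjoint views, which is a common setup for co-training style methods.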

25.11 Graph builder spec schema

Graph specs match GraphBuilderSpec:

  • scheme: knn, epsilon, or anchor
  • metric: cosine or euclidean
  • k or radius depending on scheme
  • symmetrize, weights, normalize, self_loops
  • backend: auto, numpy, sklearn, or faiss
  • anchor-specific and faiss-specific parameters

All validation rules are defined in src/modssc/graph/specs.py. [6]
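
To show how several of these fields interact, the following pure-Python sketch builds a small graph with scheme knn, metric euclidean, mutual symmetrization, heat-kernel weights, and random-walk (rw) normalization. It is not the ModSSC builder: the function name knn_graph is hypothetical, the heat kernel form exp(-d^2 / (2 sigma^2)) is an assumed convention, and self-loops are left out for brevity.

```python
import math


def knn_graph(points, k, sigma=1.0):
    """Dense row-normalized weight matrix for a mutual kNN graph (illustrative)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbours of each node, excluding the node itself.
    neighbours = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(set(order[:k]))

    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbours[i]:
            if i in neighbours[j]:  # "mutual": keep the edge only if both sides agree
                weights[i][j] = math.exp(-dist[i][j] ** 2 / (2.0 * sigma ** 2))

    # "rw" normalization: divide each row by its degree so rows sum to 1.
    for row in weights:
        degree = sum(row)
        if degree > 0:
            row[:] = [w / degree for w in row]
    return weights


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
for row in knn_graph(points, k=1):
    print(row)
```

With mutual symmetrization a node can end up isolated when none of its neighbours list it back, which is why the normalization step guards against a zero degree.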

25.12 Augmentation plan schema

Augmentation plans include:

  • steps: list of ops with id and params. [18][19][2][20]
  • modality: optional modality hint

Augmentation ops are registered in src/modssc/data_augmentation/ops/. [21][22]
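
The steps list can be read as a pipeline of registered ops applied in order. The sketch below mimics that idea with a plain dictionary; the op ids and parameter names (std, p) match the augmentation example on this page, but the implementations, the OPS mapping, and apply_steps are assumptions and not the code under src/modssc/data_augmentation/.

```python
import random


def gaussian_noise(row, rng, std=0.01):
    """Add zero-mean Gaussian noise to every feature."""
    return [x + rng.gauss(0.0, std) for x in row]


def feature_dropout(row, rng, p=0.2):
    """Zero out each feature independently with probability p."""
    return [0.0 if rng.random() < p else x for x in row]


# Stand-in for the op registry: maps an op id to a callable.
OPS = {
    "tabular.gaussian_noise": gaussian_noise,
    "tabular.feature_dropout": feature_dropout,
}


def apply_steps(row, steps, seed=0):
    """Apply each step in order, sharing one seeded random generator."""
    rng = random.Random(seed)
    for step in steps:
        op = OPS[step["id"]]
        row = op(row, rng, **step.get("params", {}))
    return row


weak = [{"id": "tabular.gaussian_noise", "params": {"std": 0.01}}]
strong = [{"id": "tabular.feature_dropout", "params": {"p": 0.2}}]
sample = [1.0, 2.0, 3.0]
print(apply_steps(sample, weak, seed=2))
print(apply_steps(sample, strong, seed=2))
```

Keeping weak and strong as separate step lists matches the layout of the augmentation block, where each branch carries its own steps.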

25.13 Search (HPO) schema

The bench search block includes:

  • enabled, kind (grid or random), seed, n_trials, repeats
  • objective: split, metric, direction, aggregate
  • space: nested mapping of method.params.* to lists or distributions

Validation rules are enforced by bench/schema.py, and distributions are defined in src/modssc/hpo/samplers.py. [2][23]
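
For kind: grid, a nested space of lists expands into the Cartesian product of its values. The sketch below covers only that list case and does not handle distributions; the helper names flatten_space and grid_trials are hypothetical, and the dotted-path output format is an illustrative choice rather than the runner's internal representation.

```python
import itertools


def flatten_space(space, prefix=""):
    """Flatten a nested mapping into {"method.params.name": [values, ...]}."""
    flat = {}
    for key, value in space.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_space(value, path))
        else:
            flat[path] = list(value)
    return flat


def grid_trials(space):
    """Yield one override mapping per point of the grid, in a stable order."""
    flat = flatten_space(space)
    names = sorted(flat)
    for combo in itertools.product(*(flat[name] for name in names)):
        yield dict(zip(names, combo))


space = {
    "method": {
        "params": {
            "max_iter": [5, 10],
            "confidence_threshold": [0.8, 0.9, 0.95],
        }
    }
}
for trial in grid_trials(space):
    print(trial)
```

The grid size is the product of the list lengths (here 2 x 3 = 6 trials), so it grows quickly as parameters are added; that growth is the usual reason to switch to kind: random with a fixed n_trials.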

25.14 Common mistakes

  • Using a field name that exists in Python objects but not in YAML. The schema names here are the contract for configs.
  • Leaving ${VAR_NAME} unresolved in a config and expecting the loader to ignore it.
  • Treating method.model.factory as a normal extension point for shared configs. It is intentionally gated behind run.allow_custom_factories: true.
  • Forgetting that fit_on changes the fitted preprocess state and therefore the cache fingerprint.
  • Mixing “official test split” semantics with “custom split” semantics without checking the sampling policy flags.
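
The fit_on point is easier to see with a toy fingerprint. The sketch below hashes a canonical JSON dump of the plan together with fit_on and seed; the actual inputs and hash used by the preprocess cache are not documented here, so the function preprocess_fingerprint and its arguments are assumptions, and "some_other_subset" is a placeholder rather than a known fit_on value.

```python
import hashlib
import json


def preprocess_fingerprint(plan, fit_on, seed):
    """Hash everything that influences the fitted preprocess state (illustrative)."""
    payload = json.dumps(
        {"plan": plan, "fit_on": fit_on, "seed": seed},
        sort_keys=True,           # key order must not change the fingerprint
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


plan = {
    "output_key": "features.X",
    "steps": [{"id": "core.ensure_2d"}, {"id": "core.to_numpy"}],
}
first = preprocess_fingerprint(plan, fit_on="train_labeled", seed=42)
# "some_other_subset" is a placeholder value used only to show the effect.
second = preprocess_fingerprint(plan, fit_on="some_other_subset", seed=42)
print(first != second)
```

Because fit_on is part of the hashed payload, changing it yields a different cache entry even when the plan steps are identical.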

25.15 Complete example configs

Toy inductive experiment: [24]

run:
  name: "toy_pseudo_label_numpy"
  seed: 42
  output_dir: "runs"
  fail_fast: true

dataset:
  id: "toy"

sampling:
  seed: 42
  plan:
    split:
      kind: "holdout"
      test_fraction: 0.0
      val_fraction: 0.2
      stratify: true
      shuffle: true
    labeling:
      mode: "fraction"
      value: 0.2
      strategy: "balanced"
      min_per_class: 1
    imbalance:
      kind: "none"
    policy:
      respect_official_test: true
      allow_override_official: false

preprocess:
  seed: 42
  fit_on: "train_labeled"
  cache: true
  plan:
    output_key: "features.X"
    steps:
      - id: "core.ensure_2d"
      - id: "core.to_numpy"

method:
  kind: "inductive"
  id: "pseudo_label"
  device:
    device: "auto"
    dtype: "float32"
  params:
    classifier_id: "knn"
    classifier_backend: "numpy"
    max_iter: 5
    confidence_threshold: 0.8

evaluation:
  split_for_model_selection: "val"
  report_splits: ["val", "test"]
  metrics: ["accuracy", "macro_f1"]

Toy transductive experiment: [25]

run:
  name: "toy_label_propagation_knn"
  seed: 7
  output_dir: "runs"
  fail_fast: true

dataset:
  id: "toy"

sampling:
  seed: 7
  plan:
    split:
      kind: "holdout"
      test_fraction: 0.0
      val_fraction: 0.2
      stratify: true
      shuffle: true
    labeling:
      mode: "fraction"
      value: 0.1
      strategy: "balanced"
      min_per_class: 1
    imbalance:
      kind: "none"
    policy:
      respect_official_test: true
      allow_override_official: false

preprocess:
  seed: 7
  fit_on: "train_labeled"
  cache: true
  plan:
    output_key: "features.X"
    steps:
      - id: "core.ensure_2d"
      - id: "core.to_numpy"

graph:
  enabled: true
  seed: 7
  cache: true
  spec:
    scheme: "knn"
    metric: "euclidean"
    k: 8
    symmetrize: "mutual"
    weights:
      kind: "heat"
      sigma: 1.0
    normalize: "rw"
    self_loops: true
    backend: "numpy"
    chunk_size: 128
    feature_field: "features.X"

method:
  kind: "transductive"
  id: "label_propagation"
  device:
    device: "auto"
    dtype: "float32"
  params:
    max_iter: 50
    tol: 1.0e-4
    normalize_rows: true

evaluation:
  report_splits: ["val", "test"]
  metrics: ["accuracy", "macro_f1"]

Example with augmentation: [26]

run:
  name: best_text_inductive_softmatch_ag_news
  seed: 2
  output_dir: runs/inductive/softmatch/text/ag_news
  log_level: detailed
  fail_fast: true
dataset:
  id: ag_news
  download: true
  options:
    text_column: text
    label_column: label
    prefer_test_split: true
sampling:
  seed: 2
  plan:
    split:
      kind: holdout
      test_fraction: 0.2
      val_fraction: 0.1
      stratify: true
      shuffle: true
    labeling:
      mode: fraction
      value: 0.2
      strategy: balanced
      min_per_class: 1
    imbalance:
      kind: none
    policy:
      respect_official_test: true
      use_official_graph_masks: true
      allow_override_official: false
preprocess:
  seed: 2
  fit_on: train_labeled
  cache: true
  plan:
    output_key: features.X
    steps:
    - id: labels.encode
    - id: text.ensure_strings
    - id: text.sentence_transformer
      params:
        batch_size: 64
    - id: core.pca
      params:
        n_components: 128
    - id: core.to_torch
      params:
        device: "auto"
        dtype: float32
augmentation:
  enabled: true
  seed: 2
  mode: fixed
  modality: tabular
  weak:
    steps:
    - id: tabular.gaussian_noise
      params:
        std: 0.01
  strong:
    steps:
    - id: tabular.feature_dropout
      params:
        p: 0.2
method:
  kind: inductive
  id: softmatch
  device:
    device: "auto"
    dtype: float32
  params:
    lambda_u: 1.0
    temperature: 0.5
    ema_p: 0.999
    n_sigma: 2.0
    per_class: false
    dist_align: true
    dist_uniform: true
    hard_label: true
    use_cat: false
    batch_size: 128
    max_epochs: 50
    detach_target: true
  model:
    classifier_id: mlp
    classifier_backend: torch
    classifier_params:
      hidden_sizes:
      - 128
      activation: relu
      dropout: 0.1
      lr: 0.001
      weight_decay: 0.0
      batch_size: 256
      max_epochs: 50
    ema: false
evaluation:
  split_for_model_selection: val
  report_splits:
  - val
  - test
  metrics:
  - accuracy
  - macro_f1

Sources

  1. bench/configs/experiments/
  2. bench/schema.py
  3. bench/utils/io.py
  4. src/modssc/cli/_utils.py
  5. src/modssc/sampling/plan.py
  6. src/modssc/graph/specs.py
  7. src/modssc/preprocess/plan.py
  8. bench/main.py
  9. src/modssc/sampling/plan.py
  10. src/modssc/sampling/plan.py
  11. src/modssc/sampling/plan.py
  12. src/modssc/sampling/plan.py
  13. src/modssc/sampling/plan.py
  14. src/modssc/sampling/plan.py
  15. src/modssc/preprocess/plan.py
  16. src/modssc/preprocess/catalog.py
  17. src/modssc/views/plan.py
  18. src/modssc/data_augmentation/plan.py
  19. src/modssc/data_augmentation/registry.py
  20. src/modssc/data_augmentation/ops/
  21. src/modssc/data_augmentation/registry.py
  22. src/modssc/data_augmentation/ops/
  23. src/modssc/hpo/samplers.py
  24. bench/configs/experiments/toy_inductive.yaml
  25. bench/configs/experiments/toy_transductive.yaml
  26. bench/configs/experiments/minimal/inductive/pseudo_label/text/imdb.yaml