25. Configuration reference¶
Use this reference when you need the structure of benchmark YAML files or brick-level plans. For runnable examples and execution advice, continue with Benchmarks and the Bench config cookbook.
25.1 What it is for¶
ModSSC uses YAML configs for the benchmark runner and structured plan/spec files for sampling, preprocess, views, graph construction, augmentation, and HPO. These files are validated before execution, so this page is the place to confirm field names, top-level blocks, and the behavior of advanced switches. [2][3][5][6][7]
25.2 When to use¶
- Use this page when editing a benchmark YAML and you need to confirm the allowed shape of a block.
- Use this page when debugging config-loading failures, unresolved environment variables, or schema errors.
- Use the how-to guides instead when you want workflow advice rather than field-level reference.
25.3 Minimal examples¶
Run a benchmark config:
python -m bench.main --config bench/configs/experiments/toy_inductive.yaml
Run a config that uses environment placeholders:
MODSSC_OUTPUT_DIR=/tmp/modssc_runs \
MODSSC_DATASET_CACHE_DIR=/tmp/modssc_cache/datasets \
MODSSC_PREPROCESS_CACHE_DIR=/tmp/modssc_cache/preprocess \
python -m bench.main --config bench/configs/experiments/toy_inductive.yaml
Load and inspect a config in Python:
from bench.schema import ExperimentConfig
from bench.utils.io import load_yaml
cfg = ExperimentConfig.from_dict(load_yaml("bench/configs/experiments/toy_inductive.yaml"))
print(cfg.method.kind, cfg.method.method_id)
The bench entry point, schema, and YAML loader are implemented in bench/main.py, bench/schema.py, and bench/utils/io.py. [8][2][3]
25.4 Where configs live¶
- Authored benchmark examples and templates: bench/configs/experiments/. [1]
- Curated benchmark suites and command listings: bench/configs/best/.
- Cluster launchers and deployment helpers may also exist under local bench/ subdirectories such as bench/slurm/, but those operational assets are not part of the stable public documentation contract.
- Brick-level plan files can also live outside bench/, but they must still satisfy the same schema objects described below.
25.5 How configs are loaded¶
- Bench configs are read from YAML with bench.utils.io.load_yaml and validated via ExperimentConfig.from_dict. [3][2]
- Environment variables are expanded inside string values, so placeholders like ${MODSSC_OUTPUT_DIR} or $MODSSC_OUTPUT_DIR are valid in YAML. Unresolved placeholders fail fast with an explicit error instead of being left as literal strings. [3]
- method.model.factory is an advanced extension hook for trusted configs only. It is rejected unless run.allow_custom_factories: true is set explicitly. [2]
- Runtime cache environment variables outside YAML placeholders are MODSSC_CACHE_ROOT, MODSSC_CACHE_DIR, MODSSC_PREPROCESS_CACHE_DIR, MODSSC_SPLIT_CACHE_DIR, MODSSC_GRAPH_CACHE_DIR, and MODSSC_GRAPH_VIEWS_CACHE_DIR.
- CLI plan/spec files use load_yaml_or_json and their own from_dict validators. [4][5][6]
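The fail-fast placeholder behavior described above can be pictured with a small stand-alone sketch. It is not the code in bench/utils/io.py: the function name expand_placeholders, the regular expression, and the choice of KeyError are assumptions made for illustration only.

```python
import os
import re

# Matches ${NAME} or $NAME. This pattern is an illustrative assumption,
# not the one used by bench.utils.io.
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}|\$(\w+)")


def expand_placeholders(value, env=None):
    """Recursively expand placeholders in string values; raise on unresolved names."""
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {key: expand_placeholders(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_placeholders(item, env) for item in value]
    if not isinstance(value, str):
        return value  # numbers, booleans, and None pass through untouched

    def _substitute(match):
        name = match.group(1) or match.group(2)
        if name not in env:
            # Fail fast instead of leaving the literal placeholder in place.
            raise KeyError(f"unresolved environment variable: {name}")
        return env[name]

    return _PLACEHOLDER.sub(_substitute, value)


cfg = {"run": {"name": "demo", "seed": 42, "output_dir": "${MODSSC_OUTPUT_DIR}/demo"}}
expanded = expand_placeholders(cfg, env={"MODSSC_OUTPUT_DIR": "/tmp/modssc_runs"})
print(expanded["run"]["output_dir"])  # /tmp/modssc_runs/demo
```

Passing an explicit env mapping keeps the sketch testable without touching the real process environment.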
25.6 Config blocks at a glance¶
| Block | What it controls | Required core fields |
|---|---|---|
| run | run identity, output, logging, sweep behavior | name, seed, output_dir, fail_fast |
| dataset | dataset selection and provider options | id |
| sampling | split and labeling plan | seed, plan |
| preprocess | feature pipeline before training | seed, fit_on, cache, plan |
| method | inductive or transductive method selection | kind, id, device, params |
| evaluation | metrics and report splits | report_splits, metrics |
| graph | graph construction for transductive workflows | optional block |
| views | multi-view feature generation | optional block |
| augmentation | weak/strong training-time augmentation | optional block |
| search | HPO search procedure and space | optional block |
25.7 Experiment config schema (bench)¶
Top-level keys and their required fields are defined in bench/schema.py:
- run: name, seed, output_dir, fail_fast, optional log_level, optional seeds, optional benchmark_mode, optional allow_custom_factories. [2]
- dataset: id, optional options, download, cache_dir. [2]
- sampling: seed, plan. [2]
- preprocess: seed, fit_on, cache, plan. [2]
- method: kind (inductive or transductive), id, device, params, optional model. model accepts classifier settings or an advanced factory + params pair when run.allow_custom_factories=true. [2]
- evaluation: report_splits, metrics, optional split_for_model_selection. [2]
- Optional blocks: graph, views, augmentation, search. [2]
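When drafting a config by hand, a quick pre-flight check of the required fields can catch omissions before the real validator runs. The sketch below only mirrors the required fields listed on this page. The helper name missing_fields is hypothetical, and the check is no substitute for ExperimentConfig.from_dict, which remains the authority on what is accepted.

```python
# Required core fields per block, copied from the list on this page.
REQUIRED_FIELDS = {
    "run": ("name", "seed", "output_dir", "fail_fast"),
    "dataset": ("id",),
    "sampling": ("seed", "plan"),
    "preprocess": ("seed", "fit_on", "cache", "plan"),
    "method": ("kind", "id", "device", "params"),
    "evaluation": ("report_splits", "metrics"),
}


def missing_fields(config):
    """Return sorted 'block' or 'block.field' paths absent from a config mapping."""
    problems = []
    for block, fields in REQUIRED_FIELDS.items():
        section = config.get(block)
        if not isinstance(section, dict):
            problems.append(block)  # the whole block is missing or malformed
            continue
        problems.extend(f"{block}.{name}" for name in fields if name not in section)
    return sorted(problems)


draft = {
    "run": {"name": "demo", "seed": 1, "output_dir": "runs"},
    "dataset": {"id": "toy"},
}
print(missing_fields(draft))
# ['evaluation', 'method', 'preprocess', 'run.fail_fast', 'sampling']
```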
25.8 Sampling plan schema¶
Sampling plans are validated by SamplingPlan.from_dict and include:
- split: kind (holdout or kfold) plus its parameters [9][10][11][12][13][14][5]
- labeling: mode (fraction, count, per_class), value, strategy, min_per_class, optional fixed_indices [5]
- imbalance: kind (none, subsample_max_per_class, long_tail) and its parameters [5]
- policy: respect_official_test, use_official_graph_masks, allow_override_official [5]
When allow_override_official=true on inductive datasets, provider test partitions are ignored and the user split is applied to dataset.train. See the glossary if these policy names are still unfamiliar.
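The enumerated values above lend themselves to a simple lookup check. The following sketch works on a plain dict shaped like the plan block of a bench config. The function name check_sampling_plan is hypothetical, treating policy flags as booleans is an assumption based on the example configs, and SamplingPlan.from_dict may enforce additional rules not shown here.

```python
# Allowed values as listed on this page for the sampling plan.
ALLOWED_VALUES = {
    ("split", "kind"): {"holdout", "kfold"},
    ("labeling", "mode"): {"fraction", "count", "per_class"},
    ("imbalance", "kind"): {"none", "subsample_max_per_class", "long_tail"},
}
POLICY_FLAGS = ("respect_official_test", "use_official_graph_masks", "allow_override_official")


def check_sampling_plan(plan):
    """Return a list of human-readable problems found in a sampling plan mapping."""
    errors = []
    for (section, key), allowed in ALLOWED_VALUES.items():
        value = plan.get(section, {}).get(key)
        if value is not None and value not in allowed:
            errors.append(f"{section}.{key}: {value!r} is not one of {sorted(allowed)}")
    for flag, value in plan.get("policy", {}).items():
        if flag not in POLICY_FLAGS:
            errors.append(f"policy.{flag}: unknown flag")
        elif not isinstance(value, bool):  # assumption: flags are true/false
            errors.append(f"policy.{flag}: expected true or false")
    return errors


good_plan = {
    "split": {"kind": "holdout", "test_fraction": 0.0, "val_fraction": 0.2},
    "labeling": {"mode": "fraction", "value": 0.2, "strategy": "balanced"},
    "imbalance": {"kind": "none"},
    "policy": {"respect_official_test": True, "allow_override_official": False},
}
bad_plan = {"split": {"kind": "hold_out"}, "policy": {"respect_official": True}}

print(check_sampling_plan(good_plan))  # []
for problem in check_sampling_plan(bad_plan):
    print(problem)
```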
25.9 Preprocess plan schema¶
A preprocess plan YAML contains:
- output_key [15][7]
- steps: list of step mappings with id, params, optional modalities, requires_fields, enabled [7]
Available step IDs and their metadata are listed in the built-in catalog. [16]
25.10 Views plan schema¶
Views plans define multiple feature views:
- views: list of view definitions with name, optional preprocess, columns, and meta [17]
- columns supports all, indices, random, and complement [17]
25.11 Graph builder spec schema¶
Graph specs match GraphBuilderSpec:
- scheme: knn, epsilon, or anchor
- metric: cosine or euclidean
- k or radius depending on scheme
- symmetrize, weights, normalize, self_loops
- backend: auto, numpy, sklearn, or faiss
- anchor-specific and faiss-specific parameters
All validation rules are defined in src/modssc/graph/specs.py. [6]
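To make the scheme, metric, k, and symmetrize fields more concrete, here is a pure-Python sketch of a kNN graph with optional mutual symmetrization. It illustrates the general technique only. How ModSSC's builders interpret each symmetrize value is not described on this page, so the mutual flag below (keep an edge only when both endpoints list each other as neighbors) is an assumption, and the function names are hypothetical.

```python
import math


def distance(a, b, metric):
    """Distance between two points under the two metrics named in the spec."""
    if metric == "euclidean":
        return math.dist(a, b)
    if metric == "cosine":
        dot = sum(x * y for x, y in zip(a, b))
        return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))
    raise ValueError(f"unsupported metric: {metric}")


def knn_edges(points, k, metric="euclidean", mutual=False):
    """Return sorted undirected edges (i, j) with i < j for a kNN graph."""
    n = len(points)
    neighbors = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: distance(points[i], points[j], metric))
        neighbors.append(set(others[:k]))
    edges = set()
    for i in range(n):
        for j in neighbors[i]:
            # mutual=True keeps the edge only if j also selected i.
            if not mutual or i in neighbors[j]:
                edges.add((min(i, j), max(i, j)))
    return sorted(edges)


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
print(knn_edges(points, k=1))               # [(0, 1), (1, 2)]
print(knn_edges(points, k=1, mutual=True))  # [(0, 1)]
```

In the example, point 2 picks point 1 as its nearest neighbor, but point 1 picks point 0, so the edge (1, 2) survives only without the mutual filter.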
25.12 Augmentation plan schema¶
Augmentation plans include:
- steps: list of ops with id and params [18][19][2][20]
- modality: optional modality hint
Augmentation ops are registered in src/modssc/data_augmentation/ops/. [21][22]
25.13 Search (HPO) schema¶
The bench search block includes:
- enabled, kind (grid or random), seed, n_trials, repeats
- objective: split, metric, direction, aggregate
- space: nested mapping of method.params.* to lists or distributions
Validation rules are enforced by bench/schema.py, and distributions are defined in src/modssc/hpo/samplers.py. [2][23]
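For kind: grid, a nested space of candidate lists expands into the Cartesian product of all values. The sketch below shows that expansion with itertools.product. The helper names are hypothetical, the parameter names come from the toy inductive example on this page, and the sketch handles list-valued entries only, not the distribution objects used by random search.

```python
import itertools


def flatten_space(space, prefix=""):
    """Flatten a nested mapping into {'dotted.path': [candidate values]}."""
    flat = {}
    for key, value in space.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_space(value, path))
        else:
            flat[path] = list(value)
    return flat


def grid_trials(space):
    """Yield one {'dotted.path': value} mapping per grid point."""
    flat = flatten_space(space)
    paths = sorted(flat)  # fixed order keeps trial enumeration reproducible
    for combo in itertools.product(*(flat[path] for path in paths)):
        yield dict(zip(paths, combo))


space = {
    "method": {
        "params": {
            "max_iter": [5, 10],
            "confidence_threshold": [0.7, 0.8, 0.9],
        }
    }
}
trials = list(grid_trials(space))
print(len(trials))  # 6
print(trials[0])
# {'method.params.confidence_threshold': 0.7, 'method.params.max_iter': 5}
```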
25.14 Common mistakes¶
- Using a field name that exists in Python objects but not in YAML. The schema names here are the contract for configs.
- Leaving ${VAR_NAME} unresolved in a config and expecting the loader to ignore it.
- Treating method.model.factory as a normal extension point for shared configs. It is intentionally gated behind run.allow_custom_factories: true.
- Forgetting that fit_on changes the fitted preprocess state and therefore the cache fingerprint.
- Mixing “official test split” semantics with “custom split” semantics without checking the sampling policy flags.
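The point about fit_on and the cache fingerprint is easier to see with a toy hash. The sketch below is not ModSSC's fingerprinting code: the fingerprint function, the use of SHA-256 over canonical JSON, and the alternative fit_on value are all assumptions chosen to illustrate why changing one field invalidates a cached result.

```python
import hashlib
import json


def fingerprint(block):
    """Hash a config block via canonical JSON so key order does not matter."""
    payload = json.dumps(block, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


base = {
    "seed": 42,
    "fit_on": "train_labeled",
    "cache": True,
    "plan": {"output_key": "features.X", "steps": [{"id": "core.ensure_2d"}]},
}
# "some_other_subset" is a hypothetical placeholder, not a documented fit_on value.
changed = dict(base, fit_on="some_other_subset")

print(fingerprint(base) == fingerprint(changed))  # False
```

Any scheme that hashes the full preprocess block behaves this way, so a cached artifact fitted on one subset is never silently reused for another.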
25.15 Complete example configs¶
Toy inductive experiment: [24]
run:
  name: "toy_pseudo_label_numpy"
  seed: 42
  output_dir: "runs"
  fail_fast: true
dataset:
  id: "toy"
sampling:
  seed: 42
  plan:
    split:
      kind: "holdout"
      test_fraction: 0.0
      val_fraction: 0.2
      stratify: true
      shuffle: true
    labeling:
      mode: "fraction"
      value: 0.2
      strategy: "balanced"
      min_per_class: 1
    imbalance:
      kind: "none"
    policy:
      respect_official_test: true
      allow_override_official: false
preprocess:
  seed: 42
  fit_on: "train_labeled"
  cache: true
  plan:
    output_key: "features.X"
    steps:
      - id: "core.ensure_2d"
      - id: "core.to_numpy"
method:
  kind: "inductive"
  id: "pseudo_label"
  device:
    device: "auto"
    dtype: "float32"
  params:
    classifier_id: "knn"
    classifier_backend: "numpy"
    max_iter: 5
    confidence_threshold: 0.8
evaluation:
  split_for_model_selection: "val"
  report_splits: ["val", "test"]
  metrics: ["accuracy", "macro_f1"]
Toy transductive experiment: [25]
run:
  name: "toy_label_propagation_knn"
  seed: 7
  output_dir: "runs"
  fail_fast: true
dataset:
  id: "toy"
sampling:
  seed: 7
  plan:
    split:
      kind: "holdout"
      test_fraction: 0.0
      val_fraction: 0.2
      stratify: true
      shuffle: true
    labeling:
      mode: "fraction"
      value: 0.1
      strategy: "balanced"
      min_per_class: 1
    imbalance:
      kind: "none"
    policy:
      respect_official_test: true
      allow_override_official: false
preprocess:
  seed: 7
  fit_on: "train_labeled"
  cache: true
  plan:
    output_key: "features.X"
    steps:
      - id: "core.ensure_2d"
      - id: "core.to_numpy"
graph:
  enabled: true
  seed: 7
  cache: true
  spec:
    scheme: "knn"
    metric: "euclidean"
    k: 8
    symmetrize: "mutual"
    weights:
      kind: "heat"
      sigma: 1.0
    normalize: "rw"
    self_loops: true
    backend: "numpy"
    chunk_size: 128
    feature_field: "features.X"
method:
  kind: "transductive"
  id: "label_propagation"
  device:
    device: "auto"
    dtype: "float32"
  params:
    max_iter: 50
    tol: 1.0e-4
    normalize_rows: true
evaluation:
  report_splits: ["val", "test"]
  metrics: ["accuracy", "macro_f1"]
Example with augmentation: [26]
run:
  name: best_text_inductive_softmatch_ag_news
  seed: 2
  output_dir: runs/inductive/softmatch/text/ag_news
  log_level: detailed
  fail_fast: true
dataset:
  id: ag_news
  download: true
  options:
    text_column: text
    label_column: label
    prefer_test_split: true
sampling:
  seed: 2
  plan:
    split:
      kind: holdout
      test_fraction: 0.2
      val_fraction: 0.1
      stratify: true
      shuffle: true
    labeling:
      mode: fraction
      value: 0.2
      strategy: balanced
      min_per_class: 1
    imbalance:
      kind: none
    policy:
      respect_official_test: true
      use_official_graph_masks: true
      allow_override_official: false
preprocess:
  seed: 2
  fit_on: train_labeled
  cache: true
  plan:
    output_key: features.X
    steps:
      - id: labels.encode
      - id: text.ensure_strings
      - id: text.sentence_transformer
        params:
          batch_size: 64
      - id: core.pca
        params:
          n_components: 128
      - id: core.to_torch
        params:
          device: "auto"
          dtype: float32
augmentation:
  enabled: true
  seed: 2
  mode: fixed
  modality: tabular
  weak:
    steps:
      - id: tabular.gaussian_noise
        params:
          std: 0.01
  strong:
    steps:
      - id: tabular.feature_dropout
        params:
          p: 0.2
method:
  kind: inductive
  id: softmatch
  device:
    device: "auto"
    dtype: float32
  params:
    lambda_u: 1.0
    temperature: 0.5
    ema_p: 0.999
    n_sigma: 2.0
    per_class: false
    dist_align: true
    dist_uniform: true
    hard_label: true
    use_cat: false
    batch_size: 128
    max_epochs: 50
    detach_target: true
  model:
    classifier_id: mlp
    classifier_backend: torch
    classifier_params:
      hidden_sizes:
        - 128
      activation: relu
      dropout: 0.1
      lr: 0.001
      weight_decay: 0.0
      batch_size: 256
      max_epochs: 50
    ema: false
evaluation:
  split_for_model_selection: val
  report_splits:
    - val
    - test
  metrics:
    - accuracy
    - macro_f1
25.16 Related links¶
Sources
[1] bench/configs/experiments/
[2] bench/schema.py
[3] bench/utils/io.py
[4] src/modssc/cli/_utils.py
[5] src/modssc/sampling/plan.py
[6] src/modssc/graph/specs.py
[7] src/modssc/preprocess/plan.py
[8] bench/main.py
[9] src/modssc/sampling/plan.py
[10] src/modssc/sampling/plan.py
[11] src/modssc/sampling/plan.py
[12] src/modssc/sampling/plan.py
[13] src/modssc/sampling/plan.py
[14] src/modssc/sampling/plan.py
[15] src/modssc/preprocess/plan.py
[16] src/modssc/preprocess/catalog.py
[17] src/modssc/views/plan.py
[18] src/modssc/data_augmentation/plan.py
[19] src/modssc/data_augmentation/registry.py
[20] src/modssc/data_augmentation/ops/
[21] src/modssc/data_augmentation/registry.py
[22] src/modssc/data_augmentation/ops/
[23] src/modssc/hpo/samplers.py
[24] bench/configs/experiments/toy_inductive.yaml
[25] bench/configs/experiments/toy_transductive.yaml
[26] bench/configs/experiments/minimal/inductive/pseudo_label/text/imdb.yaml