16. Configuration reference¶
This reference documents the YAML schemas used by the benchmark runner and CLI plans. For runnable configs, see Benchmarks.
Start with where files live, then how they are loaded, then the schema blocks you can edit. The full examples at the end are complete configs you can copy and adapt.
16.1 Where configs live¶
- Benchmark experiment configs: bench/configs/experiments/ (templates, toy configs, best/smoke configs). [1]
- Graph specs and preprocessing plans are embedded inside experiment configs when enabled. [1][2]
16.2 How configs are loaded¶
- Bench configs are read from YAML with bench.utils.io.load_yaml and validated via ExperimentConfig.from_dict. [3][2]
- CLI plan/spec files use load_yaml_or_json and are validated with their respective from_dict methods. [4][5][6]
- Preprocess plans are loaded from YAML with modssc.preprocess.plan.load_plan. [7]
16.3 Examples¶
Use the CLI run when you want to execute a config with the benchmark runner, and use the Python snippet when you want to load and inspect the config object in code. [8][2][3]
Run a bench config:
python -m bench.main --config bench/configs/experiments/toy_inductive.yaml
Load and inspect a config in Python:
from bench.schema import ExperimentConfig
from bench.utils.io import load_yaml
cfg = ExperimentConfig.from_dict(load_yaml("bench/configs/experiments/toy_inductive.yaml"))
print(cfg.method.kind, cfg.method.method_id)
The bench entry point, schema, and YAML loader are in bench/main.py, bench/schema.py, and bench/utils/io.py. [8][2][3]
16.4 Experiment config schema (bench)¶
Top-level keys and their required fields are defined in bench/schema.py (a minimal skeleton follows the list):
- run: name, seed, output_dir, fail_fast, optional log_level. [2]
- dataset: id, optional options, download, cache_dir. [2]
- sampling: seed, plan (sampling plan mapping). [2]
- preprocess: seed, fit_on, cache, plan (preprocess plan mapping). [2]
- method: kind (inductive|transductive), id, device, params, optional model. [2]
- evaluation: report_splits, metrics, optional split_for_model_selection. [2]
- Optional blocks: graph, views, augmentation, search. [2]
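The skeleton below is a structural sketch, not a runnable config: the plan mappings are elided (see 16.5 and 16.6), and the values mirror the toy configs in 16.11.
run:
  name: "my_experiment"        # free-form run name
  seed: 42
  output_dir: "runs"
  fail_fast: true
dataset:
  id: "toy"
sampling:
  seed: 42
  plan:
    # sampling plan mapping, see 16.5
preprocess:
  seed: 42
  fit_on: "train_labeled"
  cache: true
  plan:
    # preprocess plan mapping, see 16.6
method:
  kind: "inductive"
  id: "pseudo_label"
  device:
    device: "auto"
    dtype: "float32"
  params: {}                   # method hyperparameters, see the toy configs in 16.11
evaluation:
  report_splits: ["val", "test"]
  metrics: ["accuracy", "macro_f1"]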
16.5 Sampling plan schema¶
Sampling plans are validated by SamplingPlan.from_dict and include the following blocks (a sketch follows the list):
- split: kind (holdout|kfold) plus split parameters such as test_fraction, val_fraction, stratify, shuffle. [9][10][11][12][13][14][5]
- labeling: mode (fraction|count|per_class), value, strategy, min_per_class, optional fixed_indices. [5]
- imbalance: kind (none|subsample_max_per_class|long_tail) and its parameters. [5]
- policy: respect_official_test, use_official_graph_masks, allow_override_official. [5]
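The sampling plan mapping sits under sampling.plan in a bench config. A sketch in the shape used by the toy configs in 16.11 (holdout split, fraction labeling; values are illustrative):
split:
  kind: "holdout"            # or "kfold"
  test_fraction: 0.2
  val_fraction: 0.1
  stratify: true
  shuffle: true
labeling:
  mode: "fraction"           # or "count" / "per_class"
  value: 0.1
  strategy: "balanced"
  min_per_class: 1
imbalance:
  kind: "none"
policy:
  respect_official_test: true
  allow_override_official: false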
16.6 Preprocess plan schema¶
A preprocess plan YAML contains:
- output_key (set to features.X in the bundled configs). [15][7]
- steps: list of step mappings with id, params, optional modalities, requires_fields, enabled. [7]
Available step IDs and their metadata are listed in the built-in catalog. [16]
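A preprocess plan sketch using step IDs that appear in the bundled configs (params are illustrative; the plan sits under preprocess.plan in a bench config):
output_key: "features.X"
steps:
  - id: "core.ensure_2d"
  - id: "core.to_numpy"
  - id: "core.pca"
    params:
      n_components: 128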
16.7 Views plan schema¶
Views plans define multiple feature views (a hedged sketch follows the list):
- views: list of view definitions with name, optional preprocess, columns, and meta. [17]
- columns supports modes all, indices, random, complement. [17]
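The sketch below takes the top-level views list and per-view keys from the schema above; the internal keys of the columns mapping (mode, indices) are assumptions for illustration and should be checked against src/modssc/views/plan.py.
views:
  - name: "all_features"
    columns:
      mode: "all"            # assumed key name; documented modes: all|indices|random|complement
  - name: "subset"
    columns:
      mode: "indices"        # assumed key name
      indices: [0, 1, 2]     # assumed key for explicit column indices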
16.8 Graph builder spec schema¶
Graph specs match GraphBuilderSpec:
- scheme: knn|epsilon|anchor
- metric: cosine|euclidean
- k or radius depending on scheme
- symmetrize, weights, normalize, self_loops
- backend: auto|numpy|sklearn|faiss
- anchor and faiss parameters
All validation rules are defined in src/modssc/graph/specs.py. [6]
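A knn graph spec sketch matching the transductive example in 16.11; an epsilon scheme would use radius in place of k. The spec sits under graph.spec in a bench config.
scheme: "knn"
metric: "euclidean"
k: 8
symmetrize: "mutual"
weights:
  kind: "heat"
  sigma: 1.0
normalize: "rw"
self_loops: true
backend: "numpy"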
16.9 Augmentation plan schema¶
Augmentation plans include:
- steps: list of ops with id and params [18][19][2][20]
- modality: optional modality hint
Augmentation ops are registered in src/modssc/data_augmentation/ops/. [21][22]
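An augmentation plan sketch with an op ID and params taken from the ag_news config in 16.11; in that bench config the weak and strong step lists live under augmentation.weak and augmentation.strong, with the modality hint set once at the augmentation block level.
modality: tabular
steps:
  - id: tabular.gaussian_noise
    params:
      std: 0.01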
16.10 Search (HPO) schema¶
The bench search block includes:
- enabled, kind (grid|random), seed, n_trials, repeats.
- objective: split, metric, direction, aggregate.
- space: nested mapping of method.params.* to lists or distributions.
Validation rules are enforced by bench/schema.py, and distributions are defined in src/modssc/hpo/samplers.py. [2][23]
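A hedged sketch of a search block. The key names come from the list above; the objective values (direction, aggregate) and the list-of-candidates form of the space entries are assumptions used to illustrate the shape, so check bench/schema.py and src/modssc/hpo/samplers.py for the accepted values and distribution syntax.
search:
  enabled: true
  kind: "random"             # grid|random
  seed: 0
  n_trials: 20
  repeats: 1
  objective:
    split: "val"
    metric: "accuracy"
    direction: "maximize"    # assumed value
    aggregate: "mean"        # assumed value
  space:
    method.params.confidence_threshold: [0.7, 0.8, 0.9]   # assumed list-of-candidates form
    method.params.max_iter: [5, 10, 20]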
16.11 Complete example configs¶
Toy inductive experiment: [24]
run:
name: "toy_pseudo_label_numpy"
seed: 42
output_dir: "runs"
fail_fast: true
dataset:
id: "toy"
sampling:
seed: 42
plan:
split:
kind: "holdout"
test_fraction: 0.0
val_fraction: 0.2
stratify: true
shuffle: true
labeling:
mode: "fraction"
value: 0.2
strategy: "balanced"
min_per_class: 1
imbalance:
kind: "none"
policy:
respect_official_test: true
allow_override_official: false
preprocess:
seed: 42
fit_on: "train_labeled"
cache: true
plan:
output_key: "features.X"
steps:
- id: "core.ensure_2d"
- id: "core.to_numpy"
method:
kind: "inductive"
id: "pseudo_label"
device:
device: "auto"
dtype: "float32"
params:
classifier_id: "knn"
classifier_backend: "numpy"
max_iter: 5
confidence_threshold: 0.8
evaluation:
split_for_model_selection: "val"
report_splits: ["val", "test"]
metrics: ["accuracy", "macro_f1"]
Toy transductive experiment: [25]
run:
name: "toy_label_propagation_knn"
seed: 7
output_dir: "runs"
fail_fast: true
dataset:
id: "toy"
sampling:
seed: 7
plan:
split:
kind: "holdout"
test_fraction: 0.0
val_fraction: 0.2
stratify: true
shuffle: true
labeling:
mode: "fraction"
value: 0.1
strategy: "balanced"
min_per_class: 1
imbalance:
kind: "none"
policy:
respect_official_test: true
allow_override_official: false
preprocess:
seed: 7
fit_on: "train_labeled"
cache: true
plan:
output_key: "features.X"
steps:
- id: "core.ensure_2d"
- id: "core.to_numpy"
graph:
enabled: true
seed: 7
cache: true
spec:
scheme: "knn"
metric: "euclidean"
k: 8
symmetrize: "mutual"
weights:
kind: "heat"
sigma: 1.0
normalize: "rw"
self_loops: true
backend: "numpy"
chunk_size: 128
feature_field: "features.X"
method:
kind: "transductive"
id: "label_propagation"
device:
device: "auto"
dtype: "float32"
params:
max_iter: 50
tol: 1.0e-4
normalize_rows: true
evaluation:
report_splits: ["val", "test"]
metrics: ["accuracy", "macro_f1"]
Example with augmentation: [26]
run:
name: best_text_inductive_softmatch_ag_news
seed: 2
output_dir: runs/inductive/softmatch/text/ag_news
log_level: detailed
fail_fast: true
dataset:
id: ag_news
download: true
options:
text_column: text
label_column: label
prefer_test_split: true
sampling:
seed: 2
plan:
split:
kind: holdout
test_fraction: 0.2
val_fraction: 0.1
stratify: true
shuffle: true
labeling:
mode: fraction
value: 0.2
strategy: balanced
min_per_class: 1
imbalance:
kind: none
policy:
respect_official_test: true
use_official_graph_masks: true
allow_override_official: false
preprocess:
seed: 2
fit_on: train_labeled
cache: true
plan:
output_key: features.X
steps:
- id: labels.encode
- id: text.ensure_strings
- id: text.sentence_transformer
params:
batch_size: 64
- id: core.pca
params:
n_components: 128
- id: core.to_torch
params:
device: "auto"
dtype: float32
augmentation:
enabled: true
seed: 2
mode: fixed
modality: tabular
weak:
steps:
- id: tabular.gaussian_noise
params:
std: 0.01
strong:
steps:
- id: tabular.feature_dropout
params:
p: 0.2
method:
kind: inductive
id: softmatch
device:
device: "auto"
dtype: float32
params:
lambda_u: 1.0
temperature: 0.5
ema_p: 0.999
n_sigma: 2.0
per_class: false
dist_align: true
dist_uniform: true
hard_label: true
use_cat: false
batch_size: 128
max_epochs: 50
detach_target: true
model:
classifier_id: mlp
classifier_backend: torch
classifier_params:
hidden_sizes:
- 128
activation: relu
dropout: 0.1
lr: 0.001
weight_decay: 0.0
batch_size: 256
max_epochs: 50
ema: false
evaluation:
split_for_model_selection: val
report_splits:
- val
- test
metrics:
- accuracy
- macro_f1
Sources
- bench/configs/experiments/
- bench/schema.py
- bench/utils/io.py
- src/modssc/cli/_utils.py
- src/modssc/sampling/plan.py
- src/modssc/graph/specs.py
- src/modssc/preprocess/plan.py
- bench/main.py
- Field anchors: test_fraction, val_fraction, kfold, stratify, shuffle, features.X
- src/modssc/preprocess/catalog.py
- src/modssc/views/plan.py
- Field anchors: id, op_id
- src/modssc/data_augmentation/plan.py
- src/modssc/data_augmentation/registry.py
- src/modssc/data_augmentation/ops/
- src/modssc/hpo/samplers.py
- bench/configs/experiments/toy_inductive.yaml
- bench/configs/experiments/toy_transductive.yaml
- bench/configs/experiments/best/inductive/softmatch/text/ag_news.yaml