24. Graph API

This page documents the graph API. For workflows, see Graph how-to.

24.1 What it is for

The graph brick constructs similarity graphs and derives graph-based feature views. [1][2]

24.2 Examples

Build a kNN graph:

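import numpy as np
from modssc.graph import GraphBuilderSpec, build_graph

# 20 random samples with 8 features each
X = np.random.randn(20, 8).astype(np.float32)
# kNN graph under cosine distance, 3 neighbors per node
spec = GraphBuilderSpec(scheme="knn", metric="cosine", k=3)
# cache=False skips the on-disk graph cache
G = build_graph(X, spec=spec, seed=0, cache=False)
print(G.n_nodes, G.n_edges)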

Generate graph views:

from modssc.graph import GraphFeaturizerSpec, graph_to_views
from modssc.graph.artifacts import NodeDataset

# Reuses X and G from the previous example; the labels are placeholders
node_ds = NodeDataset(X=X, y=np.zeros((20,), dtype=np.int64), graph=G, masks={})
fspec = GraphFeaturizerSpec(views=("attr", "diffusion"))
views = graph_to_views(node_ds, spec=fspec, seed=0, cache=False)
print(list(views.views.keys()))

Specs and artifacts are defined in src/modssc/graph/specs.py and src/modssc/graph/artifacts.py. [3][4]

24.3 API reference

Graph utilities for ModSSC.

This package provides:

  • Graph construction: build a similarity graph from feature vectors (kNN, epsilon-ball, and anchor graphs).
  • Graph featurization: derive tabular views from a graph (attribute, diffusion, and structural embeddings).
  • Cache and fingerprints for reproducibility.

The graph representation is backend-agnostic. Optional backends exist (sklearn, faiss).
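
To make the kNN construction concrete, the following is a minimal brute-force sketch in plain NumPy. It is not the ModSSC implementation: the function name knn_cosine_edges is hypothetical, and the only thing it borrows from this page is the (2, E) edge_index layout.

```python
import numpy as np


def knn_cosine_edges(X, k):
    """Brute-force directed kNN edges under cosine distance (illustrative only)."""
    X = np.asarray(X, dtype=np.float64)
    n = X.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n_nodes")
    # Normalize rows so that a dot product equals the cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    dist = 1.0 - Xn @ Xn.T
    np.fill_diagonal(dist, np.inf)  # exclude self-matches
    # Indices of the k smallest distances in each row.
    nbrs = np.argsort(dist, axis=1)[:, :k]
    src = np.repeat(np.arange(n), k)
    dst = nbrs.reshape(-1)
    edge_index = np.stack([src, dst]).astype(np.int64)  # shape (2, n * k)
    distances = dist[src, dst].astype(np.float32)
    return edge_index, distances


rng = np.random.default_rng(0)
X = rng.standard_normal((20, 8)).astype(np.float32)
edge_index, distances = knn_cosine_edges(X, k=3)
print(edge_index.shape, distances.shape)  # (2, 60) (60,)
```

The dense distance matrix costs O(n^2) memory, which is presumably why the spec exposes a chunk_size knob and optional sklearn and faiss backends.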

24.4 DatasetViews dataclass

One or more tabular views derived from a dataset.

Source code in src/modssc/graph/artifacts.py
@dataclass(frozen=True)
class DatasetViews:
    """One or more tabular views derived from a dataset."""

    views: dict[str, Any]
    y: np.ndarray
    masks: dict[str, np.ndarray] = field(default_factory=dict)
    meta: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        y = _as_int64(self.y)
        object.__setattr__(self, "y", y)

        if not self.views:
            raise GraphValidationError("views cannot be empty")

        # validate consistent first dimension
        n: int | None = None
        for name, v in self.views.items():
            if not hasattr(v, "shape"):
                raise GraphValidationError(f"view {name!r} must expose a shape attribute")
            if int(v.ndim) != 2:
                raise GraphValidationError(f"view {name!r} must be 2D")
            if n is None:
                n = int(v.shape[0])
            elif int(v.shape[0]) != n:
                raise GraphValidationError("All views must have the same number of samples")

        assert n is not None
        if y.shape[0] != n:
            raise GraphValidationError("y must have the same first dimension as views")

        new_masks: dict[str, np.ndarray] = {}
        for k, v in self.masks.items():
            m = np.asarray(v, dtype=bool)
            if m.ndim != 1 or m.shape[0] != n:
                raise GraphValidationError(f"Mask {k!r} must have shape (n,)")
            new_masks[str(k)] = m
        object.__setattr__(self, "masks", new_masks)

24.5 GraphArtifact dataclass

Canonical graph representation.

24.5.0.1 Notes

For reproducible experiments, graph construction should be fingerprinted and cached (see modssc.graph.build_graph).

Source code in src/modssc/graph/artifacts.py
@dataclass(frozen=True)
class GraphArtifact:
    """Canonical graph representation.

    Notes
    -----
    For reproducible experiments, graph construction should be fingerprinted and cached
    (see :func:`modssc.graph.build_graph`).
    """

    n_nodes: int
    edge_index: np.ndarray
    edge_weight: np.ndarray | None = None

    directed: bool = True
    meta: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        ei = _as_int64(self.edge_index)
        if ei.ndim != 2 or ei.shape[0] != 2:
            raise GraphValidationError("edge_index must have shape (2, E)")
        if ei.size and (ei.min() < 0 or ei.max() >= self.n_nodes):
            raise GraphValidationError("edge_index contains node ids outside [0, n_nodes)")
        object.__setattr__(self, "edge_index", ei)

        if self.edge_weight is not None:
            ew = _as_float32(self.edge_weight)
            if ew.ndim != 1 or ew.shape[0] != ei.shape[1]:
                raise GraphValidationError("edge_weight must have shape (E,)")
            object.__setattr__(self, "edge_weight", ew)

    @property
    def n_edges(self) -> int:
        return int(self.edge_index.shape[1])

    def to_dict(self) -> dict[str, Any]:
        return {
            "n_nodes": int(self.n_nodes),
            "n_edges": int(self.n_edges),
            "directed": bool(self.directed),
            "has_edge_weight": self.edge_weight is not None,
            "meta": dict(self.meta),
        }
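
The (2, E) edge list can be expanded into a dense adjacency matrix for inspection on small graphs. The sketch below is standalone NumPy; the helper to_dense_adjacency is hypothetical, and it assumes that row 0 of edge_index holds source nodes and row 1 holds target nodes, which this page does not state explicitly.

```python
import numpy as np


def to_dense_adjacency(n_nodes, edge_index, edge_weight=None):
    """Expand a (2, E) edge list into a dense (n_nodes, n_nodes) matrix."""
    ei = np.asarray(edge_index, dtype=np.int64)
    if ei.ndim != 2 or ei.shape[0] != 2:
        raise ValueError("edge_index must have shape (2, E)")
    if ei.size and (ei.min() < 0 or ei.max() >= n_nodes):
        raise ValueError("edge_index contains node ids outside [0, n_nodes)")
    if edge_weight is None:
        w = np.ones(ei.shape[1], dtype=np.float32)  # unweighted graph
    else:
        w = np.asarray(edge_weight, dtype=np.float32)
        if w.shape != (ei.shape[1],):
            raise ValueError("edge_weight must have shape (E,)")
    A = np.zeros((n_nodes, n_nodes), dtype=np.float32)
    # Assumed convention: row 0 = source node, row 1 = target node.
    np.add.at(A, (ei[0], ei[1]), w)  # duplicate edges accumulate
    return A


edge_index = np.array([[0, 1, 2], [1, 2, 0]])
A = to_dense_adjacency(3, edge_index, edge_weight=[0.5, 1.0, 2.0])
print(A)
```

The checks mirror the ones in GraphArtifact.__post_init__ above, so invalid inputs fail in the same situations.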

24.6 GraphBuilderSpec dataclass

Graph construction specification.

24.6.0.1 Notes

This spec is designed to be serializable (via to_dict) and stable, so that it can be fingerprinted for reproducibility.

Adds:

  • anchor scheme (approximate kNN via anchors)
  • faiss backend (optional)
  • chunk_size knob (for chunked numpy computations and resumable work dirs)
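
Why a stable to_dict matters for fingerprinting can be shown with a small standard-library sketch. The function fingerprint_dict_sketch is hypothetical; the hashing scheme that ModSSC actually uses is not shown on this page, so SHA-256 over canonical JSON is an assumption made for illustration.

```python
import hashlib
import json


def fingerprint_dict_sketch(d):
    """Hash a JSON-serializable dict independently of key order."""
    # sort_keys and fixed separators give one canonical text form per content.
    payload = json.dumps(d, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


spec_a = {"scheme": "knn", "metric": "cosine", "k": 3}
spec_b = {"k": 3, "metric": "cosine", "scheme": "knn"}  # same content, new order
spec_c = {"scheme": "knn", "metric": "cosine", "k": 4}  # one value changed

fp_a = fingerprint_dict_sketch(spec_a)
fp_b = fingerprint_dict_sketch(spec_b)
fp_c = fingerprint_dict_sketch(spec_c)
print(fp_a == fp_b, fp_a == fp_c)  # True False
```

Any change to a spec field changes the fingerprint, so a cached graph is only reused when the spec is identical.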

Source code in src/modssc/graph/specs.py
@dataclass(frozen=True)
class GraphBuilderSpec:
    """Graph construction specification.

    Notes
    -----
    This spec is designed to be serializable (via :meth:`to_dict`) and stable,
    so that it can be fingerprinted for reproducibility.

    Adds:
    - anchor scheme (approximate kNN via anchors)
    - faiss backend (optional)
    - chunk_size knob (for chunked numpy computations and resumable work dirs)
    """

    # main knobs
    scheme: Scheme = "knn"
    metric: Metric = "cosine"

    # scheme parameters
    k: int | None = 30
    radius: float | None = None  # epsilon

    # post-processing
    symmetrize: Symmetrize = "mutual"
    weights: GraphWeightsSpec = GraphWeightsSpec("heat", sigma=0.5)
    normalize: Normalize = "rw"
    self_loops: bool = True

    # backend selection
    backend: Backend = "auto"
    chunk_size: int = 512

    # where to read features from (when using higher-level orchestration)
    feature_field: str = "features.X"

    # anchor scheme
    n_anchors: int | None = None
    anchors_k: int = 5
    anchors_method: AnchorMethod = "random"
    candidate_limit: int = 1000

    # faiss backend (optional dependency)
    faiss_exact: bool = False
    faiss_hnsw_m: int = 32
    faiss_ef_search: int = 64
    faiss_ef_construction: int = 200

    def validate(self) -> None:
        if self.metric not in ("cosine", "euclidean"):
            raise GraphValidationError(f"Unknown metric: {self.metric!r}")

        if self.scheme == "knn":
            if self.k is None or int(self.k) <= 0:
                raise GraphValidationError("k must be a positive integer for knn scheme")
        elif self.scheme == "epsilon":
            if self.radius is None or float(self.radius) <= 0:
                raise GraphValidationError("radius must be > 0 for epsilon scheme")
        elif self.scheme == "anchor":
            if self.k is None or int(self.k) <= 0:
                raise GraphValidationError(
                    "k must be a positive integer for anchor scheme (final neighbors)"
                )
            if int(self.anchors_k) <= 0:
                raise GraphValidationError("anchors_k must be a positive integer")
            if self.n_anchors is not None and int(self.n_anchors) <= 0:
                raise GraphValidationError("n_anchors must be a positive integer when provided")
            if int(self.candidate_limit) <= 0:
                raise GraphValidationError("candidate_limit must be > 0")
            if self.anchors_method not in ("random", "kmeans"):
                raise GraphValidationError(f"Unknown anchors_method: {self.anchors_method!r}")
        else:
            raise GraphValidationError(f"Unknown scheme: {self.scheme!r}")

        if self.symmetrize not in ("none", "or", "mutual"):
            raise GraphValidationError(f"Unknown symmetrize mode: {self.symmetrize!r}")
        if self.normalize not in ("none", "rw", "sym"):
            raise GraphValidationError(f"Unknown normalize mode: {self.normalize!r}")

        if self.backend not in ("auto", "numpy", "sklearn", "faiss"):
            raise GraphValidationError(f"Unknown backend: {self.backend!r}")

        if int(self.chunk_size) <= 0:
            raise GraphValidationError("chunk_size must be > 0")

        # backend-specific constraints
        if self.backend == "faiss" and self.scheme == "epsilon":
            raise GraphValidationError("faiss backend does not support epsilon scheme")

        if int(self.faiss_hnsw_m) <= 0:
            raise GraphValidationError("faiss_hnsw_m must be > 0")
        if int(self.faiss_ef_search) <= 0:
            raise GraphValidationError("faiss_ef_search must be > 0")
        if int(self.faiss_ef_construction) <= 0:
            raise GraphValidationError("faiss_ef_construction must be > 0")

        self.weights.validate(metric=self.metric)

    def to_dict(self) -> dict[str, Any]:
        return {
            "scheme": self.scheme,
            "metric": self.metric,
            "k": self.k,
            "radius": self.radius,
            "symmetrize": self.symmetrize,
            "weights": self.weights.to_dict(),
            "normalize": self.normalize,
            "self_loops": self.self_loops,
            "backend": self.backend,
            "chunk_size": int(self.chunk_size),
            "feature_field": self.feature_field,
            "n_anchors": self.n_anchors,
            "anchors_k": int(self.anchors_k),
            "anchors_method": self.anchors_method,
            "candidate_limit": int(self.candidate_limit),
            "faiss_exact": bool(self.faiss_exact),
            "faiss_hnsw_m": int(self.faiss_hnsw_m),
            "faiss_ef_search": int(self.faiss_ef_search),
            "faiss_ef_construction": int(self.faiss_ef_construction),
        }

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> GraphBuilderSpec:
        # keep backward compatibility: missing keys fall back to legacy defaults
        return cls(
            scheme=str(d.get("scheme", "knn")),  # type: ignore[arg-type]
            metric=str(d.get("metric", "cosine")),  # type: ignore[arg-type]
            k=d.get("k"),
            radius=d.get("radius"),
            symmetrize=str(d.get("symmetrize", "mutual")),  # type: ignore[arg-type]
            weights=GraphWeightsSpec.from_dict(dict(d.get("weights", {}))),
            normalize=str(d.get("normalize", "rw")),  # type: ignore[arg-type]
            self_loops=bool(d.get("self_loops", True)),
            backend=str(d.get("backend", "auto")),  # type: ignore[arg-type]
            chunk_size=int(d.get("chunk_size", 512)),
            feature_field=str(d.get("feature_field", "features.X")),
            n_anchors=d.get("n_anchors"),
            anchors_k=int(d.get("anchors_k", 5)),
            anchors_method=str(d.get("anchors_method", "random")),  # type: ignore[arg-type]
            candidate_limit=int(d.get("candidate_limit", 1000)),
            faiss_exact=bool(d.get("faiss_exact", False)),
            faiss_hnsw_m=int(d.get("faiss_hnsw_m", 32)),
            faiss_ef_search=int(d.get("faiss_ef_search", 64)),
            faiss_ef_construction=int(d.get("faiss_ef_construction", 200)),
        )
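
The symmetrize field accepts "none", "or", and "mutual". The sketch below shows one common reading of these modes on a dense boolean adjacency matrix: "or" keeps an edge if either direction exists, "mutual" keeps it only if both do. This interpretation is an assumption; the symmetrize_edges helper used by build_graph is not listed on this page, and symmetrize_adjacency is a hypothetical name.

```python
import numpy as np


def symmetrize_adjacency(A, mode):
    """Symmetrize a dense boolean adjacency matrix (assumed semantics)."""
    A = np.asarray(A, dtype=bool)
    if mode == "none":
        return A.copy()
    if mode == "or":
        return A | A.T  # edge kept if present in either direction
    if mode == "mutual":
        return A & A.T  # edge kept only if present in both directions
    raise ValueError(f"Unknown symmetrize mode: {mode!r}")


A = np.zeros((3, 3), dtype=bool)
A[0, 1] = A[1, 0] = True  # reciprocal pair
A[0, 2] = True            # one-way edge

A_or = symmetrize_adjacency(A, "or")
A_mutual = symmetrize_adjacency(A, "mutual")
print(int(A_or.sum()), int(A_mutual.sum()))  # 4 2
```

Under this reading, "mutual" yields sparser graphs than "or", which can leave nodes isolated when k is small.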

24.7 GraphFeaturizerSpec dataclass

Featurization spec to produce inductive views from a graph.

24.7.0.1 Views

  • attr: returns the original attribute matrix X.
  • diffusion: returns a simple diffusion of attributes over the graph.
  • struct: returns structural embeddings (DeepWalk/Node2Vec-style) computed from the graph only (X is ignored).
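
As an illustration of the diffusion view, the sketch below propagates attributes with a restart term, H <- (1 - alpha) * P @ H + alpha * X, for a fixed number of steps. This update rule is an assumption in the style of personalized PageRank; the page does not state the formula used by ModSSC, and the function diffuse and the transition matrix P are hypothetical names.

```python
import numpy as np


def diffuse(X, P, steps, alpha):
    """Iterate H <- (1 - alpha) * P @ H + alpha * X, starting from H = X."""
    X = np.asarray(X, dtype=np.float64)
    H = X.copy()
    for _ in range(steps):
        H = (1.0 - alpha) * (P @ H) + alpha * X
    return H


# 3-node path graph with self-loops, row-normalized into a transition matrix.
A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=np.float64)
P = A / A.sum(axis=1, keepdims=True)

X = np.array([[1.0], [0.0], [0.0]])  # one feature, concentrated on node 0
H = diffuse(X, P, steps=5, alpha=0.1)
print(H.ravel())
```

With steps=0 or alpha=1 the output equals X, which matches the ranges accepted by GraphFeaturizerSpec.validate (diffusion_steps >= 0, diffusion_alpha in [0, 1]).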

24.7.0.2 Notes

  • The struct view is deterministic given the seed.
  • For large graphs, struct view may require optional dependencies.
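
The walk-generation half of a DeepWalk-style embedding can be sketched as seeded uniform random walks, which also shows why the result is deterministic given the seed. This is not the ModSSC implementation: random_walks is a hypothetical helper, the node2vec p and q biases are omitted, and the skip-gram step that consumes window_size and struct_dim is not shown.

```python
import numpy as np


def random_walks(edge_index, n_nodes, walk_length, num_walks_per_node, seed):
    """Uniform random walks over a (2, E) edge list, one row per walk."""
    rng = np.random.default_rng(seed)
    src, dst = np.asarray(edge_index, dtype=np.int64)
    # neighbors[i] holds the targets of all edges leaving node i.
    neighbors = [dst[src == i] for i in range(n_nodes)]
    walks = []
    for _ in range(num_walks_per_node):
        for start in range(n_nodes):
            walk = [start]
            while len(walk) < walk_length:
                nbrs = neighbors[walk[-1]]
                if nbrs.size == 0:
                    walk.append(walk[-1])  # dead end: stay in place
                else:
                    walk.append(int(rng.choice(nbrs)))
            walks.append(walk)
    return np.asarray(walks, dtype=np.int64)


# Undirected 4-node ring, stored as edges in both directions.
edge_index = np.array([[0, 1, 1, 2, 2, 3, 3, 0],
                       [1, 0, 2, 1, 3, 2, 0, 3]])
walks = random_walks(edge_index, n_nodes=4, walk_length=5,
                     num_walks_per_node=2, seed=0)
print(walks.shape)  # (8, 5)
```

Calling the function again with the same seed reproduces the same walks; a different seed generally does not.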

Source code in src/modssc/graph/specs.py
@dataclass(frozen=True)
class GraphFeaturizerSpec:
    """Featurization spec to produce inductive views from a graph.

    Views
    -----
    attr:
        returns the original attribute matrix X
    diffusion:
        returns a simple diffusion of attributes over the graph
    struct:
        returns structural embeddings (DeepWalk/Node2Vec-style) computed from the graph
        only (X is ignored).

    Notes
    -----
    - The struct view is deterministic given the seed.
    - For large graphs, struct view may require optional dependencies.
    """

    views: tuple[ViewName, ...] = ("attr",)

    # diffusion
    diffusion_steps: int = 5
    diffusion_alpha: float = 0.1

    # struct
    struct_method: StructMethod = "deepwalk"
    struct_dim: int = 64
    walk_length: int = 40
    num_walks_per_node: int = 10
    window_size: int = 5
    p: float = 1.0
    q: float = 1.0

    cache: bool = True

    def validate(self) -> None:
        if self.diffusion_steps < 0:
            raise GraphValidationError("diffusion_steps must be >= 0")
        if not (0.0 <= float(self.diffusion_alpha) <= 1.0):
            raise GraphValidationError("diffusion_alpha must be in [0, 1]")

        if not self.views:
            raise GraphValidationError("views cannot be empty")
        for v in self.views:
            if v not in ("attr", "diffusion", "struct"):
                raise GraphValidationError(f"Unknown view: {v!r}")

        if self.struct_method not in ("deepwalk", "node2vec"):
            raise GraphValidationError(f"Unknown struct_method: {self.struct_method!r}")
        if int(self.struct_dim) <= 0:
            raise GraphValidationError("struct_dim must be > 0")
        if int(self.walk_length) <= 1:
            raise GraphValidationError("walk_length must be > 1")
        if int(self.num_walks_per_node) <= 0:
            raise GraphValidationError("num_walks_per_node must be > 0")
        if int(self.window_size) <= 0:
            raise GraphValidationError("window_size must be > 0")
        if float(self.p) <= 0:
            raise GraphValidationError("p must be > 0")
        if float(self.q) <= 0:
            raise GraphValidationError("q must be > 0")

    def to_dict(self) -> dict[str, Any]:
        return {
            "views": list(self.views),
            "diffusion_steps": int(self.diffusion_steps),
            "diffusion_alpha": float(self.diffusion_alpha),
            "struct_method": self.struct_method,
            "struct_dim": int(self.struct_dim),
            "walk_length": int(self.walk_length),
            "num_walks_per_node": int(self.num_walks_per_node),
            "window_size": int(self.window_size),
            "p": float(self.p),
            "q": float(self.q),
            "cache": bool(self.cache),
        }

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> GraphFeaturizerSpec:
        views = tuple(d.get("views", ["attr"]))
        return cls(
            views=views,  # type: ignore[arg-type]
            diffusion_steps=int(d.get("diffusion_steps", 5)),
            diffusion_alpha=float(d.get("diffusion_alpha", 0.1)),
            struct_method=str(d.get("struct_method", "deepwalk")),  # type: ignore[arg-type]
            struct_dim=int(d.get("struct_dim", 64)),
            walk_length=int(d.get("walk_length", 40)),
            num_walks_per_node=int(d.get("num_walks_per_node", 10)),
            window_size=int(d.get("window_size", 5)),
            p=float(d.get("p", 1.0)),
            q=float(d.get("q", 1.0)),
            cache=bool(d.get("cache", True)),
        )

24.8 GraphWeightsSpec dataclass

Specification for edge weights.

24.8.0.1 Parameters

kind:

  • "binary": all edges have weight 1
  • "heat": exp(-d^2/(2*sigma^2))
  • "cosine": convert cosine distances into similarities (1 - d)

sigma: Used only for kind="heat".
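
The three weight kinds map distances to weights with the formulas listed above. A minimal NumPy sketch follows; edge_weights_from_distances is a hypothetical name, not the compute_edge_weights helper called by build_graph.

```python
import numpy as np


def edge_weights_from_distances(distances, kind="binary", sigma=None):
    """Turn per-edge distances d into weights using the documented formulas."""
    d = np.asarray(distances, dtype=np.float64)
    if kind == "binary":
        return np.ones_like(d)
    if kind == "heat":
        if sigma is None or sigma <= 0:
            raise ValueError("sigma must be > 0 for heat weights")
        return np.exp(-(d ** 2) / (2.0 * sigma ** 2))
    if kind == "cosine":
        return 1.0 - d  # cosine distance -> cosine similarity
    raise ValueError(f"Unknown weight kind: {kind!r}")


d = np.array([0.0, 0.5, 1.0])
w_binary = edge_weights_from_distances(d, "binary")
w_heat = edge_weights_from_distances(d, "heat", sigma=0.5)
w_cosine = edge_weights_from_distances(d, "cosine")
print(w_binary, w_heat, w_cosine)
```

With sigma=0.5, a distance of 0.5 maps to exp(-0.5) and a distance of 1.0 maps to exp(-2), so the heat kernel decays quickly once d exceeds sigma.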

Source code in src/modssc/graph/specs.py
@dataclass(frozen=True)
class GraphWeightsSpec:
    """Specification for edge weights.

    Parameters
    ----------
    kind:
        - "binary": all edges weight 1
        - "heat": exp(-d^2/(2*sigma^2))
        - "cosine": convert cosine distances into similarities (1 - d)
    sigma:
        Used only for kind="heat".
    """

    kind: WeightKind = "binary"
    sigma: float | None = None

    def validate(self, *, metric: Metric) -> None:
        if self.kind not in ("binary", "heat", "cosine"):
            raise GraphValidationError(f"Unknown weight kind: {self.kind!r}")
        if self.kind == "heat":
            sigma = float(self.sigma or 0.0)
            if sigma <= 0:
                raise GraphValidationError("sigma must be > 0 for heat weights")
        if self.kind == "cosine" and metric != "cosine":
            raise GraphValidationError("cosine weights require metric='cosine'")

    def to_dict(self) -> dict[str, Any]:
        return {"kind": self.kind, "sigma": self.sigma}

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> GraphWeightsSpec:
        return cls(kind=str(d.get("kind", "binary")), sigma=d.get("sigma"))

24.9 NodeDataset dataclass

Node classification dataset for transductive methods.

Source code in src/modssc/graph/artifacts.py
@dataclass(frozen=True)
class NodeDataset:
    """Node classification dataset for transductive methods."""

    X: Any
    y: np.ndarray
    graph: GraphArtifact
    masks: dict[str, np.ndarray] = field(default_factory=dict)
    meta: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Validate X first dimension (works for numpy and scipy sparse)
        if not hasattr(self.X, "shape"):
            raise GraphValidationError("X must expose a shape attribute")
        if int(self.X.shape[0]) != int(self.graph.n_nodes):
            raise GraphValidationError("X must have shape (n_nodes, d)")

        y = _as_int64(self.y)
        if y.ndim not in (1, 2):
            raise GraphValidationError("y must have shape (n,) or (n, C)")
        if y.shape[0] != self.graph.n_nodes:
            raise GraphValidationError("y must have the same first dimension as graph.n_nodes")
        object.__setattr__(self, "y", y)

        new_masks: dict[str, np.ndarray] = {}
        for k, v in self.masks.items():
            m = np.asarray(v, dtype=bool)
            if m.ndim != 1 or m.shape[0] != self.graph.n_nodes:
                raise GraphValidationError(f"Mask {k!r} must have shape (n_nodes,)")
            new_masks[str(k)] = m
        object.__setattr__(self, "masks", new_masks)
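
NodeDataset expects each mask to be a boolean vector of shape (n_nodes,). One way to build such masks is sketched below. The key names "train", "val", and "test" and the helper make_split_masks are assumptions for illustration; this page does not prescribe particular mask names.

```python
import numpy as np


def make_split_masks(n_nodes, train_frac=0.6, val_frac=0.2, seed=0):
    """Return disjoint boolean masks of shape (n_nodes,) covering all nodes."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_nodes)
    n_train = int(round(train_frac * n_nodes))
    n_val = int(round(val_frac * n_nodes))
    masks = {name: np.zeros(n_nodes, dtype=bool) for name in ("train", "val", "test")}
    masks["train"][perm[:n_train]] = True
    masks["val"][perm[n_train:n_train + n_val]] = True
    masks["test"][perm[n_train + n_val:]] = True  # remainder
    return masks


masks = make_split_masks(20)
print({name: int(m.sum()) for name, m in masks.items()})
```

A dict built this way could be passed as the masks argument of NodeDataset, provided n_nodes matches graph.n_nodes.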

24.10 build_graph(X, *, spec, seed=0, dataset_fingerprint=None, preprocess_fingerprint=None, cache=True, cache_dir=None, edge_shard_size=None, resume=True)

Build a graph from a dense feature matrix.

24.10.0.1 Parameters

  • X: A 2D dense array-like of shape (n_nodes, n_features).
  • spec: GraphBuilderSpec controlling scheme/backend/weights/normalization.
  • seed: Seed used for deterministic components (notably the anchor scheme).
  • dataset_fingerprint: Optional precomputed fingerprint for X (useful when X is already cached upstream).
  • preprocess_fingerprint: Optional fingerprint of the preprocessing pipeline.
  • cache: Whether to cache the built graph on disk.
  • cache_dir: Override the default cache directory.
  • edge_shard_size: If provided, store the edge arrays in sharded .npz files with at most this many edges per shard.
  • resume: If True and cache=True, partial numpy chunk computations are resumed from the cache entry work directory when available.
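
The idea behind edge_shard_size can be sketched as splitting the edge arrays into .npz files that each hold a bounded number of edges. The file naming, the array keys, and both helper functions below are hypothetical; the on-disk layout used by GraphCache is not documented on this page.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_edge_shards(out_dir, edge_index, edge_weight, shard_size):
    """Write edges to .npz shards holding at most shard_size edges each."""
    if shard_size <= 0:
        raise ValueError("shard_size must be > 0")
    out_dir = Path(out_dir)
    n_edges = edge_index.shape[1]
    paths = []
    for i, start in enumerate(range(0, n_edges, shard_size)):
        stop = min(start + shard_size, n_edges)
        path = out_dir / f"edges_{i:05d}.npz"  # hypothetical naming scheme
        np.savez(path, edge_index=edge_index[:, start:stop],
                 edge_weight=edge_weight[start:stop])
        paths.append(path)
    return paths


def load_edge_shards(paths):
    """Concatenate shards back into full edge arrays, in shard order."""
    index_parts, weight_parts = [], []
    for p in paths:
        with np.load(p) as shard:
            index_parts.append(shard["edge_index"])
            weight_parts.append(shard["edge_weight"])
    return np.concatenate(index_parts, axis=1), np.concatenate(weight_parts)


edge_index = np.vstack([np.arange(10), (np.arange(10) + 1) % 10]).astype(np.int64)
edge_weight = np.linspace(0.1, 1.0, 10).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    paths = save_edge_shards(tmp, edge_index, edge_weight, shard_size=4)
    n_shards = len(paths)
    loaded_index, loaded_weight = load_edge_shards(paths)
print(n_shards)  # 3
```

Ten edges with a shard size of 4 produce shards of 4, 4, and 2 edges, and reloading them in order restores the original arrays.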

24.10.0.2 Returns

GraphArtifact

Source code in src/modssc/graph/construction/api.py
def build_graph(
    X: Any,
    *,
    spec: GraphBuilderSpec,
    seed: int = 0,
    dataset_fingerprint: str | None = None,
    preprocess_fingerprint: str | None = None,
    cache: bool = True,
    cache_dir: str | Path | None = None,
    edge_shard_size: int | None = None,
    resume: bool = True,
) -> GraphArtifact:
    """Build a graph from a dense feature matrix.

    Parameters
    ----------
    X:
        A 2D dense array-like of shape (n_nodes, n_features).
    spec:
        GraphBuilderSpec controlling scheme/backend/weights/normalization.
    seed:
        Seed used for deterministic components (notably the anchor scheme).
    dataset_fingerprint:
        Optional precomputed fingerprint for X (useful when X is already cached upstream).
    preprocess_fingerprint:
        Optional fingerprint of the preprocessing pipeline.
    cache:
        Whether to cache the built graph on disk.
    cache_dir:
        Override the default cache directory.
    edge_shard_size:
        If provided, store the edge arrays in sharded `.npz` files with at most this many
        edges per shard.
    resume:
        If True and `cache=True`, partial numpy chunk computations are resumed from the
        cache entry work directory when available.

    Returns
    -------
    GraphArtifact
    """
    start = perf_counter()
    validate_features(X)
    validate_builder_spec(spec)

    X_arr = np.asarray(X)
    n_nodes = int(X_arr.shape[0])

    ds_fp = dataset_fingerprint or fingerprint_array(X_arr)
    spec_fp = fingerprint_dict(spec.to_dict())
    g_fp = _graph_fingerprint(
        dataset_fingerprint=ds_fp,
        preprocess_fingerprint=preprocess_fingerprint,
        spec=spec,
        seed=int(seed),
    )

    cache_store = GraphCache(
        root=Path(cache_dir) if cache_dir is not None else GraphCache.default().root,
        edge_shard_size=edge_shard_size,
    )

    if cache and cache_store.exists(g_fp):
        graph, _ = cache_store.load(g_fp)
        logger.info(
            "Graph cached: fingerprint=%s n_nodes=%s n_edges=%s duration_s=%.3f",
            g_fp,
            graph.n_nodes,
            int(graph.edge_index.shape[1]),
            perf_counter() - start,
        )
        return graph

    # Optional resumable work directory inside the cache entry (only used by numpy backend).
    work_dir: Path | None = None
    if cache and resume:
        work_dir = cache_store.entry_dir(g_fp) / "_work"
        work_dir.mkdir(parents=True, exist_ok=True)

    # Build raw edges + distances
    logger.info(
        "Graph build start: scheme=%s metric=%s backend=%s n_nodes=%s seed=%s",
        spec.scheme,
        spec.metric,
        spec.backend,
        n_nodes,
        seed,
    )
    if spec.scheme == "knn" and spec.k is not None and int(spec.k) <= 1:
        logger.warning("Graph spec k is very small: k=%s", spec.k)
    if spec.scheme == "epsilon" and spec.radius is not None and float(spec.radius) <= 0:
        logger.warning("Graph spec radius is non-positive: radius=%s", spec.radius)

    edge_index, distances = build_raw_edges(
        X_arr,
        spec=spec,
        seed=int(seed),
        work_dir=work_dir,
        resume=bool(resume),
    )

    # Turn distances into weights
    edge_weight = compute_edge_weights(
        distances=distances, weights=spec.weights, metric=spec.metric
    )

    # Post-process graph
    if spec.symmetrize != "none":
        edge_index, edge_weight = symmetrize_edges(
            n_nodes=n_nodes,
            edge_index=edge_index,
            edge_weight=edge_weight,
            mode=spec.symmetrize,
        )

    if spec.self_loops:
        edge_index, edge_weight = add_self_loops(
            n_nodes=n_nodes, edge_index=edge_index, edge_weight=edge_weight
        )

    if spec.normalize != "none":
        edge_weight = normalize_edge_weights(
            n_nodes=n_nodes, edge_index=edge_index, edge_weight=edge_weight, mode=spec.normalize
        )

    if edge_weight is not None and not np.isfinite(edge_weight).all():
        raise GraphValidationError(
            "Non-finite edge weights detected (check input features and spec)"
        )

    graph = GraphArtifact(
        n_nodes=n_nodes,
        edge_index=edge_index,
        edge_weight=edge_weight,
        directed=(spec.symmetrize == "none"),
        meta={
            "fingerprint": g_fp,
            "dataset_fingerprint": ds_fp,
            "preprocess_fingerprint": preprocess_fingerprint,
            "spec_fingerprint": spec_fp,
            "seed": int(seed),
        },
    )

    if cache:
        manifest = {
            "fingerprint": g_fp,
            "dataset_fingerprint": ds_fp,
            "preprocess_fingerprint": preprocess_fingerprint,
            "spec": spec.to_dict(),
            "spec_fingerprint": spec_fp,
            "seed": int(seed),
        }
        cache_store.save(fingerprint=g_fp, graph=graph, manifest=manifest, overwrite=True)

    duration = perf_counter() - start
    logger.info(
        "Graph build done: fingerprint=%s n_nodes=%s n_edges=%s duration_s=%.3f",
        g_fp,
        n_nodes,
        int(edge_index.shape[1]),
        duration,
    )
    if logger.isEnabledFor(logging.DEBUG) and edge_index.size and edge_index.shape[1] <= 5_000_000:
        min_deg, mean_deg, max_deg, zero_deg = _degree_summary(edge_index, n_nodes)
        logger.debug(
            "Graph degrees: min=%s mean=%.2f max=%s zero=%s",
            min_deg,
            mean_deg,
            max_deg,
            zero_deg,
        )
        if n_nodes and zero_deg / float(n_nodes) > 0.2:
            logger.warning("Graph has many isolated nodes: zero_degree=%s", zero_deg)

    return graph
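
The normalize step accepts "rw" and "sym". Under the usual definitions these are random-walk normalization (D^-1 W) and symmetric normalization (D^-1/2 W D^-1/2), which the sketch below applies directly to an edge list. That reading is an assumption, since normalize_edge_weights is not listed on this page, and normalize_edge_weights_sketch is a hypothetical name.

```python
import numpy as np


def normalize_edge_weights_sketch(n_nodes, edge_index, edge_weight, mode):
    """Normalize per-edge weights by weighted degree (assumed semantics)."""
    src, dst = np.asarray(edge_index, dtype=np.int64)
    w = np.asarray(edge_weight, dtype=np.float64)
    # Weighted out-degree of each node; equals the in-degree on symmetric graphs.
    deg = np.bincount(src, weights=w, minlength=n_nodes)
    safe = np.where(deg > 0, deg, 1.0)  # avoid dividing by zero
    if mode == "none":
        return w.copy()
    if mode == "rw":
        return w / safe[src]  # each node's outgoing weights sum to 1
    if mode == "sym":
        return w / np.sqrt(safe[src] * safe[dst])
    raise ValueError(f"Unknown normalize mode: {mode!r}")


# Undirected 3-node path with self-loops and unit weights.
edge_index = np.array([[0, 0, 1, 1, 1, 2, 2],
                       [0, 1, 0, 1, 2, 1, 2]])
edge_weight = np.ones(7)

w_rw = normalize_edge_weights_sketch(3, edge_index, edge_weight, "rw")
w_sym = normalize_edge_weights_sketch(3, edge_index, edge_weight, "sym")
print(w_rw, w_sym)
```

Because build_graph adds self-loops before normalizing, every node has a nonzero degree when self_loops=True, so the zero-degree guard only matters for graphs built without them.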

24.11 graph_to_views(dataset, *, spec, seed=0, cache=None, cache_dir=None)

Compute one or more views from a (graph, X) dataset.

Source code in src/modssc/graph/featurization/api.py
def graph_to_views(
    dataset: NodeDataset,
    *,
    spec: GraphFeaturizerSpec,
    seed: int = 0,
    cache: bool | None = None,
    cache_dir: str | Path | None = None,
) -> DatasetViews:
    """Compute one or more views from a (graph, X) dataset."""
    start = perf_counter()
    validate_featurizer_spec(spec)

    graph_fp = str(dataset.graph.meta.get("fingerprint", "")) if dataset.graph.meta else ""
    if not graph_fp:
        # fallback: fingerprint of graph structure not available
        graph_fp = fingerprint_dict(
            {
                "n_nodes": int(dataset.graph.n_nodes),
                "edge_index": dataset.graph.edge_index[
                    :2, : min(1000, dataset.graph.edge_index.shape[1])
                ].tolist(),
            }
        )

    views_fp = _views_fingerprint(graph_fingerprint=graph_fp, spec=spec, seed=int(seed))

    cache_enabled = bool(spec.cache) if cache is None else bool(cache)
    cache_store = ViewsCache(
        root=Path(cache_dir) if cache_dir is not None else ViewsCache.default().root
    )

    if cache_enabled and cache_store.exists(views_fp):
        cached, _ = cache_store.load(views_fp, y=np.asarray(dataset.y), masks=dataset.masks)
        logger.info(
            "Graph views cached: fingerprint=%s views=%s duration_s=%.3f",
            views_fp,
            list(spec.views),
            perf_counter() - start,
        )
        return cached

    views: dict[str, np.ndarray] = {}
    for name in spec.views:
        step_start = perf_counter()
        if name == "attr":
            views["attr"] = attr_view(dataset.X)
        elif name == "diffusion":
            views["diffusion"] = diffusion_view(
                X=np.asarray(dataset.X),
                n_nodes=int(dataset.graph.n_nodes),
                edge_index=np.asarray(dataset.graph.edge_index),
                edge_weight=(
                    np.asarray(dataset.graph.edge_weight)
                    if dataset.graph.edge_weight is not None
                    else None
                ),
                steps=int(spec.diffusion_steps),
                alpha=float(spec.diffusion_alpha),
            )
        elif name == "struct":
            sp = StructParams(
                method=spec.struct_method,
                dim=int(spec.struct_dim),
                walk_length=int(spec.walk_length),
                num_walks_per_node=int(spec.num_walks_per_node),
                window_size=int(spec.window_size),
                p=float(spec.p),
                q=float(spec.q),
            )
            views["struct"] = struct_embeddings(
                edge_index=dataset.graph.edge_index,
                n_nodes=int(dataset.graph.n_nodes),
                params=sp,
                seed=int(seed),
            )
        else:
            raise ValueError(f"Unknown view: {name!r}")
        logger.debug("Graph view built: name=%s duration_s=%.3f", name, perf_counter() - step_start)

    # validate output
    for k, v in views.items():
        validate_view_matrix(v, n_nodes=int(dataset.graph.n_nodes), name=k)

    out = DatasetViews(
        views=views,
        y=np.asarray(dataset.y),
        masks=dataset.masks,
        meta={
            "fingerprint": views_fp,
            "graph_fingerprint": graph_fp,
            "spec_fingerprint": fingerprint_dict(spec.to_dict()),
            "seed": int(seed),
        },
    )

    if cache_enabled:
        manifest = {
            "fingerprint": views_fp,
            "graph_fingerprint": graph_fp,
            "spec": spec.to_dict(),
            "spec_fingerprint": out.meta.get("spec_fingerprint"),
            "seed": int(seed),
        }
        cache_store.save(fingerprint=views_fp, views=out, manifest=manifest)

    logger.info(
        "Graph views done: fingerprint=%s views=%s duration_s=%.3f",
        views_fp,
        list(spec.views),
        perf_counter() - start,
    )
    return out

Sources
  1. src/modssc/graph/construction/api.py
  2. src/modssc/graph/featurization/api.py
  3. src/modssc/graph/specs.py
  4. src/modssc/graph/artifacts.py