24. Graph API

This page documents the graph API. For workflows, see Graph how-to.

24.1 What it is for

The graph brick constructs similarity graphs and derives graph-based feature views. ^[1][2]

24.2 Examples

Build a kNN graph:

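import numpy as np
from modssc.graph import GraphBuilderSpec, build_graph

# 20 samples with 8 features each
X = np.random.randn(20, 8).astype(np.float32)
# kNN graph with cosine distance and 3 neighbors per node
spec = GraphBuilderSpec(scheme="knn", metric="cosine", k=3)
# cache=False skips the on-disk graph cache
G = build_graph(X, spec=spec, seed=0, cache=False)
print(G.n_nodes, G.n_edges)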

Generate graph views:

from modssc.graph import GraphFeaturizerSpec, graph_to_views
from modssc.graph.artifacts import NodeDataset

# Reuses np, X, and G from the kNN example above
node_ds = NodeDataset(X=X, y=np.zeros((20,), dtype=np.int64), graph=G, masks={})
fspec = GraphFeaturizerSpec(views=("attr", "diffusion"))
views = graph_to_views(node_ds, spec=fspec, seed=0, cache=False)
print(list(views.views.keys()))

Specs and artifacts are defined in src/modssc/graph/specs.py and src/modssc/graph/artifacts.py. ^[3][4]

24.3 API reference

Graph utilities for ModSSC.

This package provides:

Graph construction: build a similarity graph from feature vectors (kNN, epsilon-ball, and anchor graphs).
Graph featurization: derive tabular views from a graph (attribute, diffusion, and structural embeddings).
Cache and fingerprints for reproducibility.

The graph representation is backend-agnostic. Optional backends exist (sklearn, faiss).
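
To make the graph construction step concrete, here is a minimal brute-force kNN sketch in pure Python. It is not the library's implementation: the helper names `cosine_distance` and `knn_edges` are hypothetical, and treating the first row of `edge_index` as source nodes and the second as target nodes is an assumption (the reference only states the shape `(2, E)`).

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def knn_edges(X, k):
    """Brute-force kNN: return (sources, targets, distances), k edges per node."""
    src, dst, dist = [], [], []
    for i, xi in enumerate(X):
        # Distance from node i to every other node (self excluded).
        cands = [(cosine_distance(xi, xj), j) for j, xj in enumerate(X) if j != i]
        for d, j in sorted(cands)[:k]:
            src.append(i)
            dst.append(j)
            dist.append(d)
    return src, dst, dist

X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
src, dst, dist = knn_edges(X, k=1)
edge_index = [src, dst]  # two rows, one column per edge: shape (2, E)
print(edge_index)  # [[0, 1, 2, 3], [1, 0, 3, 2]]
```

The quadratic loop is only suitable for tiny inputs; the optional sklearn and faiss backends exist for larger ones.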

24.4 `DatasetViews` `dataclass`

One or more tabular views derived from a dataset.

Source code in src/modssc/graph/artifacts.py

@dataclass(frozen=True)
class DatasetViews:
    """One or more tabular views derived from a dataset."""

    views: dict[str, Any]
    y: np.ndarray
    masks: dict[str, np.ndarray] = field(default_factory=dict)
    meta: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        y = _as_int64(self.y)
        object.__setattr__(self, "y", y)

        if not self.views:
            raise GraphValidationError("views cannot be empty")

        # validate consistent first dimension
        n: int | None = None
        for name, v in self.views.items():
            if not hasattr(v, "shape"):
                raise GraphValidationError(f"view {name!r} must expose a shape attribute")
            if int(v.ndim) != 2:
                raise GraphValidationError(f"view {name!r} must be 2D")
            if n is None:
                n = int(v.shape[0])
            elif int(v.shape[0]) != n:
                raise GraphValidationError("All views must have the same number of samples")

        assert n is not None
        if y.shape[0] != n:
            raise GraphValidationError("y must have the same first dimension as views")

        new_masks: dict[str, np.ndarray] = {}
        for k, v in self.masks.items():
            m = np.asarray(v, dtype=bool)
            if m.ndim != 1 or m.shape[0] != n:
                raise GraphValidationError(f"Mask {k!r} must have shape (n,)")
            new_masks[str(k)] = m
        object.__setattr__(self, "masks", new_masks)

24.5 `GraphArtifact` `dataclass`

Canonical graph representation.

24.5.0.1 Notes

For reproducible experiments, graph construction should be fingerprinted and cached (see `modssc.graph.build_graph`).

Source code in src/modssc/graph/artifacts.py

@dataclass(frozen=True)
class GraphArtifact:
    """Canonical graph representation.

    Notes
    -----
    For reproducible experiments, graph construction should be fingerprinted and cached
    (see :func:`modssc.graph.build_graph`).
    """

    n_nodes: int
    edge_index: np.ndarray
    edge_weight: np.ndarray | None = None

    directed: bool = True
    meta: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        ei = _as_int64(self.edge_index)
        if ei.ndim != 2 or ei.shape[0] != 2:
            raise GraphValidationError("edge_index must have shape (2, E)")
        if ei.size and (ei.min() < 0 or ei.max() >= self.n_nodes):
            raise GraphValidationError("edge_index contains node ids outside [0, n_nodes)")
        object.__setattr__(self, "edge_index", ei)

        if self.edge_weight is not None:
            ew = _as_float32(self.edge_weight)
            if ew.ndim != 1 or ew.shape[0] != ei.shape[1]:
                raise GraphValidationError("edge_weight must have shape (E,)")
            object.__setattr__(self, "edge_weight", ew)

    @property
    def n_edges(self) -> int:
        return int(self.edge_index.shape[1])

    def to_dict(self) -> dict[str, Any]:
        return {
            "n_nodes": int(self.n_nodes),
            "n_edges": int(self.n_edges),
            "directed": bool(self.directed),
            "has_edge_weight": self.edge_weight is not None,
            "meta": dict(self.meta),
        }

24.6 `GraphBuilderSpec` `dataclass`

Graph construction specification.

24.6.0.1 Notes

This spec is designed to be serializable (via `to_dict`) and stable, so that it can be fingerprinted for reproducibility.

Adds:

- anchor scheme (approximate kNN via anchors)
- faiss backend (optional)
- chunk_size knob (for chunked numpy computations and resumable work dirs)
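
The note about serializable, stable specs can be illustrated with a small fingerprinting sketch. This is an assumption about how such a fingerprint could be computed, not the library's `fingerprint_dict`, whose implementation is not shown on this page; the name `fingerprint_dict_sketch` is hypothetical.

```python
import hashlib
import json

def fingerprint_dict_sketch(d):
    """Hash a JSON-serializable dict; key order does not affect the result."""
    # sort_keys gives a canonical ordering, so equal dicts hash identically.
    payload = json.dumps(d, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

spec_a = {"scheme": "knn", "metric": "cosine", "k": 3}
spec_b = {"k": 3, "metric": "cosine", "scheme": "knn"}  # same content, other order
spec_c = {"scheme": "knn", "metric": "cosine", "k": 5}  # different k

print(fingerprint_dict_sketch(spec_a) == fingerprint_dict_sketch(spec_b))  # True
print(fingerprint_dict_sketch(spec_a) == fingerprint_dict_sketch(spec_c))  # False
```

A stable `to_dict` output matters here: any change in the serialized fields changes the hash and therefore the cache entry that gets looked up.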

Source code in src/modssc/graph/specs.py

@dataclass(frozen=True)
class GraphBuilderSpec:
    """Graph construction specification.

    Notes
    -----
    This spec is designed to be serializable (via :meth:`to_dict`) and stable,
    so that it can be fingerprinted for reproducibility.

    Adds:
    - anchor scheme (approximate kNN via anchors)
    - faiss backend (optional)
    - chunk_size knob (for chunked numpy computations and resumable work dirs)
    """

    # main knobs
    scheme: Scheme = "knn"
    metric: Metric = "cosine"

    # scheme parameters
    k: int | None = 30
    radius: float | None = None  # epsilon

    # post-processing
    symmetrize: Symmetrize = "mutual"
    weights: GraphWeightsSpec = GraphWeightsSpec("heat", sigma=0.5)
    normalize: Normalize = "rw"
    self_loops: bool = True

    # backend selection
    backend: Backend = "auto"
    chunk_size: int = 512

    # where to read features from (when using higher-level orchestration)
    feature_field: str = "features.X"

    # anchor scheme
    n_anchors: int | None = None
    anchors_k: int = 5
    anchors_method: AnchorMethod = "random"
    candidate_limit: int = 1000

    # faiss backend (optional dependency)
    faiss_exact: bool = False
    faiss_hnsw_m: int = 32
    faiss_ef_search: int = 64
    faiss_ef_construction: int = 200

    def validate(self) -> None:
        if self.metric not in ("cosine", "euclidean"):
            raise GraphValidationError(f"Unknown metric: {self.metric!r}")

        if self.scheme == "knn":
            if self.k is None or int(self.k) <= 0:
                raise GraphValidationError("k must be a positive integer for knn scheme")
        elif self.scheme == "epsilon":
            if self.radius is None or float(self.radius) <= 0:
                raise GraphValidationError("radius must be > 0 for epsilon scheme")
        elif self.scheme == "anchor":
            if self.k is None or int(self.k) <= 0:
                raise GraphValidationError(
                    "k must be a positive integer for anchor scheme (final neighbors)"
                )
            if int(self.anchors_k) <= 0:
                raise GraphValidationError("anchors_k must be a positive integer")
            if self.n_anchors is not None and int(self.n_anchors) <= 0:
                raise GraphValidationError("n_anchors must be a positive integer when provided")
            if int(self.candidate_limit) <= 0:
                raise GraphValidationError("candidate_limit must be > 0")
            if self.anchors_method not in ("random", "kmeans"):
                raise GraphValidationError(f"Unknown anchors_method: {self.anchors_method!r}")
        else:
            raise GraphValidationError(f"Unknown scheme: {self.scheme!r}")

        if self.symmetrize not in ("none", "or", "mutual"):
            raise GraphValidationError(f"Unknown symmetrize mode: {self.symmetrize!r}")
        if self.normalize not in ("none", "rw", "sym"):
            raise GraphValidationError(f"Unknown normalize mode: {self.normalize!r}")

        if self.backend not in ("auto", "numpy", "sklearn", "faiss"):
            raise GraphValidationError(f"Unknown backend: {self.backend!r}")

        if int(self.chunk_size) <= 0:
            raise GraphValidationError("chunk_size must be > 0")

        # backend-specific constraints
        if self.backend == "faiss" and self.scheme == "epsilon":
            raise GraphValidationError("faiss backend does not support epsilon scheme")

        if int(self.faiss_hnsw_m) <= 0:
            raise GraphValidationError("faiss_hnsw_m must be > 0")
        if int(self.faiss_ef_search) <= 0:
            raise GraphValidationError("faiss_ef_search must be > 0")
        if int(self.faiss_ef_construction) <= 0:
            raise GraphValidationError("faiss_ef_construction must be > 0")

        self.weights.validate(metric=self.metric)

    def to_dict(self) -> dict[str, Any]:
        return {
            "scheme": self.scheme,
            "metric": self.metric,
            "k": self.k,
            "radius": self.radius,
            "symmetrize": self.symmetrize,
            "weights": self.weights.to_dict(),
            "normalize": self.normalize,
            "self_loops": self.self_loops,
            "backend": self.backend,
            "chunk_size": int(self.chunk_size),
            "feature_field": self.feature_field,
            "n_anchors": self.n_anchors,
            "anchors_k": int(self.anchors_k),
            "anchors_method": self.anchors_method,
            "candidate_limit": int(self.candidate_limit),
            "faiss_exact": bool(self.faiss_exact),
            "faiss_hnsw_m": int(self.faiss_hnsw_m),
            "faiss_ef_search": int(self.faiss_ef_search),
            "faiss_ef_construction": int(self.faiss_ef_construction),
        }

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> GraphBuilderSpec:
        # keep backward compatibility: missing keys fall back to legacy defaults
        return cls(
            scheme=str(d.get("scheme", "knn")),  # type: ignore[arg-type]
            metric=str(d.get("metric", "cosine")),  # type: ignore[arg-type]
            k=d.get("k"),
            radius=d.get("radius"),
            symmetrize=str(d.get("symmetrize", "mutual")),  # type: ignore[arg-type]
            weights=GraphWeightsSpec.from_dict(dict(d.get("weights", {}))),
            normalize=str(d.get("normalize", "rw")),  # type: ignore[arg-type]
            self_loops=bool(d.get("self_loops", True)),
            backend=str(d.get("backend", "auto")),  # type: ignore[arg-type]
            chunk_size=int(d.get("chunk_size", 512)),
            feature_field=str(d.get("feature_field", "features.X")),
            n_anchors=d.get("n_anchors"),
            anchors_k=int(d.get("anchors_k", 5)),
            anchors_method=str(d.get("anchors_method", "random")),  # type: ignore[arg-type]
            candidate_limit=int(d.get("candidate_limit", 1000)),
            faiss_exact=bool(d.get("faiss_exact", False)),
            faiss_hnsw_m=int(d.get("faiss_hnsw_m", 32)),
            faiss_ef_search=int(d.get("faiss_ef_search", 64)),
            faiss_ef_construction=int(d.get("faiss_ef_construction", 200)),
        )

24.7 `GraphFeaturizerSpec` `dataclass`

Featurization spec to produce inductive views from a graph.

24.7.0.1 Views

attr: returns the original attribute matrix X.
diffusion: returns a simple diffusion of attributes over the graph.
struct: returns structural embeddings (DeepWalk/Node2Vec-style) computed from the graph only (X is ignored).

24.7.0.2 Notes

The struct view is deterministic given the seed.
For large graphs, struct view may require optional dependencies.
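
As a rough illustration of what a diffusion view can look like, the sketch below implements one common form, propagation with restart: `H <- (1 - alpha) * A @ H + alpha * X`, repeated `steps` times. This formula is an assumption chosen for illustration; the page does not show `diffusion_view`, so the library may compute something different. The function name `diffuse` and the edge direction convention are hypothetical.

```python
def diffuse(X, edges, weights, steps, alpha):
    """Propagate features with restart: H <- (1 - alpha) * A @ H + alpha * X."""
    n, d = len(X), len(X[0])
    H = [row[:] for row in X]  # steps=0 returns a copy of X
    for _ in range(steps):
        AH = [[0.0] * d for _ in range(n)]
        for (i, j), w in zip(edges, weights):
            # Assumed convention: edge (i, j) with weight w sends features of j to i.
            for c in range(d):
                AH[i][c] += w * H[j][c]
        H = [
            [(1.0 - alpha) * AH[i][c] + alpha * X[i][c] for c in range(d)]
            for i in range(n)
        ]
    return H

# Two nodes, one feature, row-normalized weights (each node's weights sum to 1).
X = [[1.0], [0.0]]
edges = [(0, 0), (0, 1), (1, 0), (1, 1)]
weights = [0.5, 0.5, 0.5, 0.5]
H = diffuse(X, edges, weights, steps=1, alpha=0.1)
print(H)  # node 0 keeps more of its own signal than node 1 receives
```

In this form, `alpha=1` returns the original attributes unchanged and `alpha=0` is pure neighborhood averaging, which matches the `[0, 1]` range that `validate` enforces for `diffusion_alpha`.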

Source code in src/modssc/graph/specs.py

@dataclass(frozen=True)
class GraphFeaturizerSpec:
    """Featurization spec to produce inductive views from a graph.

    Views
    -----
    attr:
        returns the original attribute matrix X
    diffusion:
        returns a simple diffusion of attributes over the graph
    struct:
        returns structural embeddings (DeepWalk/Node2Vec-style) computed from the graph
        only (X is ignored).

    Notes
    -----
    - The struct view is deterministic given the seed.
    - For large graphs, struct view may require optional dependencies.
    """

    views: tuple[ViewName, ...] = ("attr",)

    # diffusion
    diffusion_steps: int = 5
    diffusion_alpha: float = 0.1

    # struct
    struct_method: StructMethod = "deepwalk"
    struct_dim: int = 64
    walk_length: int = 40
    num_walks_per_node: int = 10
    window_size: int = 5
    p: float = 1.0
    q: float = 1.0

    cache: bool = True

    def validate(self) -> None:
        if self.diffusion_steps < 0:
            raise GraphValidationError("diffusion_steps must be >= 0")
        if not (0.0 <= float(self.diffusion_alpha) <= 1.0):
            raise GraphValidationError("diffusion_alpha must be in [0, 1]")

        if not self.views:
            raise GraphValidationError("views cannot be empty")
        for v in self.views:
            if v not in ("attr", "diffusion", "struct"):
                raise GraphValidationError(f"Unknown view: {v!r}")

        if self.struct_method not in ("deepwalk", "node2vec"):
            raise GraphValidationError(f"Unknown struct_method: {self.struct_method!r}")
        if int(self.struct_dim) <= 0:
            raise GraphValidationError("struct_dim must be > 0")
        if int(self.walk_length) <= 1:
            raise GraphValidationError("walk_length must be > 1")
        if int(self.num_walks_per_node) <= 0:
            raise GraphValidationError("num_walks_per_node must be > 0")
        if int(self.window_size) <= 0:
            raise GraphValidationError("window_size must be > 0")
        if float(self.p) <= 0:
            raise GraphValidationError("p must be > 0")
        if float(self.q) <= 0:
            raise GraphValidationError("q must be > 0")

    def to_dict(self) -> dict[str, Any]:
        return {
            "views": list(self.views),
            "diffusion_steps": int(self.diffusion_steps),
            "diffusion_alpha": float(self.diffusion_alpha),
            "struct_method": self.struct_method,
            "struct_dim": int(self.struct_dim),
            "walk_length": int(self.walk_length),
            "num_walks_per_node": int(self.num_walks_per_node),
            "window_size": int(self.window_size),
            "p": float(self.p),
            "q": float(self.q),
            "cache": bool(self.cache),
        }

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> GraphFeaturizerSpec:
        views = tuple(d.get("views", ["attr"]))
        return cls(
            views=views,  # type: ignore[arg-type]
            diffusion_steps=int(d.get("diffusion_steps", 5)),
            diffusion_alpha=float(d.get("diffusion_alpha", 0.1)),
            struct_method=str(d.get("struct_method", "deepwalk")),  # type: ignore[arg-type]
            struct_dim=int(d.get("struct_dim", 64)),
            walk_length=int(d.get("walk_length", 40)),
            num_walks_per_node=int(d.get("num_walks_per_node", 10)),
            window_size=int(d.get("window_size", 5)),
            p=float(d.get("p", 1.0)),
            q=float(d.get("q", 1.0)),
            cache=bool(d.get("cache", True)),
        )

24.8 `GraphWeightsSpec` `dataclass`

Specification for edge weights.

24.8.0.1 Parameters

kind:

- "binary": all edges weight 1
- "heat": exp(-d^2/(2*sigma^2))
- "cosine": convert cosine distances into similarities (1 - d)

sigma: Used only for kind="heat".
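
The three weight kinds can be checked by hand with a small sketch that applies the formulas above to a single distance `d`. The function `edge_weight` is a hypothetical helper written for this page, not the library's `compute_edge_weights`.

```python
import math

def edge_weight(distance, kind="binary", sigma=None):
    """Map one edge distance to a weight using the documented formulas."""
    if kind == "binary":
        return 1.0
    if kind == "heat":
        if sigma is None or sigma <= 0:
            raise ValueError("sigma must be > 0 for heat weights")
        return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))
    if kind == "cosine":
        # Only meaningful when the distance is a cosine distance.
        return 1.0 - distance
    raise ValueError(f"Unknown weight kind: {kind!r}")

for d in (0.0, 0.5, 1.0):
    print(d, edge_weight(d, "binary"), edge_weight(d, "heat", sigma=0.5), edge_weight(d, "cosine"))
```

With `sigma=0.5`, a distance of 0.5 gives a heat weight of `exp(-0.5)`, roughly 0.61, so smaller `sigma` values make weights fall off faster with distance.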

Source code in src/modssc/graph/specs.py

@dataclass(frozen=True)
class GraphWeightsSpec:
    """Specification for edge weights.

    Parameters
    ----------
    kind:
        - "binary": all edges weight 1
        - "heat": exp(-d^2/(2*sigma^2))
        - "cosine": convert cosine distances into similarities (1 - d)
    sigma:
        Used only for kind="heat".
    """

    kind: WeightKind = "binary"
    sigma: float | None = None

    def validate(self, *, metric: Metric) -> None:
        if self.kind not in ("binary", "heat", "cosine"):
            raise GraphValidationError(f"Unknown weight kind: {self.kind!r}")
        if self.kind == "heat":
            sigma = float(self.sigma or 0.0)
            if sigma <= 0:
                raise GraphValidationError("sigma must be > 0 for heat weights")
        if self.kind == "cosine" and metric != "cosine":
            raise GraphValidationError("cosine weights require metric='cosine'")

    def to_dict(self) -> dict[str, Any]:
        return {"kind": self.kind, "sigma": self.sigma}

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> GraphWeightsSpec:
        return cls(kind=str(d.get("kind", "binary")), sigma=d.get("sigma"))

24.9 `NodeDataset` `dataclass`

Node classification dataset for transductive methods.
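
Since `NodeDataset` converts each mask with `np.asarray(v, dtype=bool)` and requires shape `(n_nodes,)`, plain lists of booleans are enough to describe a split. The sketch below builds disjoint masks in pure Python; the helper `make_masks`, the mask names `train`/`val`/`test`, and the split fractions are illustrative assumptions, not part of the library.

```python
import random

def make_masks(n_nodes, train_frac=0.6, val_frac=0.2, seed=0):
    """Return disjoint boolean masks of length n_nodes that cover every node."""
    order = list(range(n_nodes))
    random.Random(seed).shuffle(order)  # seeded for reproducibility
    n_train = int(train_frac * n_nodes)
    n_val = int(val_frac * n_nodes)
    split = {
        "train": order[:n_train],
        "val": order[n_train:n_train + n_val],
        "test": order[n_train + n_val:],  # whatever remains
    }
    masks = {}
    for name, idx in split.items():
        mask = [False] * n_nodes
        for i in idx:
            mask[i] = True
        masks[name] = mask
    return masks

masks = make_masks(10)
print({name: sum(m) for name, m in masks.items()})
```

A dict like this could then be passed as `masks=` when constructing a `NodeDataset` whose graph has 10 nodes.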

Source code in src/modssc/graph/artifacts.py

@dataclass(frozen=True)
class NodeDataset:
    """Node classification dataset for transductive methods."""

    X: Any
    y: np.ndarray
    graph: GraphArtifact
    masks: dict[str, np.ndarray] = field(default_factory=dict)
    meta: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Validate X first dimension (works for numpy and scipy sparse)
        if not hasattr(self.X, "shape"):
            raise GraphValidationError("X must expose a shape attribute")
        if int(self.X.shape[0]) != int(self.graph.n_nodes):
            raise GraphValidationError("X must have shape (n_nodes, d)")

        y = _as_int64(self.y)
        if y.ndim not in (1, 2):
            raise GraphValidationError("y must have shape (n,) or (n, C)")
        if y.shape[0] != self.graph.n_nodes:
            raise GraphValidationError("y must have the same first dimension as graph.n_nodes")
        object.__setattr__(self, "y", y)

        new_masks: dict[str, np.ndarray] = {}
        for k, v in self.masks.items():
            m = np.asarray(v, dtype=bool)
            if m.ndim != 1 or m.shape[0] != self.graph.n_nodes:
                raise GraphValidationError(f"Mask {k!r} must have shape (n_nodes,)")
            new_masks[str(k)] = m
        object.__setattr__(self, "masks", new_masks)

24.10 `build_graph(X, *, spec, seed=0, dataset_fingerprint=None, preprocess_fingerprint=None, cache=True, cache_dir=None, edge_shard_size=None, resume=True)`

Build a graph from a dense feature matrix.

24.10.0.1 Parameters

X: A 2D dense array-like of shape (n_nodes, n_features).
spec: GraphBuilderSpec controlling scheme/backend/weights/normalization.
seed: Seed used for deterministic components (notably the anchor scheme).
dataset_fingerprint: Optional precomputed fingerprint for X (useful when X is already cached upstream).
preprocess_fingerprint: Optional fingerprint of the preprocessing pipeline.
cache: Whether to cache the built graph on disk.
cache_dir: Override the default cache directory.
edge_shard_size: If provided, store the edge arrays in sharded .npz files with at most this many edges per shard.
resume: If True and cache=True, partial numpy chunk computations are resumed from the cache entry work directory when available.

24.10.0.2 Returns

GraphArtifact
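
According to the source listing for `build_graph`, raw edges are post-processed in a fixed order: distances become weights, edges are symmetrized, self-loops are added, and weights are normalized. The sketch below illustrates two of these steps on tiny inputs. It assumes the conventional meanings of the mode names ("or" keeps an edge if either direction exists, "mutual" only if both do, "rw" divides each weight by the total weight of its source node); the library's `symmetrize_edges` and `normalize_edge_weights` are not shown here and may differ in detail. The helper names are hypothetical.

```python
def symmetrize(edges, mode):
    """edges: set of directed (i, j) pairs. Return a symmetric set of pairs."""
    reverse = {(j, i) for i, j in edges}
    if mode == "or":
        return edges | reverse  # keep an edge if either direction exists
    if mode == "mutual":
        return edges & reverse  # keep an edge only if both directions exist
    raise ValueError(f"Unknown symmetrize mode: {mode!r}")

def normalize_rw(weights):
    """weights: dict (i, j) -> w. Divide each weight by its source node's total."""
    totals = {}
    for (i, _), w in weights.items():
        totals[i] = totals.get(i, 0.0) + w
    return {(i, j): w / totals[i] for (i, j), w in weights.items()}

directed = {(0, 1), (1, 0), (1, 2)}
print(sorted(symmetrize(directed, "or")))      # [(0, 1), (1, 0), (1, 2), (2, 1)]
print(sorted(symmetrize(directed, "mutual")))  # [(0, 1), (1, 0)]
print(normalize_rw({(0, 1): 2.0, (0, 2): 2.0, (1, 0): 5.0}))
```

Under these assumptions, "mutual" yields a sparser graph than "or", which is worth keeping in mind when small `k` values leave some nodes with few or no neighbors.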

Source code in src/modssc/graph/construction/api.py

def build_graph(
    X: Any,
    *,
    spec: GraphBuilderSpec,
    seed: int = 0,
    dataset_fingerprint: str | None = None,
    preprocess_fingerprint: str | None = None,
    cache: bool = True,
    cache_dir: str | Path | None = None,
    edge_shard_size: int | None = None,
    resume: bool = True,
) -> GraphArtifact:
    """Build a graph from a dense feature matrix.

    Parameters
    ----------
    X:
        A 2D dense array-like of shape (n_nodes, n_features).
    spec:
        GraphBuilderSpec controlling scheme/backend/weights/normalization.
    seed:
        Seed used for deterministic components (notably the anchor scheme).
    dataset_fingerprint:
        Optional precomputed fingerprint for X (useful when X is already cached upstream).
    preprocess_fingerprint:
        Optional fingerprint of the preprocessing pipeline.
    cache:
        Whether to cache the built graph on disk.
    cache_dir:
        Override the default cache directory.
    edge_shard_size:
        If provided, store the edge arrays in sharded `.npz` files with at most this many
        edges per shard.
    resume:
        If True and `cache=True`, partial numpy chunk computations are resumed from the
        cache entry work directory when available.

    Returns
    -------
    GraphArtifact
    """
    start = perf_counter()
    validate_features(X)
    validate_builder_spec(spec)

    X_arr = np.asarray(X)
    n_nodes = int(X_arr.shape[0])

    ds_fp = dataset_fingerprint or fingerprint_array(X_arr)
    spec_fp = fingerprint_dict(spec.to_dict())
    g_fp = _graph_fingerprint(
        dataset_fingerprint=ds_fp,
        preprocess_fingerprint=preprocess_fingerprint,
        spec=spec,
        seed=int(seed),
    )

    cache_store = GraphCache(
        root=Path(cache_dir) if cache_dir is not None else GraphCache.default().root,
        edge_shard_size=edge_shard_size,
    )

    if cache and cache_store.exists(g_fp):
        graph, _ = cache_store.load(g_fp)
        logger.info(
            "Graph cached: fingerprint=%s n_nodes=%s n_edges=%s duration_s=%.3f",
            g_fp,
            graph.n_nodes,
            int(graph.edge_index.shape[1]),
            perf_counter() - start,
        )
        return graph

    # Optional resumable work directory inside the cache entry (only used by numpy backend).
    work_dir: Path | None = None
    if cache and resume:
        work_dir = cache_store.entry_dir(g_fp) / "_work"
        work_dir.mkdir(parents=True, exist_ok=True)

    # Build raw edges + distances
    logger.info(
        "Graph build start: scheme=%s metric=%s backend=%s n_nodes=%s seed=%s",
        spec.scheme,
        spec.metric,
        spec.backend,
        n_nodes,
        seed,
    )
    if spec.scheme == "knn" and spec.k is not None and int(spec.k) <= 1:
        logger.warning("Graph spec k is very small: k=%s", spec.k)
    if spec.scheme == "epsilon" and spec.radius is not None and float(spec.radius) <= 0:
        logger.warning("Graph spec radius is non-positive: radius=%s", spec.radius)

    edge_index, distances = build_raw_edges(
        X_arr,
        spec=spec,
        seed=int(seed),
        work_dir=work_dir,
        resume=bool(resume),
    )

    # Turn distances into weights
    edge_weight = compute_edge_weights(
        distances=distances, weights=spec.weights, metric=spec.metric
    )

    # Post-process graph
    if spec.symmetrize != "none":
        edge_index, edge_weight = symmetrize_edges(
            n_nodes=n_nodes,
            edge_index=edge_index,
            edge_weight=edge_weight,
            mode=spec.symmetrize,
        )

    if spec.self_loops:
        edge_index, edge_weight = add_self_loops(
            n_nodes=n_nodes, edge_index=edge_index, edge_weight=edge_weight
        )

    if spec.normalize != "none":
        edge_weight = normalize_edge_weights(
            n_nodes=n_nodes, edge_index=edge_index, edge_weight=edge_weight, mode=spec.normalize
        )

    if edge_weight is not None and not np.isfinite(edge_weight).all():
        raise GraphValidationError(
            "Non-finite edge weights detected (check input features and spec)"
        )

    graph = GraphArtifact(
        n_nodes=n_nodes,
        edge_index=edge_index,
        edge_weight=edge_weight,
        directed=(spec.symmetrize == "none"),
        meta={
            "fingerprint": g_fp,
            "dataset_fingerprint": ds_fp,
            "preprocess_fingerprint": preprocess_fingerprint,
            "spec_fingerprint": spec_fp,
            "seed": int(seed),
        },
    )

    if cache:
        manifest = {
            "fingerprint": g_fp,
            "dataset_fingerprint": ds_fp,
            "preprocess_fingerprint": preprocess_fingerprint,
            "spec": spec.to_dict(),
            "spec_fingerprint": spec_fp,
            "seed": int(seed),
        }
        cache_store.save(fingerprint=g_fp, graph=graph, manifest=manifest, overwrite=True)

    duration = perf_counter() - start
    logger.info(
        "Graph build done: fingerprint=%s n_nodes=%s n_edges=%s duration_s=%.3f",
        g_fp,
        n_nodes,
        int(edge_index.shape[1]),
        duration,
    )
    if logger.isEnabledFor(logging.DEBUG) and edge_index.size and edge_index.shape[1] <= 5_000_000:
        min_deg, mean_deg, max_deg, zero_deg = _degree_summary(edge_index, n_nodes)
        logger.debug(
            "Graph degrees: min=%s mean=%.2f max=%s zero=%s",
            min_deg,
            mean_deg,
            max_deg,
            zero_deg,
        )
        if n_nodes and zero_deg / float(n_nodes) > 0.2:
            logger.warning("Graph has many isolated nodes: zero_degree=%s", zero_deg)

    return graph

24.11 `graph_to_views(dataset, *, spec, seed=0, cache=None, cache_dir=None)`

Compute one or more views from a (graph, X) dataset.

Source code in src/modssc/graph/featurization/api.py

def graph_to_views(
    dataset: NodeDataset,
    *,
    spec: GraphFeaturizerSpec,
    seed: int = 0,
    cache: bool | None = None,
    cache_dir: str | Path | None = None,
) -> DatasetViews:
    """Compute one or more views from a (graph, X) dataset."""
    start = perf_counter()
    validate_featurizer_spec(spec)

    graph_fp = str(dataset.graph.meta.get("fingerprint", "")) if dataset.graph.meta else ""
    if not graph_fp:
        # fallback: fingerprint of graph structure not available
        graph_fp = fingerprint_dict(
            {
                "n_nodes": int(dataset.graph.n_nodes),
                "edge_index": dataset.graph.edge_index[
                    :2, : min(1000, dataset.graph.edge_index.shape[1])
                ].tolist(),
            }
        )

    views_fp = _views_fingerprint(graph_fingerprint=graph_fp, spec=spec, seed=int(seed))

    cache_enabled = bool(spec.cache) if cache is None else bool(cache)
    cache_store = ViewsCache(
        root=Path(cache_dir) if cache_dir is not None else ViewsCache.default().root
    )

    if cache_enabled and cache_store.exists(views_fp):
        cached, _ = cache_store.load(views_fp, y=np.asarray(dataset.y), masks=dataset.masks)
        logger.info(
            "Graph views cached: fingerprint=%s views=%s duration_s=%.3f",
            views_fp,
            list(spec.views),
            perf_counter() - start,
        )
        return cached

    views: dict[str, np.ndarray] = {}
    for name in spec.views:
        step_start = perf_counter()
        if name == "attr":
            views["attr"] = attr_view(dataset.X)
        elif name == "diffusion":
            views["diffusion"] = diffusion_view(
                X=np.asarray(dataset.X),
                n_nodes=int(dataset.graph.n_nodes),
                edge_index=np.asarray(dataset.graph.edge_index),
                edge_weight=(
                    np.asarray(dataset.graph.edge_weight)
                    if dataset.graph.edge_weight is not None
                    else None
                ),
                steps=int(spec.diffusion_steps),
                alpha=float(spec.diffusion_alpha),
            )
        elif name == "struct":
            sp = StructParams(
                method=spec.struct_method,
                dim=int(spec.struct_dim),
                walk_length=int(spec.walk_length),
                num_walks_per_node=int(spec.num_walks_per_node),
                window_size=int(spec.window_size),
                p=float(spec.p),
                q=float(spec.q),
            )
            views["struct"] = struct_embeddings(
                edge_index=dataset.graph.edge_index,
                n_nodes=int(dataset.graph.n_nodes),
                params=sp,
                seed=int(seed),
            )
        else:
            raise ValueError(f"Unknown view: {name!r}")
        logger.debug("Graph view built: name=%s duration_s=%.3f", name, perf_counter() - step_start)

    # validate output
    for k, v in views.items():
        validate_view_matrix(v, n_nodes=int(dataset.graph.n_nodes), name=k)

    out = DatasetViews(
        views=views,
        y=np.asarray(dataset.y),
        masks=dataset.masks,
        meta={
            "fingerprint": views_fp,
            "graph_fingerprint": graph_fp,
            "spec_fingerprint": fingerprint_dict(spec.to_dict()),
            "seed": int(seed),
        },
    )

    if cache_enabled:
        manifest = {
            "fingerprint": views_fp,
            "graph_fingerprint": graph_fp,
            "spec": spec.to_dict(),
            "spec_fingerprint": out.meta.get("spec_fingerprint"),
            "seed": int(seed),
        }
        cache_store.save(fingerprint=views_fp, views=out, manifest=manifest)

    logger.info(
        "Graph views done: fingerprint=%s views=%s duration_s=%.3f",
        views_fp,
        list(spec.views),
        perf_counter() - start,
    )
    return out
