Captures $H_0$ — connected components
Also captures $H_1$ — loops & cycles
An extension of the UMAP dimensionality reduction algorithm that incorporates 2-simplices, improving topological fidelity on manifold-structured data.
Real high-dimensional data doesn't fill its ambient space uniformly; it concentrates near a low-dimensional manifold. A 64×64 image lives in $\mathbb{R}^{4096}$, but the set of natural images occupies a manifold of far smaller intrinsic dimension. Dimensionality reduction exploits this: if the true intrinsic dimension is $d \ll D$, we can represent the data faithfully in $\mathbb{R}^d$ while preserving its geometric and topological structure.
UMAP begins by assuming the data lies on a Riemannian manifold with an unknown metric. It estimates the local metric structure from the $k$-nearest neighbors of each point. For each point $x_i$, distances are normalized so that the nearest neighbor sits at normalized distance exactly 0 (and its edge therefore receives full membership weight 1):

$$d_i(x_i, x_j) = \max(0,\, d(x_i, x_j) - \rho_i)$$

where $\rho_i$ is the distance from $x_i$ to its nearest neighbor. This normalization puts every point on the same local scale: each "sees" its neighborhood at the same resolution, which is essential for handling datasets with varying densities.
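This local normalization can be sketched in a few lines of NumPy (the function name is illustrative, and the brute-force k-NN is an assumption for small data; real UMAP implementations use approximate nearest-neighbor search):

```python
import numpy as np

def local_normalize(X, k=5):
    """Per-point local metric normalization (brute-force sketch).

    Returns rho (nearest-neighbor distance per point) and the shifted
    distances max(0, d - rho_i) to each point's k nearest neighbors.
    """
    # Pairwise Euclidean distances; mask self-distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    knn = np.sort(D, axis=1)[:, :k]      # k smallest distances per row
    rho = knn[:, 0]                      # distance to nearest neighbor
    # After shifting, each nearest neighbor sits at distance 0,
    # so its membership weight exp(0) will be exactly 1.
    shifted = np.maximum(knn - rho[:, None], 0.0)
    return rho, shifted
```

Note the shifted distance to every point's nearest neighbor is exactly zero by construction, which is what guarantees local connectivity later.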
The local metrics are converted into a fuzzy simplicial set: a generalization of a simplicial complex (think: a graph, triangles, tetrahedra...) where membership has a probabilistic weight between 0 and 1. The weight of the directed edge from $x_i$ to $x_j$ is:

$$w(x_i, x_j) = \exp\left(-\frac{\max(0,\, d(x_i, x_j) - \rho_i)}{\sigma_i}\right)$$

where $\sigma_i$ is chosen so that the sum of weights over $x_i$'s $k$ neighbors equals $\log_2(k)$, enforcing a consistent notion of "$k$ neighbors." Multiple local views (one per point) are combined into a single global fuzzy set via a symmetric union:

$$w_{ij} = w(x_i, x_j) + w(x_j, x_i) - w(x_i, x_j)\,w(x_j, x_i)$$
This is the fuzzy set analogue of taking the union: P(A ∪ B) = P(A) + P(B) - P(A)P(B) for independent events.
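Both steps are short in code. A sketch, assuming the per-point shifted neighbor distances are already computed (the helper names `smooth_knn_sigma` and `fuzzy_union` are illustrative, not UMAP's API); the sum of weights is monotone increasing in $\sigma_i$, so a binary search suffices:

```python
import numpy as np

def smooth_knn_sigma(shifted, k, n_iter=64):
    """Binary-search sigma_i so that sum_j exp(-shifted_j / sigma_i) = log2(k).

    `shifted` holds one point's normalized neighbor distances
    max(0, d - rho_i); the weight sum grows monotonically with sigma.
    """
    target = np.log2(k)
    lo, hi = 1e-8, 1e6
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if np.exp(-shifted / mid).sum() > target:
            hi = mid        # weights too large -> shrink sigma
        else:
            lo = mid
    return 0.5 * (lo + hi)

def fuzzy_union(W):
    """Combine directed weights into a symmetric fuzzy set:
    w_ij = w(i,j) + w(j,i) - w(i,j) * w(j,i)."""
    return W + W.T - W * W.T
```

The union formula is applied elementwise to the full directed weight matrix, mirroring the probabilistic-union analogy above.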
UMAP finds a low-dimensional layout $Y \subset \mathbb{R}^d$ by minimizing the cross-entropy between the high-dimensional fuzzy set and a low-dimensional one whose edge weights are modeled as $v(y_i, y_j) = (1 + a\|y_i - y_j\|^{2b})^{-1}$:

$$C = \sum_{i \neq j} \left[ w_{ij} \log\frac{w_{ij}}{v_{ij}} + (1 - w_{ij}) \log\frac{1 - w_{ij}}{1 - v_{ij}} \right]$$

where $v_{ij} = v(y_i, y_j)$.
Minimized via stochastic gradient descent with negative sampling: attract connected points, repel disconnected ones. Unlike t-SNE's KL divergence, the cross-entropy treats attraction and repulsion symmetrically — which is why UMAP better preserves global structure.
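For intuition, the loss can be evaluated densely on a small layout. A sketch, with the caveats that the $a$, $b$ values below only approximate UMAP's curve-fit defaults for `min_dist=0.1`, and that in practice the loss is never formed densely but optimized through sampled attractive and repulsive updates:

```python
import numpy as np

def fuzzy_cross_entropy(w_high, Y, a=1.577, b=0.895, eps=1e-12):
    """Dense fuzzy-set cross-entropy between high-D weights and layout Y.

    a, b approximate UMAP's fitted defaults (an assumption here);
    eps guards the logarithms when w or v hits 0 or 1.
    """
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    v = 1.0 / (1.0 + a * d2 ** b)           # low-D membership strengths
    iu = np.triu_indices(len(Y), k=1)       # count each pair once
    w, v = w_high[iu], v[iu]
    attract = w * np.log((w + eps) / (v + eps))
    repel = (1.0 - w) * np.log((1.0 - w + eps) / (1.0 - v + eps))
    return float(np.sum(attract + repel))
```

Each pair contributes a Bernoulli KL divergence, which is zero exactly when the layout's membership strength matches the high-dimensional one; the attractive term pulls connected points together and the repulsive term pushes unconnected ones apart.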
A 0-simplex is a point: one data observation, with membership strength 1. The local metric normalization (subtracting the nearest-neighbor distance $\rho_i$) ensures every 0-simplex is the face of at least one 1-simplex with weight 1, guaranteeing local connectivity throughout the complex.
A 1-simplex is a weighted edge between two points. Its weight $w(x_i, x_j) = \exp({-\max(0,\, d - \rho_i)}/{\sigma_i})$ encodes how strongly $x_j$ belongs to $x_i$'s neighborhood. These weighted edges form the fuzzy 1-skeleton — the graph UMAP optimizes over. Standard UMAP stops here, discarding higher-order simplices for tractability. The cost: $H_0$ (cluster structure) is encoded directly, but $H_1$ (loops, holes) can only be inferred — the graph cannot distinguish a genuine hollow loop from a dense cluster whose triangles would fill it in. That gap is what this thesis addresses.
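The whole 1-skeleton construction fits in one routine. A brute-force end-to-end sketch (the function name is illustrative; production UMAP uses approximate k-NN rather than a full distance matrix):

```python
import numpy as np

def fuzzy_1_skeleton(X, k=5):
    """Build UMAP's weighted graph: directed weights
    w(x_i, x_j) = exp(-max(0, d - rho_i) / sigma_i),
    then symmetrized by the fuzzy union. Sketch for small data."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    idx = np.argsort(D, axis=1)[:, :k]            # k nearest neighbors
    W = np.zeros((n, n))
    for i in range(n):
        d = D[i, idx[i]]                          # sorted ascending
        shifted = np.maximum(d - d[0], 0.0)       # d[0] is rho_i
        lo, hi = 1e-8, 1e6                        # binary-search sigma_i
        for _ in range(64):
            sig = 0.5 * (lo + hi)
            if np.exp(-shifted / sig).sum() > np.log2(k):
                hi = sig
            else:
                lo = sig
        W[i, idx[i]] = np.exp(-shifted / sig)
    return W + W.T - W * W.T                      # symmetric fuzzy union
```

On the resulting graph every point's nearest-neighbor edge has weight exactly 1 (local connectivity) and all weights lie in [0, 1]; but because no 2-simplices are recorded, nothing in this structure says whether a cycle of strong edges bounds a genuine hole or should be filled in.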