M.A. Thesis · UC Santa Cruz · 2025
UMAP: Uniform Manifold
Approximation & Projection

An extension of the UMAP dimensionality reduction algorithm that incorporates 2-simplices, improving topological fidelity on manifold-structured data.

Thesis Advisor Marty Weissman · UC Santa Cruz

1-Simplex vs 2-Simplex Construction

Standard UMAP · Thesis Extension
Standard UMAP — 1-Skeleton (edges only)
Thesis Extension — 2-Skeleton (triangles)
1-Skeleton · Fuzzy Graph
0-simplices (vertices) + 1-simplices (edges)
Captures $H_0$ — connected components
2-Skeleton · Fuzzy Simplicial Complex
+ 2-simplices (filled triangles)
Also captures $H_1$ — loops & cycles
Dataset
k-neighbors
Drag to rotate · hover a point to highlight its neighborhood

Mathematical Framework

Topology · Manifold Learning
01
The Manifold Hypothesis

Real high-dimensional data doesn't fill its ambient space uniformly — it concentrates near a low-dimensional manifold. A 64×64 image lives in ℝ4096, but the set of natural images occupies a manifold of far smaller intrinsic dimension. Dimensionality reduction exploits this: if the true intrinsic dimension is d ≪ D, we can represent the data faithfully in ℝd while preserving its geometric and topological structure.

02
Riemannian Geometry — Local Metric Estimation

UMAP begins by assuming the data lies on a Riemannian manifold with an unknown metric. It estimates the local metric structure from the k-nearest neighbors of each point. For each point xi, the metric is normalized so that the distance to the nearest neighbor is exactly 1:

$$d_i(x_i, x_j) \;=\; \|x_i - x_j\| - \rho_i$$

where ρi is the distance to the nearest neighbor. This normalization makes the metric locally constant — every point "sees" its neighborhood at the same scale — which is essential for handling datasets with varying densities.

03
Fuzzy Simplicial Sets — The Topological Representation

The local metrics are converted into a fuzzy simplicial set — a generalization of a simplicial complex (think: a graph, triangles, tetrahedra...) where membership has a probabilistic weight between 0 and 1. The weight of the edge between points xi and xj is:

$$w(x_i,\, x_j) \;=\; \exp\!\left(\frac{-(d_i(x_i, x_j) - \rho_i)}{\sigma_i}\right)$$

where σi is chosen so the sum of weights equals log2(k) — enforcing a consistent notion of "k neighbors." Multiple local views (one per point) are combined into a single global fuzzy set via a symmetric union:

$$w_{\mathrm{sym}}(i,j) \;=\; w(i,j) + w(j,i) - w(i,j)\cdot w(j,i)$$

This is the fuzzy set analogue of taking the union: P(A ∪ B) = P(A) + P(B) - P(A)P(B) for independent events.

04
Low-Dimensional Embedding via Cross-Entropy Optimization

UMAP finds a low-dimensional layout Y ⊂ ℝd by minimizing the cross-entropy between the high-dimensional fuzzy set and a low-dimensional one whose edge weights are modeled as $v(y_i, y_j) = (1 + a\|y_i - y_j\|^{2b})^{-1}$:

$$\mathcal{C} \;=\; \sum_{i,j}\Bigl[w\log\frac{w}{v} + (1-w)\log\frac{1-w}{1-v}\Bigr]$$

Minimized via stochastic gradient descent with negative sampling: attract connected points, repel disconnected ones. Unlike t-SNE's KL divergence, the cross-entropy treats attraction and repulsion symmetrically — which is why UMAP better preserves global structure.

05
The Role of 0- and 1-Simplices

A 0-simplex is a point — one data observation, with membership strength 1. The local metric normalization (subtracting the nearest-neighbor distance ρi) ensures every 0-simplex is the face of at least one 1-simplex with weight 1, guaranteeing local connectivity throughout the complex.

A 1-simplex is a weighted edge between two points. Its weight $w(x_i, x_j) = \exp({-\max(0,\, d - \rho_i)}/{\sigma_i})$ encodes how strongly $x_j$ belongs to $x_i$'s neighborhood. These weighted edges form the fuzzy 1-skeleton — the graph UMAP optimizes over. Standard UMAP stops here, discarding higher-order simplices for tractability. The cost: $H_0$ (cluster structure) is encoded directly, but $H_1$ (loops, holes) can only be inferred — the graph cannot distinguish a genuine hollow loop from a dense cluster whose triangles would fill it in. That gap is what this thesis addresses.

AnchorUMAP — Results

Geometric · Topological · Three Datasets
Evaluated on a circle (S¹), a 2D wavy manifold, and PBMC3k single-cell RNA data. Four metrics: trustworthiness and Spearman's ρ for geometry (higher = better), Wasserstein and bottleneck distance from Ripser persistence diagrams for topology (lower = better). Key finding: AnchorUMAP improves topological fidelity on manifold-structured data without harming local geometry.
Geometric metrics
trustworthiness · Spearman ρ  ·  higher is better
Topological distortion — Wasserstein & Bottleneck (lower is better · k=10)
How faithfully does the embedding preserve the shape of the data? These distances compare persistence diagrams of the original vs. embedded data — smaller means the topology was preserved more accurately.
Circle (S¹)
Wasserstein
8.30 5.03
−39%
Bottleneck
1.80 0.50
−72%
Trustworthiness holds at 1.000. Circular topology is substantially more faithful.
2D Manifold · best case
Wasserstein
14.18 9.81
−31%
Bottleneck
0.72 0.20
−72%
Spearman's ρ also improves (+0.031 at k=10). Genuine 2D structure → genuine gains.
PBMC3k · expected null
Wasserstein
505.9 502.6
≈ no change
Bottleneck
2.03 2.03
≈ no change
No 2D manifold structure → anchor mechanism correctly does nothing.
Finding 01 · Circle
Local geometry intact, topology improved
Trustworthiness is exactly 1.000 for both methods at all k — no spurious neighbors introduced. Spearman's ρ drops slightly, a modest global trade-off. The bottleneck distance falls 72%, meaning the circular loop is recovered far more cleanly.
Finding 02 · 2D Manifold
Strongest case — k=10 optimal
The wavy surface is genuinely 2D, so triangle-finding identifies real structure. At k=10, both geometric and topological metrics improve. At k=20, spurious triangles accumulate and performance regresses — confirming that triangle quality filtering matters.
Finding 03 · PBMC3k
No improvement — as expected
Gene expression data forms clusters, not surfaces. AnchorUMAP finds no quality triangles and produces identical metrics. This is the correct outcome — the algorithm should be inert when no 2-simplicial structure exists.
UMAP Engine |M.A. Mathematics · UC Santa Cruz |