Captures $H_0$ — connected components
Also captures $H_1$ — loops & cycles
An extension of the UMAP dimensionality reduction algorithm that incorporates 2-simplices, improving topological fidelity on manifold-structured data.
Real high-dimensional data doesn't fill its ambient space uniformly; it concentrates near a low-dimensional manifold. A 64×64 image lives in $\mathbb{R}^{4096}$, but the set of natural images occupies a manifold of far smaller intrinsic dimension. Dimensionality reduction exploits this: if the true intrinsic dimension is $d \ll D$, we can represent the data faithfully in $\mathbb{R}^d$ while preserving its geometric and topological structure.
UMAP begins by assuming the data lies on a Riemannian manifold with an unknown metric. It estimates the local metric structure from the $k$-nearest neighbors of each point. For each point $x_i$, distances are normalized so that the nearest neighbor sits at normalized distance exactly 0 (and its edge therefore receives full membership weight 1):

$$d_i(x_i, x_j) = \max(0,\, d(x_i, x_j) - \rho_i)$$

where $\rho_i$ is the distance from $x_i$ to its nearest neighbor. This normalization puts every point on the same local scale: each "sees" its neighborhood at the same resolution, which is essential for handling datasets with varying densities.
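This local normalization can be sketched in a few lines of NumPy (the function name is illustrative, and the brute-force k-NN is an assumption for small data; real UMAP implementations use approximate nearest-neighbor search):

```python
import numpy as np

def local_normalize(X, k=5):
    """Per-point local metric normalization (brute-force sketch).

    Returns rho (nearest-neighbor distance per point) and the shifted
    distances max(0, d - rho_i) to each point's k nearest neighbors.
    """
    # Pairwise Euclidean distances; mask self-distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    knn = np.sort(D, axis=1)[:, :k]      # k smallest distances per row
    rho = knn[:, 0]                      # distance to nearest neighbor
    # After shifting, each nearest neighbor sits at distance 0,
    # so its membership weight exp(0) will be exactly 1.
    shifted = np.maximum(knn - rho[:, None], 0.0)
    return rho, shifted
```

Note the shifted distance to every point's nearest neighbor is exactly zero by construction, which is what guarantees local connectivity later.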
The local metrics are converted into a fuzzy simplicial set: a generalization of a simplicial complex (think: a graph, triangles, tetrahedra...) where membership has a probabilistic weight between 0 and 1. The weight of the directed edge from $x_i$ to $x_j$ is:

$$w(x_i, x_j) = \exp\left(-\frac{\max(0,\, d(x_i, x_j) - \rho_i)}{\sigma_i}\right)$$

where $\sigma_i$ is chosen so that the sum of weights over $x_i$'s $k$ neighbors equals $\log_2(k)$, enforcing a consistent notion of "$k$ neighbors." Multiple local views (one per point) are combined into a single global fuzzy set via a symmetric union:

$$w_{ij} = w(x_i, x_j) + w(x_j, x_i) - w(x_i, x_j)\,w(x_j, x_i)$$
This is the fuzzy set analogue of taking the union: P(A ∪ B) = P(A) + P(B) - P(A)P(B) for independent events.
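Both steps are short in code. A sketch, assuming the per-point shifted neighbor distances are already computed (the helper names `smooth_knn_sigma` and `fuzzy_union` are illustrative, not UMAP's API); the sum of weights is monotone increasing in $\sigma_i$, so a binary search suffices:

```python
import numpy as np

def smooth_knn_sigma(shifted, k, n_iter=64):
    """Binary-search sigma_i so that sum_j exp(-shifted_j / sigma_i) = log2(k).

    `shifted` holds one point's normalized neighbor distances
    max(0, d - rho_i); the weight sum grows monotonically with sigma.
    """
    target = np.log2(k)
    lo, hi = 1e-8, 1e6
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if np.exp(-shifted / mid).sum() > target:
            hi = mid        # weights too large -> shrink sigma
        else:
            lo = mid
    return 0.5 * (lo + hi)

def fuzzy_union(W):
    """Combine directed weights into a symmetric fuzzy set:
    w_ij = w(i,j) + w(j,i) - w(i,j) * w(j,i)."""
    return W + W.T - W * W.T
```

The union formula is applied elementwise to the full directed weight matrix, mirroring the probabilistic-union analogy above.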
UMAP finds a low-dimensional layout $Y \subset \mathbb{R}^d$ by minimizing the cross-entropy between the high-dimensional fuzzy set and a low-dimensional one whose edge weights are modeled as $v(y_i, y_j) = (1 + a\|y_i - y_j\|^{2b})^{-1}$:

$$C = \sum_{i \neq j} \left[ w_{ij} \log\frac{w_{ij}}{v_{ij}} + (1 - w_{ij}) \log\frac{1 - w_{ij}}{1 - v_{ij}} \right]$$

where $v_{ij} = v(y_i, y_j)$.
Minimized via stochastic gradient descent with negative sampling: attract connected points, repel disconnected ones. Unlike t-SNE's KL divergence, the cross-entropy treats attraction and repulsion symmetrically — which is why UMAP better preserves global structure.
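For intuition, the loss can be evaluated densely on a small layout. A sketch, with the caveats that the $a$, $b$ values below only approximate UMAP's curve-fit defaults for `min_dist=0.1`, and that in practice the loss is never formed densely but optimized through sampled attractive and repulsive updates:

```python
import numpy as np

def fuzzy_cross_entropy(w_high, Y, a=1.577, b=0.895, eps=1e-12):
    """Dense fuzzy-set cross-entropy between high-D weights and layout Y.

    a, b approximate UMAP's fitted defaults (an assumption here);
    eps guards the logarithms when w or v hits 0 or 1.
    """
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    v = 1.0 / (1.0 + a * d2 ** b)           # low-D membership strengths
    iu = np.triu_indices(len(Y), k=1)       # count each pair once
    w, v = w_high[iu], v[iu]
    attract = w * np.log((w + eps) / (v + eps))
    repel = (1.0 - w) * np.log((1.0 - w + eps) / (1.0 - v + eps))
    return float(np.sum(attract + repel))
```

Each pair contributes a Bernoulli KL divergence, which is zero exactly when the layout's membership strength matches the high-dimensional one; the attractive term pulls connected points together and the repulsive term pushes unconnected ones apart.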
A 0-simplex is a point: one data observation, with membership strength 1. The local metric normalization (subtracting the nearest-neighbor distance $\rho_i$) ensures every 0-simplex is the face of at least one 1-simplex with weight 1, guaranteeing local connectivity throughout the complex.
A 1-simplex is a weighted edge between two points. Its weight $w(x_i, x_j) = \exp({-\max(0,\, d - \rho_i)}/{\sigma_i})$ encodes how strongly $x_j$ belongs to $x_i$'s neighborhood. These weighted edges form the fuzzy 1-skeleton — the graph UMAP optimizes over. Standard UMAP stops here, discarding higher-order simplices for tractability. The cost: $H_0$ (cluster structure) is encoded directly, but $H_1$ (loops, holes) can only be inferred — the graph cannot distinguish a genuine hollow loop from a dense cluster whose triangles would fill it in. That gap is what this thesis addresses.
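The whole 1-skeleton construction fits in one routine. A brute-force end-to-end sketch (the function name is illustrative; production UMAP uses approximate k-NN rather than a full distance matrix):

```python
import numpy as np

def fuzzy_1_skeleton(X, k=5):
    """Build UMAP's weighted graph: directed weights
    w(x_i, x_j) = exp(-max(0, d - rho_i) / sigma_i),
    then symmetrized by the fuzzy union. Sketch for small data."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    idx = np.argsort(D, axis=1)[:, :k]            # k nearest neighbors
    W = np.zeros((n, n))
    for i in range(n):
        d = D[i, idx[i]]                          # sorted ascending
        shifted = np.maximum(d - d[0], 0.0)       # d[0] is rho_i
        lo, hi = 1e-8, 1e6                        # binary-search sigma_i
        for _ in range(64):
            sig = 0.5 * (lo + hi)
            if np.exp(-shifted / sig).sum() > np.log2(k):
                hi = sig
            else:
                lo = sig
        W[i, idx[i]] = np.exp(-shifted / sig)
    return W + W.T - W * W.T                      # symmetric fuzzy union
```

On the resulting graph every point's nearest-neighbor edge has weight exactly 1 (local connectivity) and all weights lie in [0, 1]; but because no 2-simplices are recorded, nothing in this structure says whether a cycle of strong edges bounds a genuine hole or should be filled in.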