Topology of Mind

Technical Overview: Predicting Neural Latent Dynamics from Visual Stimuli

Introduction

This project develops a supervised learning framework for predicting region-specific neural population dynamics from natural visual stimuli. The central objective is to construct a parametric mapping from visual input to low-dimensional latent coordinates that capture the geometric organization of multi-region neural activity. This approach integrates dimensionality reduction, topological data analysis, and deep learning to model how sensory input drives neural trajectories across the brain's representational manifolds.

The framework treats neural population activity not merely as high-dimensional vectors, but as points on region-specific manifolds embedded in neural state space. By learning to predict these manifold coordinates from stimulus features, we construct an interpretable model of sensory-to-neural transformation that respects the intrinsic geometric structure discovered through topological analysis.

1. Data Definition and Acquisition

Our analysis uses the Allen Institute Visual Coding Neuropixels dataset, which provides simultaneous recordings from multiple brain regions during presentation of natural movie stimuli. We formalize the data structure as follows.

Let $S(t) \in \mathcal{S}$ denote the visual stimulus at time $t$, where $\mathcal{S}$ represents the space of natural image frames. For each brain region $r$ from a set of regions $\mathcal{R}$ (including primary visual cortex VISp, lateral visual areas VISl/VISal/VISam, thalamic nuclei LP/PO/VPM, hippocampal regions CA1/CA3/DG, and others), we observe spike trains from $N_r$ simultaneously recorded neurons.

To construct a population-level representation, we convert discrete spike events into continuous firing-rate vectors using a sliding temporal window. Formally, let $s_i(t)$ denote the spike train of neuron $i$ in region $r$, represented as a sum of Dirac delta functions. The population firing-rate vector at time $t$ is defined as:

$$X_r(t) = \frac{1}{\Delta} \left[ \int_{t}^{t+\Delta} s_1(\tau) \, d\tau, \, \ldots, \, \int_{t}^{t+\Delta} s_{N_r}(\tau) \, d\tau \right]^T \in \mathbb{R}^{N_r}$$

where $\Delta$ is the temporal window width (typically 100–200 ms, chosen to balance temporal resolution and statistical reliability). Windows are computed with stride $\delta$ (often $\delta = \Delta/2$ for overlapping windows), producing a time series of population activity snapshots $\{X_r(t_j)\}$ aligned to discrete time points $t_j = j \cdot \delta$.

This construction transforms event-based spike data into a continuous trajectory through neural state space, enabling subsequent geometric and topological analysis.
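The windowing above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual preprocessing code: the function name `bin_spikes` and the toy spike trains are hypothetical, and the defaults mirror the $\Delta = 100$–$200$ ms, $\delta = \Delta/2$ choices described in the text.

```python
import numpy as np

def bin_spikes(spike_times, t_start, t_end, delta=0.15, stride=0.075):
    """Convert per-neuron spike-time arrays into a firing-rate matrix.

    spike_times : list of sorted 1-D arrays, one array of spike times per neuron.
    Returns X of shape (n_windows, n_neurons) in spikes/second, plus window starts.
    """
    window_starts = np.arange(t_start, t_end - delta + 1e-9, stride)
    X = np.empty((len(window_starts), len(spike_times)))
    for i, st in enumerate(spike_times):
        st = np.asarray(st)
        # count spikes in [t, t + delta) for every window, then divide by delta
        counts = (np.searchsorted(st, window_starts + delta)
                  - np.searchsorted(st, window_starts))
        X[:, i] = counts / delta
    return X, window_starts

# hypothetical toy data: two neurons, 1 s of recording
spikes = [np.array([0.05, 0.12, 0.30, 0.31, 0.80]), np.array([0.5, 0.6])]
X, t = bin_spikes(spikes, 0.0, 1.0, delta=0.2, stride=0.1)
```

Using `searchsorted` on sorted spike times keeps the binning $O(\log n)$ per window per neuron, which matters for long Neuropixels sessions.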

2. Normalization and Variance Stabilization

Neural firing rates exhibit substantial heterogeneity: some neurons fire at high baseline rates while others respond sparsely. To ensure each neuron contributes proportionally to downstream analysis, we apply z-score normalization across time:

$$\tilde{X}_r(t) = \frac{X_r(t) - \mu_r}{\sigma_r}$$

where $\mu_r \in \mathbb{R}^{N_r}$ and $\sigma_r \in \mathbb{R}^{N_r}$ are the temporal mean and standard deviation vectors computed over the entire recording session. This operation centers each neuron's activity at zero and scales by its variability, producing normalized activity vectors $\tilde{X}_r(t) \in \mathbb{R}^{N_r}$ with zero mean and unit variance for each neuron.

For count-based data, variance stabilization techniques such as square-root transformation or Anscombe transform may be applied before z-scoring to mitigate Poisson noise characteristics. In practice, z-scoring alone is often sufficient for spike-count data with adequate temporal windows.
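A compact sketch of the normalization step, with the optional square-root stabilization mentioned above. The function name is illustrative; the zero-variance guard for silent neurons is an implementation detail assumed here, not stated in the text.

```python
import numpy as np

def normalize_rates(X, sqrt_stabilize=False):
    """Z-score each neuron (column) across time; optionally apply a
    square-root transform first to stabilize Poisson-like variance."""
    X = np.sqrt(X) if sqrt_stabilize else np.asarray(X, dtype=float)
    mu = X.mean(axis=0)        # temporal mean per neuron
    sigma = X.std(axis=0)      # temporal std per neuron
    sigma[sigma == 0] = 1.0    # guard silent neurons against divide-by-zero
    return (X - mu) / sigma

# toy matrix: 3 time windows x 3 neurons, middle neuron is silent/constant
X = np.array([[1., 2., 5.],
              [3., 2., 7.],
              [5., 2., 9.]])
Z = normalize_rates(X)
```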

3. Latent Space Construction via Dimensionality Reduction

Neural population vectors $\tilde{X}_r(t) \in \mathbb{R}^{N_r}$ are high-dimensional (with $N_r$ ranging from tens to hundreds), but neural activity typically resides on a lower-dimensional manifold reflecting shared response patterns and correlational structure. We perform linear dimensionality reduction to construct latent coordinates:

$$Z_r(t) = W_r^T \tilde{X}_r(t) \in \mathbb{R}^k$$

where $W_r \in \mathbb{R}^{N_r \times k}$ is a projection matrix learned via Principal Component Analysis (PCA) or Factor Analysis, and $k \ll N_r$ (typically $k = 3$ to $10$). For PCA, the columns of $W_r$ form an orthonormal basis spanning the directions of maximal variance in the normalized firing-rate data; Factor Analysis yields loading vectors that model shared variance without an orthonormality constraint.

Pooled PCA for Shared Coordinate Systems

A critical design choice is whether to learn $W_r$ independently for each recording session or to pool data across sessions. Independent PCA per session produces session-specific coordinate systems that are not mutually aligned (the subspaces, axis orderings, and signs all differ), making cross-session comparison and supervised learning across sessions infeasible.

Instead, we employ pooled PCA: concatenate the normalized activity matrices $\{\tilde{X}_r^{(s)}(t)\}$ across all recording sessions $s$ and compute a single, shared projection matrix $W_r$ for each region. This yields a canonical coordinate system for region $r$ that is consistent across all recording sessions and animals. The shared latent space $Z_r(t)$ enables direct comparison of neural trajectories across sessions and, crucially, supports supervised learning where training and test data may come from different sessions.

The resulting latent coordinates $Z_r(t)$ provide a geometric parameterization of neural population state, forming the target variables for supervised prediction.
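Pooled PCA can be sketched with scikit-learn as follows. The function name is hypothetical, and the sketch assumes the session matrices share a common neuron ordering (in practice, cross-session correspondence of the $N_r$ dimensions must be established before pooling is meaningful).

```python
import numpy as np
from sklearn.decomposition import PCA

def pooled_pca(session_matrices, k=3):
    """Fit one shared projection W_r on data pooled across sessions.

    session_matrices : list of (T_s, N_r) normalized activity matrices,
    assumed here to share the same neuron ordering across sessions.
    Returns the shared (N_r, k) projection and per-session latents.
    """
    pooled = np.vstack(session_matrices)     # stack along time: (sum_s T_s, N_r)
    pca = PCA(n_components=k).fit(pooled)
    W = pca.components_.T                    # (N_r, k), orthonormal columns
    latents = [(Xs - pca.mean_) @ W for Xs in session_matrices]
    return W, latents

# toy example: two sessions drawn from the same 2-D latent subspace
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 20))                       # shared loading pattern
sessions = [rng.normal(size=(50, 2)) @ A for _ in range(2)]
W, latents = pooled_pca(sessions, k=2)
```

Because a single `pca.mean_` and $W_r$ are shared, the per-session latents live in one canonical coordinate system, which is exactly what the supervised stage requires.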

4. Topological Motivation: From Invariants to Coordinates

Before proceeding to predictive modeling, we employ topological data analysis to characterize the intrinsic geometric structure of the latent space $\{Z_r(t)\}$. This step motivates the choice of continuous latent coordinates over discrete topological invariants as prediction targets.

Persistent Homology

Given a point cloud $\mathcal{P}_r = \{Z_r(t_j)\}_{j=1}^{M} \subset \mathbb{R}^k$, we construct the Vietoris–Rips complex $\text{VR}_\epsilon(\mathcal{P}_r)$ at scale $\epsilon$: a simplicial complex where a $p$-simplex is included if all pairwise distances between its $(p+1)$ vertices are at most $\epsilon$. As $\epsilon$ increases from $0$ to $\infty$, the topology of $\text{VR}_\epsilon$ evolves, with topological features (connected components, loops, voids) appearing and disappearing.

Persistent homology tracks these features across scales. For each homological dimension $p$, we compute the $p$-th homology group $H_p(\text{VR}_\epsilon)$, whose rank $\beta_p$ (the $p$-th Betti number) counts independent $p$-dimensional holes:

  • $H_0$: Connected components (0-dimensional features)
  • $H_1$: Loops or cycles (1-dimensional holes)
  • $H_2$: Voids or cavities (2-dimensional holes)

Each topological feature $\alpha$ is characterized by its birth scale $b_\alpha$ and death scale $d_\alpha$, forming a persistence diagram $D_r = \{(b_i, d_i)\}$. Features with long persistence ($d_i - b_i \gg 0$) represent robust geometric structures, while short-lived features reflect noise.
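For the $H_0$ part of the filtration, the persistence diagram can be computed exactly without a TDA library: every point is born at scale $0$, and components merge precisely at the edge lengths of the Euclidean minimum spanning tree. The sketch below (function name hypothetical) exploits this; $H_1$/$H_2$ diagrams require a dedicated package such as ripser or GUDHI.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def h0_persistence(points):
    """H0 persistence diagram of a point cloud under the Vietoris-Rips
    filtration: births are all 0, deaths are the MST edge lengths
    (one component persists forever and is omitted here)."""
    D = squareform(pdist(points))            # dense pairwise-distance matrix
    mst = minimum_spanning_tree(D).toarray() # Euclidean MST edge weights
    deaths = np.sort(mst[mst > 0])           # component-merge scales
    births = np.zeros_like(deaths)
    return np.column_stack([births, deaths])

# toy cloud: two well-separated clusters -> one long-lived H0 feature
pts = np.array([[0., 0.], [0.1, 0.], [0., 0.1],
                [10., 0.], [10.1, 0.], [10., 0.1]])
dgm = h0_persistence(pts)
```

The long-persistence point (death near the inter-cluster distance) is the kind of robust feature the text distinguishes from short-lived noise.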

Why Latent Coordinates Instead of Topological Invariants?

Persistent homology reveals that neural manifolds across brain regions exhibit diverse topological structures. Some regions show transient $H_1$ loops corresponding to cyclic dynamics (e.g., periodic responses to repeated stimulus features), while others display predominantly tree-like structures with minimal loop topology. Importantly, these topological features are:

  • Region-specific: Different brain areas have distinct topological signatures, reflecting specialized computational roles.
  • Stimulus-dependent: Loop formation and dissolution depend on which stimulus features are presented.
  • Transient: Topological features appear and disappear dynamically, rather than being static properties.

While persistence diagrams $D_r$ provide valuable insight into manifold geometry, they do not directly support continuous prediction. Topological invariants are discrete, combinatorial summaries that do not capture the precise position of neural activity within the geometric structure. To predict where the brain state lies on its manifold in response to a given stimulus, we need a continuous parameterization.

Thus, we adopt latent coordinates $Z_r(t) \in \mathbb{R}^k$ as the prediction target. These coordinates provide a geometric embedding that:

  • Respects the topology revealed by persistent homology
  • Parameterizes position on the neural manifold continuously
  • Generalizes across different topological structures (trees, loops, higher-dimensional manifolds)
  • Supports gradient-based optimization for supervised learning

In essence, topology reveals the structure of neural state space, while latent coordinates provide a navigable representation suitable for prediction and interpretation.

5. Stimulus–Neural Alignment

To construct supervised training pairs, we must align visual stimulus frames $S(t)$ with neural latent coordinates $Z_r(t)$. The stimulus is presented at discrete frame times $\{t_i^S\}$, while neural windows are centered at times $\{t_j^N\}$. We define the alignment mapping:

$$f_j = \mathop{\arg\min}_{i} \left| t_i^S - t_j^N \right|$$

assigning each neural window to its nearest stimulus frame. This produces paired samples $\{(S(f_j), Z_r(t_j))\}$ for supervised learning.

Accounting for Neural Latency

Visual information processing incurs a delay between stimulus presentation and peak neural response, typically ranging from 50 to 200 ms depending on brain region and processing hierarchy. To account for this latency $\tau_r$, we modify the alignment to:

$$f_j = \mathop{\arg\min}_{i} \left| t_i^S - (t_j^N - \tau_r) \right|$$

effectively pairing each neural window with the stimulus frame presented $\tau_r$ earlier. The optimal latency can be determined empirically through cross-correlation analysis or treated as a hyperparameter during model training.
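The latency-corrected nearest-frame alignment can be implemented vectorized with `searchsorted`, assuming sorted frame times. The function name and the 80 ms default latency are illustrative, not values fixed by the text.

```python
import numpy as np

def align_frames(stim_times, neural_times, latency=0.08):
    """For each neural window time t_j, return the index of the stimulus
    frame nearest to t_j - tau_r (the latency-shifted target time)."""
    targets = np.asarray(neural_times) - latency
    stim_times = np.asarray(stim_times)
    idx = np.searchsorted(stim_times, targets)       # insertion points
    idx = np.clip(idx, 1, len(stim_times) - 1)       # keep both neighbors valid
    # pick whichever neighboring frame is closer to the target time
    left_closer = (targets - stim_times[idx - 1]) <= (stim_times[idx] - targets)
    return np.where(left_closer, idx - 1, idx)

# toy example: frames at whole seconds, two neural windows, 0.5 s latency
frame_idx = align_frames(np.arange(10.0), [2.4, 6.6], latency=0.5)
```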

6. Supervised Learning Framework

Our goal is to learn a parametric function that predicts region-specific latent coordinates from stimulus features. We first embed the visual stimulus into a feature space using a pretrained convolutional neural network or vision-language model:

$$\phi(S(t)) \in \mathbb{R}^d$$

where $\phi: \mathcal{S} \to \mathbb{R}^d$ maps raw image frames to $d$-dimensional feature vectors. Common choices include CLIP embeddings, ResNet features, or ImageNet-pretrained CNN activations.

Multi-Region Prediction Architecture

We employ a shared-trunk architecture with region-specific prediction heads. A shared encoder processes stimulus embeddings into a common high-level representation:

$$h(t) = G(\phi(S(t)); \theta_{\text{shared}}) \in \mathbb{R}^{d'}$$

where $G$ is a multi-layer perceptron or transformer encoder with parameters $\theta_{\text{shared}}$. Each brain region $r$ then has a dedicated prediction head:

$$\hat{Z}_r(t) = H_r(h(t); \theta_r) \in \mathbb{R}^k$$

where $H_r$ is a shallow network (often a single linear layer) mapping the shared representation to region-specific latent coordinates.

This architecture shares visual feature extraction across regions while allowing specialized mapping to region-specific manifold coordinates, reflecting the hypothesis that different brain areas extract distinct information from common sensory input.
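A minimal PyTorch sketch of this architecture, with illustrative dimensions ($d = 512$, $d' = 256$, $k = 3$) and a hypothetical two-region set; the class name and layer sizes are assumptions, not the project's actual model.

```python
import torch
import torch.nn as nn

class MultiRegionPredictor(nn.Module):
    """Shared MLP trunk G over stimulus embeddings, with one linear
    head H_r per brain region."""
    def __init__(self, d_in=512, d_hidden=256, k=3, regions=("VISp", "CA1")):
        super().__init__()
        self.trunk = nn.Sequential(              # G(.; theta_shared)
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleDict(              # H_r(.; theta_r), one per region
            {r: nn.Linear(d_hidden, k) for r in regions}
        )

    def forward(self, phi_s):
        h = self.trunk(phi_s)                    # shared representation h(t)
        return {r: head(h) for r, head in self.heads.items()}

model = MultiRegionPredictor()
out = model(torch.randn(4, 512))                 # batch of 4 stimulus embeddings
```

`nn.ModuleDict` registers every head's parameters with the optimizer while keeping region names as keys, which pairs naturally with the masked loss in the next subsection.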

Loss Function and Masked Regression

We train the model using mean squared error in latent space, aggregated across all regions and time points:

$$\mathcal{L}(\theta) = \sum_{r \in \mathcal{R}} \sum_{j} m_r(t_j) \left\| Z_r(t_j) - \hat{Z}_r(t_j) \right\|_2^2$$

where $m_r(t_j) \in \{0, 1\}$ is a binary mask indicating whether region $r$ was recorded at time $t_j$. This masking handles missing data due to partial recordings: not all regions are recorded in every session, and the mask ensures we only compute loss for observed region-time pairs.

The model parameters $\{\theta_{\text{shared}}, \theta_r\}$ are optimized via stochastic gradient descent to minimize $\mathcal{L}$, learning to predict the geometric position of neural population activity on region-specific manifolds directly from visual input.
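The masked objective $\mathcal{L}(\theta)$ translates directly into code. The function name and the dict-of-tensors layout are assumptions of this sketch.

```python
import torch

def masked_latent_loss(Z_true, Z_pred, mask):
    """Masked MSE over regions: Z_true/Z_pred map region -> (T, k)
    tensors; mask maps region -> (T,) binary tensor marking windows
    in which that region was actually recorded."""
    loss = 0.0
    for r in Z_true:
        sq = ((Z_true[r] - Z_pred[r]) ** 2).sum(dim=-1)  # ||Z - Zhat||^2 per window
        loss = loss + (mask[r] * sq).sum()               # drop unrecorded windows
    return loss

# toy check: only the first of two windows is recorded, so only its
# squared error (= 1.0) contributes
loss = masked_latent_loss(
    {"VISp": torch.tensor([[1., 0.], [0., 1.]])},
    {"VISp": torch.zeros(2, 2)},
    {"VISp": torch.tensor([1., 0.])},
)
```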

7. Conceptual Interpretation: Stimulus-Driven Manifold Traversal

The learned model embodies a specific view of brain function: neural population activity forms trajectories on low-dimensional manifolds, and sensory stimuli induce systematic mappings into these manifolds.

For each region $r$, the latent space $\mathbb{R}^k$ parameterizes a manifold $\mathcal{M}_r \subset \mathbb{R}^k$ on which neural dynamics unfold. The stimulus defines a function:

$$\Phi_r : \mathcal{S} \to \mathcal{M}_r$$

$$S(t) \mapsto Z_r(t)$$

representing the neural response to sensory input as a point on the region's representational manifold. Our predictive model $\hat{Z}_r(t) = F_r(\phi(S(t)))$ approximates this mapping, learning how visual features drive neural trajectories.

From Topology to Geometry

Topological data analysis reveals the qualitative structure of $\mathcal{M}_r$—whether it contains loops, voids, or exhibits tree-like branching. Persistent homology provides invariants that classify these structures up to continuous deformation. However, topology alone does not specify the metric geometry: two manifolds with identical topology may have vastly different shapes, distances, and coordinate representations.

By predicting continuous latent coordinates rather than topological invariants, we move from qualitative structure to quantitative geometry. This transition is essential for:

  • Generalization: Continuous coordinates smoothly interpolate between observed states, enabling prediction for novel stimuli.
  • Interpretability: Coordinates provide explicit positions on the manifold, which can be visualized and analyzed.
  • Flexibility: The framework accommodates diverse topological structures (trees, loops, higher-dimensional forms) within a unified coordinate representation.

The topological analysis serves as a discovery phase, revealing the intrinsic dimensionality and structural constraints of neural dynamics. The supervised learning phase then constructs a predictive model that respects these constraints while mapping stimuli to their corresponding manifold positions.

Conclusion

This pipeline synthesizes dimensionality reduction, topological data analysis, and deep learning into a unified framework for understanding sensory-to-neural transformation. By treating brain activity as trajectories on geometric manifolds and learning to predict manifold coordinates from stimuli, we construct an interpretable, geometry-respecting model of neural computation. The approach reveals how visual input drives population dynamics across distributed brain regions, each characterized by its own intrinsic representational geometry, and provides a foundation for understanding the brain's high-dimensional computational landscape.
