01
Perceptron
The simplest neural network — a single linear classifier.
Supervised
Classification
1958 — Rosenblatt
Forward Pass
Given input vector $\mathbf{x} \in \mathbb{R}^n$, weight vector $\mathbf{w} \in \mathbb{R}^n$, and bias $b$:
$$z = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b$$
$$\hat{y} = \sigma(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$
Learning Rule
For a training sample $(\mathbf{x}, y)$ with learning rate $\eta$:
$$\mathbf{w} \leftarrow \mathbf{w} + \eta \,(y - \hat{y})\,\mathbf{x}$$
$$b \leftarrow b + \eta \,(y - \hat{y})$$
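The update rule above fits in a few lines of NumPy. A minimal sketch, with the AND dataset, learning rate, and epoch count chosen for illustration:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=20):
    """Perceptron learning: w += eta*(y - y_hat)*x, b += eta*(y - y_hat)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w + b >= 0 else 0   # step activation
            w += eta * (yi - y_hat) * xi
            b += eta * (yi - y_hat)
    return w, b

# AND is linearly separable, so the convergence theorem below applies.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
preds = (X @ w + b >= 0).astype(int)
```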
Convergence Theorem
If the training data is linearly separable with margin $\gamma = \min_i \frac{y_i(\mathbf{w}^{*\top}\mathbf{x}_i)}{\|\mathbf{w}^*\|}$ (labels taken as $y_i \in \{-1,+1\}$ for this bound), the perceptron converges in at most $\left(\frac{R}{\gamma}\right)^2$ updates, where $R = \max_i \|\mathbf{x}_i\|$.
x₁ ──w₁──╮
x₂ ──w₂──┤→ Σ + b → step(·) → ŷ
x₃ ──w₃──╯
02
Multi-Layer Perceptron (MLP)
Feedforward network with one or more hidden layers — a universal function approximator.
Supervised
Classification / Regression
Universal Approximation
Architecture
An MLP with $L$ layers maps input $\mathbf{x}$ through a series of affine transformations and nonlinearities:
$$\mathbf{h}^{(0)} = \mathbf{x}$$
$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}, \quad l = 1, \dots, L$$
$$\mathbf{h}^{(l)} = f\!\left(\mathbf{z}^{(l)}\right), \quad l = 1, \dots, L-1$$
$$\hat{\mathbf{y}} = g\!\left(\mathbf{z}^{(L)}\right)$$
Where $f$ is a hidden activation (e.g. ReLU) and $g$ is the output activation (e.g. softmax for classification, identity for regression).
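A minimal NumPy sketch of this forward pass for classification; the layer sizes and random weights are illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for stability
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, weights, biases):
    """h^(l) = f(W^(l) h^(l-1) + b^(l)); softmax output activation g."""
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ h + b
        h = softmax(z) if l == len(weights) - 1 else relu(z)
    return h

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]            # input 4 -> two hidden layers of 8 -> 3 classes
Ws = [rng.normal(0, 0.5, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
y_hat = mlp_forward(rng.normal(size=4), Ws, bs)
```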
Universal Approximation Theorem
A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of $\mathbb{R}^n$, given a non-polynomial activation function.
$$\forall\, \varepsilon > 0,\;\exists\, N,\; \mathbf{W}, \mathbf{b}:\quad \sup_{\mathbf{x} \in K} \left| f(\mathbf{x}) - \sum_{i=1}^{N} v_i \,\sigma\!\left(\mathbf{w}_i^\top \mathbf{x} + b_i\right) \right| < \varepsilon$$
Loss Functions
Mean Squared Error (Regression)
$$\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\|\mathbf{y}_i - \hat{\mathbf{y}}_i\|^2$$
Cross-Entropy (Classification)
$$\mathcal{L}_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log\hat{y}_{i,c}$$
Input Hidden 1 Hidden 2 Output
○─────╲
○──────●──────●──────●
○──────●──────●──────●──────○ ŷ
○──────●──────●──────●
○─────╱
03
Activation Functions
Nonlinearities that give neural networks their expressive power.
| Name | Formula $f(z)$ | Derivative $f'(z)$ |
|---|---|---|
| Sigmoid | $\frac{1}{1+e^{-z}}$ | $f(z)(1-f(z))$ |
| Tanh | $\frac{e^z - e^{-z}}{e^z + e^{-z}}$ | $1 - f(z)^2$ |
| ReLU | $\max(0, z)$ | $\begin{cases}1 & z>0\\0 & z\leq 0\end{cases}$ |
| Leaky ReLU | $\max(\alpha z, z)$ | $\begin{cases}1 & z>0\\\alpha & z\leq 0\end{cases}$ |
| ELU | $\begin{cases}z & z>0\\\alpha(e^z-1) & z\leq 0\end{cases}$ | $\begin{cases}1 & z>0\\f(z)+\alpha & z\leq 0\end{cases}$ |
| GELU | $z \cdot \Phi(z)$ | $\Phi(z) + z\,\phi(z)$ |
| Swish / SiLU | $z \cdot \sigma(z)$ | $f(z) + \sigma(z)(1 - f(z))$ |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $f_i(\delta_{ij} - f_j)$ |
| Mish | $z \cdot \tanh(\ln(1+e^z))$ | See chain rule expansion |
GELU (Gaussian Error Linear Unit)
$$\text{GELU}(z) = z \cdot \Phi(z) = z \cdot \frac{1}{2}\left[1 + \text{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right]$$
$$\approx 0.5\,z\left(1 + \tanh\!\left[\sqrt{\frac{2}{\pi}}\left(z + 0.044715\,z^3\right)\right]\right)$$
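A quick NumPy check that the tanh form tracks the exact erf form; the grid and tolerance are illustrative:

```python
import numpy as np
from math import erf

def gelu_exact(z):
    """GELU(z) = z * Phi(z), with Phi the standard normal CDF via erf."""
    return z * 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))

def gelu_tanh(z):
    """The tanh approximation quoted above."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.linspace(-4.0, 4.0, 401)
max_err = np.abs(gelu_exact(z) - gelu_tanh(z)).max()
```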
04
Backpropagation
The chain rule applied layer-by-layer to compute gradients efficiently.
Core Algorithm
1986 — Rumelhart, Hinton, Williams
Chain Rule (Vector Form)
For loss $\mathcal{L}$ with respect to parameters in layer $l$:
$$\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{z}^{(L)}}\mathcal{L} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}}$$
$$\boldsymbol{\delta}^{(l)} = \left(\mathbf{W}^{(l+1)\top}\boldsymbol{\delta}^{(l+1)}\right) \odot f'\!\left(\mathbf{z}^{(l)}\right)$$
Parameter Gradients
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} \mathbf{h}^{(l-1)\top}$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$
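The two recursions above, sketched in NumPy for a two-layer tanh/softmax network, with a finite-difference check on one weight. Shapes, seed, and target are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)
y = np.array([0.0, 1.0])                      # one-hot target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1
    h1 = np.tanh(z1)
    z2 = W2 @ h1 + b2
    e = np.exp(z2 - z2.max())
    p = e / e.sum()                           # softmax output
    return -np.sum(y * np.log(p)), (h1, p)    # cross-entropy loss

loss, (h1, p) = forward(W1, b1, W2, b2)

# delta^(L) = p - y for softmax + cross-entropy
d2 = p - y
gW2, gb2 = np.outer(d2, h1), d2
# delta^(l) = (W^(l+1)T delta^(l+1)) * f'(z^(l)), with tanh' = 1 - h^2
d1 = (W2.T @ d2) * (1.0 - h1**2)
gW1, gb1 = np.outer(d1, x), d1

# Finite-difference check on a single entry of W1
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
num = (forward(W1p, b1, W2, b2)[0] - loss) / eps
```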
Computational Complexity
For a network with $L$ layers and $n$ neurons per layer, backpropagation has $O(Ln^2)$ time complexity — the same as the forward pass — making it highly efficient compared to numerical differentiation.
05
Optimization Algorithms
Methods for traversing the loss landscape to find good minima.
Stochastic Gradient Descent (SGD)
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t)$$
SGD with Momentum
$$\mathbf{v}_{t+1} = \mu\, \mathbf{v}_t - \eta\, \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t)$$
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \mathbf{v}_{t+1}$$
Nesterov Accelerated Gradient
$$\mathbf{v}_{t+1} = \mu\, \mathbf{v}_t - \eta\, \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t + \mu\,\mathbf{v}_t)$$
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \mathbf{v}_{t+1}$$
AdaGrad
$$\mathbf{G}_{t} = \mathbf{G}_{t-1} + \mathbf{g}_t^2$$
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{G}_t + \epsilon}}\, \mathbf{g}_t$$
RMSProp
$$\mathbf{v}_t = \rho\,\mathbf{v}_{t-1} + (1-\rho)\,\mathbf{g}_t^2$$
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{v}_t + \epsilon}}\, \mathbf{g}_t$$
Adam
$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t$$
$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2$$
$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}$$
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}\,\hat{\mathbf{m}}_t$$
AdamW (Decoupled Weight Decay)
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\left(\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} + \lambda\,\boldsymbol{\theta}_t\right)$$
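A minimal sketch of the Adam update on a toy quadratic; the step count, learning rate, and objective are illustrative:

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(theta) = ||theta||^2, whose gradient is 2*theta.
theta, m, v = np.array([3.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
```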
06
Regularization Techniques
Methods to prevent overfitting and improve generalization.
L1 Regularization (Lasso)
$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda \sum_{l}\|\mathbf{W}^{(l)}\|_1 = \mathcal{L} + \lambda \sum_{l}\sum_{i,j}|W^{(l)}_{ij}|$$
L2 Regularization (Ridge / Weight Decay)
$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \frac{\lambda}{2}\sum_{l}\|\mathbf{W}^{(l)}\|_F^2 = \mathcal{L} + \frac{\lambda}{2}\sum_{l}\sum_{i,j}(W^{(l)}_{ij})^2$$
Dropout
During training, each neuron is independently set to zero with probability $p$:
$$\mathbf{m} \sim \text{Bernoulli}(1-p)$$
$$\tilde{\mathbf{h}}^{(l)} = \mathbf{m} \odot \mathbf{h}^{(l)}$$
$$\text{At test time:}\quad \mathbf{h}^{(l)}_{\text{test}} = (1-p)\,\mathbf{h}^{(l)}$$
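A sketch of (non-inverted) dropout matching the convention above, where the mask keeps a unit with probability $1-p$; sizes and seed are illustrative:

```python
import numpy as np

def dropout(h, p, train, rng):
    """Zero each unit with prob p at train time; scale by (1-p) at test time."""
    if train:
        m = rng.random(h.shape) >= p     # keep mask ~ Bernoulli(1-p)
        return m * h
    return (1 - p) * h

rng = np.random.default_rng(0)
h = np.ones(100_000)
p = 0.3
train_out = dropout(h, p, True, rng)     # random mask applied
test_out = dropout(h, p, False, rng)     # deterministic rescaling
```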
Batch Normalization
$$\mu_B = \frac{1}{m}\sum_{i=1}^m z_i, \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^m(z_i - \mu_B)^2$$
$$\hat{z}_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
$$y_i = \gamma\,\hat{z}_i + \beta$$
Layer Normalization
$$\mu = \frac{1}{H}\sum_{i=1}^{H}h_i, \quad \sigma^2 = \frac{1}{H}\sum_{i=1}^{H}(h_i - \mu)^2$$
$$\hat{h}_i = \frac{h_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y_i = \gamma\,\hat{h}_i + \beta$$
RMSNorm
$$\text{RMS}(\mathbf{h}) = \sqrt{\frac{1}{H}\sum_{i=1}^H h_i^2}$$
$$\hat{h}_i = \frac{h_i}{\text{RMS}(\mathbf{h})}\,\gamma_i$$
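Both normalizers in NumPy. An eps term is added inside the RMS for numerical stability, as implementations typically do; the data and shapes are illustrative:

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-5):
    """Normalize each row to zero mean, unit variance, then scale and shift."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

def rms_norm(h, gamma, eps=1e-5):
    """Rescale each row by its root-mean-square; no mean subtraction."""
    rms = np.sqrt(np.mean(h**2, axis=-1, keepdims=True) + eps)
    return gamma * h / rms

rng = np.random.default_rng(0)
h = rng.normal(3.0, 2.0, size=(4, 16))   # shifted, scaled activations
ln = layer_norm(h, 1.0, 0.0)
rn = rms_norm(h, 1.0)
```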
07
Convolutional Neural Network (CNN)
Networks exploiting spatial structure through shared local filters.
Supervised
Computer Vision
1989 — LeCun
2D Convolution
For input $\mathbf{X} \in \mathbb{R}^{C_{in} \times H \times W}$ and filter $\mathbf{K} \in \mathbb{R}^{C_{in} \times k \times k}$:
$$(\mathbf{X} * \mathbf{K})[i,j] = \sum_{c=1}^{C_{in}}\sum_{m=0}^{k-1}\sum_{n=0}^{k-1} X[c,\, i+m,\, j+n] \cdot K[c,\, m,\, n]$$
Output Dimensions
$$H_{\text{out}} = \left\lfloor\frac{H + 2p - k}{s}\right\rfloor + 1, \quad W_{\text{out}} = \left\lfloor\frac{W + 2p - k}{s}\right\rfloor + 1$$
Where $p$ is padding, $s$ is stride, $k$ is kernel size.
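The output-size formula and the convolution sum, sketched naively in NumPy. The averaging kernel and input sizes are illustrative; like most deep-learning frameworks, this computes cross-correlation:

```python
import numpy as np

def conv_out(n, k, p, s):
    """floor((n + 2p - k) / s) + 1, as in the formula above."""
    return (n + 2 * p - k) // s + 1

def conv2d(X, K, p=0, s=1):
    """Naive single-output-channel convolution over a C_in x H x W input."""
    C, H, W = X.shape
    _, k, _ = K.shape
    Xp = np.pad(X, ((0, 0), (p, p), (p, p)))
    Ho, Wo = conv_out(H, k, p, s), conv_out(W, k, p, s)
    Y = np.zeros((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            Y[i, j] = np.sum(Xp[:, i*s:i*s+k, j*s:j*s+k] * K)
    return Y

X = np.random.default_rng(0).normal(size=(3, 32, 32))
K = np.ones((3, 3, 3)) / 27.0        # 3x3 averaging filter over 3 channels
Y = conv2d(X, K, p=1, s=2)
```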
Depthwise Separable Convolution
Factorizes a standard convolution into a depthwise and pointwise step:
$$\text{Standard cost:}\quad C_{in} \cdot k^2 \cdot C_{out} \cdot H' \cdot W'$$
$$\text{Separable cost:}\quad C_{in} \cdot k^2 \cdot H' \cdot W' + C_{in} \cdot C_{out} \cdot H' \cdot W'$$
$$\text{Reduction ratio:}\quad \frac{1}{C_{out}} + \frac{1}{k^2}$$
Dilated (Atrous) Convolution
$$(\mathbf{X} *_d \mathbf{K})[i,j] = \sum_{m}\sum_{n} X[i + d \cdot m,\; j + d \cdot n] \cdot K[m, n]$$
Effective receptive field: $k + (k-1)(d-1)$, where $d$ is the dilation rate.
Pooling Operations
$$\text{Max Pool:}\quad y_{ij} = \max_{(m,n) \in \mathcal{R}_{ij}} x_{mn}$$
$$\text{Avg Pool:}\quad y_{ij} = \frac{1}{|\mathcal{R}_{ij}|}\sum_{(m,n) \in \mathcal{R}_{ij}} x_{mn}$$
Transposed Convolution
Used for upsampling. Equivalent to a fractionally strided convolution: zeros are inserted between (and around) input elements before a standard convolution:
$$H_{\text{out}} = (H_{\text{in}} - 1) \cdot s - 2p + k + p_{\text{out}}$$
08
U-Net
Encoder-decoder with skip connections for dense prediction tasks. The classic backbone of image diffusion models.
Supervised
Segmentation / Diffusion
2015 — Ronneberger et al.
Architecture
Symmetric encoder (contracting) and decoder (expanding) path with skip connections concatenating encoder features to decoder features at each resolution:
$$\text{Encoder:}\quad \mathbf{e}^{(l)} = \text{MaxPool}\!\left(\text{ConvBlock}(\mathbf{e}^{(l-1)})\right)$$
$$\text{Decoder:}\quad \mathbf{d}^{(l)} = \text{ConvBlock}\!\left([\text{UpConv}(\mathbf{d}^{(l+1)});\; \mathbf{e}^{(l)}]\right)$$
Where $[\cdot;\cdot]$ denotes channel-wise concatenation (skip connection).
ConvBlock
$$\text{ConvBlock}(\mathbf{x}) = \text{ReLU}(\text{BN}(\text{Conv}_{3\times3}(\text{ReLU}(\text{BN}(\text{Conv}_{3\times3}(\mathbf{x}))))))$$
U-Net in Diffusion Models
In DDPM / Stable Diffusion, the U-Net is conditioned on timestep $t$ and optional conditioning $c$:
$$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) = \text{U-Net}(\mathbf{x}_t,\; \text{SinEmb}(t),\; \text{CrossAttn}(c))$$
Time embeddings are injected via addition/FiLM layers; text conditioning via cross-attention at each resolution level.
Encoder Decoder
x ──[Conv]──┐ ┌──[Conv]── ŷ
64 │ skip │ 64
[Pool] ├───────────────→┤ [UpConv]
──[Conv]──┐ │ │ ┌──[Conv]──
128 │ │ skip │ │ 128
[Pool] ├─┼───────────────→┼─┤ [UpConv]
─[Conv]─┐ │ │ │ │ ┌─[Conv]─
256 │ │ │ skip │ │ │ 256
[Pool] ├─┼─┼──────────────→┼─┼─┤[UpConv]
─[Conv]─╯ │ │ Bottleneck │ │ ╰─[Conv]─
512 │ │ ────[Conv]── │ │ 512
09
Recurrent Neural Network (Vanilla RNN)
Networks with temporal memory via recurrent connections.
Supervised
Sequence Modeling
Temporal
Hidden State Dynamics
$$\mathbf{h}_t = \tanh\!\left(\mathbf{W}_{hh}\,\mathbf{h}_{t-1} + \mathbf{W}_{xh}\,\mathbf{x}_t + \mathbf{b}_h\right)$$
$$\mathbf{y}_t = \mathbf{W}_{hy}\,\mathbf{h}_t + \mathbf{b}_y$$
Backpropagation Through Time (BPTT)
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T}\sum_{k=1}^{t} \frac{\partial \mathcal{L}_t}{\partial \mathbf{h}_t}\left(\prod_{j=k+1}^{t}\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}\right)\frac{\partial \mathbf{h}_k}{\partial \mathbf{W}_{hh}}$$
Vanishing/Exploding Gradient Problem
The product of Jacobians $\prod_j \frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}$ can shrink or grow exponentially:
$$\left\|\prod_{j=k+1}^{t}\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}\right\| \leq \left(\|\mathbf{W}_{hh}\| \cdot \gamma\right)^{t-k}$$
Where $\gamma = \max |f'(z)|$. If $\|\mathbf{W}_{hh}\| \cdot \gamma < 1$, gradients vanish; if $> 1$, they explode.
10
Long Short-Term Memory (LSTM)
Gated RNN architecture solving the vanishing gradient problem.
Supervised
Sequence Modeling
1997 — Hochreiter & Schmidhuber
Gate Equations
$$\mathbf{f}_t = \sigma\!\left(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f\right) \quad \text{(forget gate)}$$
$$\mathbf{i}_t = \sigma\!\left(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i\right) \quad \text{(input gate)}$$
$$\tilde{\mathbf{c}}_t = \tanh\!\left(\mathbf{W}_c[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c\right) \quad \text{(candidate)}$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \quad \text{(cell state)}$$
$$\mathbf{o}_t = \sigma\!\left(\mathbf{W}_o[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o\right) \quad \text{(output gate)}$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$
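One step of the gate equations in NumPy, with the four gate blocks (forget, input, candidate, output) packed into a single weight matrix; sizes and seed are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b, d_h):
    """One LSTM step; W stacks the f, i, c-tilde, o blocks row-wise."""
    z = W @ np.concatenate([h, x]) + b
    f = sigmoid(z[:d_h])                 # forget gate
    i = sigmoid(z[d_h:2*d_h])            # input gate
    c_tilde = np.tanh(z[2*d_h:3*d_h])    # candidate
    o = sigmoid(z[3*d_h:])               # output gate
    c = f * c + i * c_tilde              # cell state
    h = o * np.tanh(c)                   # hidden state
    return h, c

d_x, d_h = 8, 16
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (4 * d_h, d_h + d_x))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(10):                      # run a short random sequence
    h, c = lstm_step(rng.normal(size=d_x), h, c, W, b, d_h)

n_params = W.size + b.size               # 4[(d_h + d_x) d_h + d_h]
```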
Gradient Flow through Cell State
The cell state provides a highway for gradients:
$$\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \text{diag}(\mathbf{f}_t)$$
$$\frac{\partial \mathbf{c}_T}{\partial \mathbf{c}_k} = \prod_{j=k+1}^{T}\text{diag}(\mathbf{f}_j)$$
When $\mathbf{f}_t \approx 1$, gradients flow unattenuated over many timesteps.
Parameter Count
$$\text{Params} = 4\left[(d_h + d_x)\cdot d_h + d_h\right]$$
11
Gated Recurrent Unit (GRU)
A simplified gating mechanism merging cell and hidden state.
Supervised
Sequence Modeling
2014 — Cho et al.
Gate Equations
$$\mathbf{r}_t = \sigma\!\left(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_r\right) \quad \text{(reset gate)}$$
$$\mathbf{z}_t = \sigma\!\left(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_z\right) \quad \text{(update gate)}$$
$$\tilde{\mathbf{h}}_t = \tanh\!\left(\mathbf{W}_h[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_h\right)$$
$$\mathbf{h}_t = (1 - \mathbf{z}_t)\odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$$
Parameter Count
$$\text{Params} = 3\left[(d_h + d_x)\cdot d_h + d_h\right]$$
GRU uses 25% fewer parameters than LSTM: three weight blocks (reset gate, update gate, candidate) versus LSTM's four.
12
Extended LSTM (xLSTM)
Modernized LSTM with exponential gating and matrix memory for LLM-scale performance.
Supervised
Sequence Modeling
2024 — Beck et al.
sLSTM (Scalar Memory)
Extends LSTM with exponential gating and a normalizer state for numerical stability:
$$\mathbf{f}_t = \exp\!\left(\mathbf{w}_f^\top \mathbf{x}_t + b_f\right) \quad \text{(exponential forget gate)}$$
$$\mathbf{i}_t = \exp\!\left(\mathbf{w}_i^\top \mathbf{x}_t + b_i\right) \quad \text{(exponential input gate)}$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$
$$\mathbf{n}_t = \mathbf{f}_t \odot \mathbf{n}_{t-1} + \mathbf{i}_t \quad \text{(normalizer state)}$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \frac{\mathbf{c}_t}{\mathbf{n}_t}$$
mLSTM (Matrix Memory)
Replaces the scalar cell state with a matrix $\mathbf{C}_t \in \mathbb{R}^{d \times d}$, enabling key-value storage:
$$\mathbf{k}_t = \mathbf{W}_k\mathbf{x}_t, \quad \mathbf{v}_t = \mathbf{W}_v\mathbf{x}_t, \quad \mathbf{q}_t = \mathbf{W}_q\mathbf{x}_t$$
$$\mathbf{C}_t = f_t\,\mathbf{C}_{t-1} + i_t\,\mathbf{v}_t\mathbf{k}_t^\top$$
$$\mathbf{n}_t = f_t\,\mathbf{n}_{t-1} + i_t\,\mathbf{k}_t$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \frac{\mathbf{C}_t\,\mathbf{q}_t}{\max(|\mathbf{n}_t^\top\mathbf{q}_t|, 1)}$$
mLSTM is fully parallelizable (no hidden-to-hidden recurrence) and can be viewed as a linearized self-attention with a decay factor.
13
Bidirectional RNN
Processing sequences in both forward and backward directions.
Sequence Modeling
1997 — Schuster & Paliwal
Architecture
$$\overrightarrow{\mathbf{h}}_t = f\!\left(\mathbf{W}_{\overrightarrow{h}}\,\overrightarrow{\mathbf{h}}_{t-1} + \mathbf{W}_{x\overrightarrow{h}}\,\mathbf{x}_t + \mathbf{b}_{\overrightarrow{h}}\right)$$
$$\overleftarrow{\mathbf{h}}_t = f\!\left(\mathbf{W}_{\overleftarrow{h}}\,\overleftarrow{\mathbf{h}}_{t+1} + \mathbf{W}_{x\overleftarrow{h}}\,\mathbf{x}_t + \mathbf{b}_{\overleftarrow{h}}\right)$$
$$\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t;\, \overleftarrow{\mathbf{h}}_t] \in \mathbb{R}^{2d_h}$$
$$\mathbf{y}_t = \mathbf{W}_y\,\mathbf{h}_t + \mathbf{b}_y$$
14
Attention Mechanism
Learning to focus on relevant parts of the input.
Core Mechanism
2014 — Bahdanau et al.
Additive (Bahdanau) Attention
$$e_{ij} = \mathbf{v}^\top \tanh\!\left(\mathbf{W}_1\mathbf{h}_i + \mathbf{W}_2\mathbf{s}_j\right)$$
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k}\exp(e_{kj})}$$
$$\mathbf{c}_j = \sum_i \alpha_{ij}\,\mathbf{h}_i$$
Multiplicative (Luong) Attention
$$e_{ij} = \mathbf{s}_j^\top \mathbf{W} \mathbf{h}_i \quad\text{(general)}$$
$$e_{ij} = \mathbf{s}_j^\top \mathbf{h}_i \quad\text{(dot)}$$
Scaled Dot-Product Attention
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}$$
The $\sqrt{d_k}$ scaling prevents softmax saturation when dot products grow large.
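Scaled dot-product attention is a few lines of NumPy; the shapes and seed are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V; also return the attention weights."""
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))
    return A @ V, A

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 8, 4
Q, K = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out, A = attention(Q, K, V)
```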
15
Transformer
Attention-only architecture that revolutionized NLP and beyond.
Supervised / Self-Supervised
NLP / Vision / Multimodal
2017 — Vaswani et al.
Multi-Head Attention
$$\mathbf{Q}_i = \mathbf{X}\mathbf{W}_i^Q,\quad \mathbf{K}_i = \mathbf{X}\mathbf{W}_i^K,\quad \mathbf{V}_i = \mathbf{X}\mathbf{W}_i^V$$
$$\text{head}_i = \text{Attention}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i)$$
$$\text{MultiHead}(\mathbf{X}) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,\mathbf{W}^O$$
Where $\mathbf{W}_i^Q, \mathbf{W}_i^K \in \mathbb{R}^{d \times d_k}$, $\mathbf{W}_i^V \in \mathbb{R}^{d \times d_v}$, $\mathbf{W}^O \in \mathbb{R}^{hd_v \times d}$.
Sinusoidal Positional Encoding
$$\text{PE}_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right)$$
$$\text{PE}_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
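The sinusoidal table in NumPy; sequence length and model width are illustrative:

```python
import numpy as np

def sinusoidal_pe(n_pos, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d // 2)[None, :]
    angle = pos / 10000 ** (2 * i / d)
    pe = np.zeros((n_pos, d))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_pe(128, 64)
```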
Rotary Positional Embedding (RoPE)
$$\mathbf{R}_\theta^{(m)} = \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 \\ \sin m\theta_1 & \cos m\theta_1 \\ & & \cos m\theta_2 & -\sin m\theta_2 \\ & & \sin m\theta_2 & \cos m\theta_2 \\ & & & & \ddots \end{pmatrix}$$
$$\mathbf{q}_m^\top \mathbf{k}_n = (\mathbf{R}_\theta^{(m)}\mathbf{W}_q \mathbf{x}_m)^\top (\mathbf{R}_\theta^{(n)}\mathbf{W}_k \mathbf{x}_n)$$
Feed-Forward Network (per position)
$$\text{FFN}(\mathbf{x}) = \mathbf{W}_2\,\text{GELU}(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$
Encoder Block
$$\mathbf{x}' = \text{LayerNorm}(\mathbf{x} + \text{MultiHead}(\mathbf{x}))$$
$$\mathbf{x}'' = \text{LayerNorm}(\mathbf{x}' + \text{FFN}(\mathbf{x}'))$$
Decoder Block (with causal mask)
The causal mask $\mathbf{M}$ sets future positions to $-\infty$ before softmax:
$$M_{ij} = \begin{cases} 0 & \text{if } i \geq j \\ -\infty & \text{if } i < j \end{cases}$$
$$\text{CausalAttn}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}} + \mathbf{M}\right)\mathbf{V}$$
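The causal mask in NumPy: adding $-\infty$ above the diagonal zeroes those attention weights after the softmax. Shapes and seed are illustrative:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with M = 0 on/below the diagonal, -inf above."""
    n, d_k = Q.shape
    M = np.where(np.tril(np.ones((n, n))) == 1, 0.0, -np.inf)
    S = Q @ K.T / np.sqrt(d_k) + M
    e = np.exp(S - S.max(axis=-1, keepdims=True))   # exp(-inf) -> 0
    A = e / e.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
V = rng.normal(size=(4, 3))
out, A = causal_attention(Q, K, V)
```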
Grouped-Query Attention (GQA)
Shares key-value heads across groups of query heads to reduce memory:
$$\text{KV heads} = \frac{h}{G}, \quad \text{Each KV head serves } G \text{ query heads}$$
Computational Complexity
$$\text{Self-Attention:}\quad O(n^2 \cdot d)$$
$$\text{FFN:}\quad O(n \cdot d^2)$$
$$\text{Total per layer:}\quad O(n^2 d + n d^2)$$
16
BERT (Encoder-Only Transformer)
Bidirectional encoder pre-trained with masked language modeling — the foundation for NLU tasks.
Self-Supervised → Supervised
NLU / Classification / NER
2018 — Devlin et al.
Masked Language Modeling (MLM)
Randomly mask 15% of input tokens and predict the originals:
$$\mathcal{L}_{\text{MLM}} = -\mathbb{E}\!\left[\sum_{i \in \mathcal{M}} \log P_\theta(x_i | \mathbf{x}_{\backslash\mathcal{M}})\right]$$
Of the 15% selected: 80% are replaced with [MASK], 10% with a random token, 10% unchanged.
Next Sentence Prediction (NSP)
$$P(\text{IsNext} | [\text{CLS}]\, A\, [\text{SEP}]\, B) = \sigma(\mathbf{w}^\top \mathbf{h}_{[\text{CLS}]} + b)$$
Input Representation
$$\mathbf{h}_0 = \mathbf{E}_{\text{tok}}(\mathbf{x}) + \mathbf{E}_{\text{pos}} + \mathbf{E}_{\text{seg}}$$
Segment embeddings distinguish sentence A from B. The [CLS] token representation is used for classification tasks.
Fine-Tuning
$$\text{Classification:}\quad \hat{y} = \text{softmax}(\mathbf{W}\,\mathbf{h}_{[\text{CLS}]} + \mathbf{b})$$
$$\text{Token-level (NER):}\quad \hat{y}_i = \text{softmax}(\mathbf{W}\,\mathbf{h}_i + \mathbf{b})$$
| Model | Layers | Hidden | Heads | Params |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
| RoBERTa | 24 | 1024 | 16 | 355M |
17
Sequence-to-Sequence (Encoder-Decoder)
Mapping variable-length input sequences to variable-length output sequences.
Supervised
Translation / Summarization
2014 — Sutskever et al.
RNN-Based Seq2Seq
$$\text{Encoder:}\quad \mathbf{h}_t^{\text{enc}} = f_{\text{enc}}(\mathbf{x}_t, \mathbf{h}_{t-1}^{\text{enc}})$$
$$\mathbf{c} = \mathbf{h}_T^{\text{enc}} \quad \text{(context vector = final encoder state)}$$
$$\text{Decoder:}\quad \mathbf{h}_t^{\text{dec}} = f_{\text{dec}}(y_{t-1}, \mathbf{h}_{t-1}^{\text{dec}}, \mathbf{c})$$
$$P(y_t | y_{<t}, \mathbf{x}) = \text{softmax}(\mathbf{W}_o\,\mathbf{h}_t^{\text{dec}})$$
Seq2Seq with Attention
$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_j\exp(e_{tj})}, \quad e_{ti} = \text{score}(\mathbf{h}_t^{\text{dec}}, \mathbf{h}_i^{\text{enc}})$$
$$\mathbf{c}_t = \sum_i \alpha_{ti}\,\mathbf{h}_i^{\text{enc}}$$
$$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_c[\mathbf{c}_t;\,\mathbf{h}_t^{\text{dec}}])$$
Transformer Encoder-Decoder (T5 / BART)
$$\text{Encoder:}\quad \mathbf{H}^{\text{enc}} = \text{TransformerEncoder}(\mathbf{x})$$
$$\text{Decoder layer:}\quad \mathbf{h}' = \text{CausalSelfAttn}(\mathbf{h}) + \mathbf{h}$$
$$\mathbf{h}'' = \text{CrossAttn}(\mathbf{h}', \mathbf{H}^{\text{enc}}) + \mathbf{h}'$$
$$\mathbf{h}''' = \text{FFN}(\mathbf{h}'') + \mathbf{h}''$$
Teacher Forcing
$$\text{Training:}\quad \hat{y}_t = f(y_{t-1}^{\text{gold}}, \mathbf{h}_{t-1}) \quad \text{(use ground truth as input)}$$
$$\text{Inference:}\quad \hat{y}_t = f(\hat{y}_{t-1}, \mathbf{h}_{t-1}) \quad \text{(use model's own prediction)}$$
18
Vision Transformer (ViT)
Applying the transformer architecture directly to image patches.
Supervised / Self-Supervised
Computer Vision
2020 — Dosovitskiy et al.
Patch Embedding
An image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ is split into $N$ patches of size $P \times P$:
$$N = \frac{H \cdot W}{P^2}$$
$$\mathbf{x}_p^{(i)} \in \mathbb{R}^{P^2 \cdot C} \quad \text{(flattened patch } i\text{)}$$
$$\mathbf{z}_0^{(i)} = \mathbf{x}_p^{(i)}\,\mathbf{E} + \mathbf{e}_{\text{pos}}^{(i)}, \quad \mathbf{E} \in \mathbb{R}^{(P^2 C) \times d}$$
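Patch extraction is a reshape/transpose; with ViT-B/16 geometry the flattened patch dimension ($16 \cdot 16 \cdot 3 = 768$) happens to equal the hidden size. The random projection below stands in for the learned $\mathbf{E}$:

```python
import numpy as np

def patchify(x, P):
    """Split an H x W x C image into N = HW/P^2 flattened P*P*C patches."""
    H, W, C = x.shape
    n_h, n_w = H // P, W // P
    patches = x.reshape(n_h, P, n_w, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(n_h * n_w, P * P * C)

rng = np.random.default_rng(0)
x = rng.normal(size=(224, 224, 3))
patches = patchify(x, 16)                      # ViT-B/16 geometry: 14 x 14 patches
E = rng.normal(0, 0.02, (16 * 16 * 3, 768))    # stand-in for the learned projection
z0 = patches @ E
```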
CLS Token
$$\mathbf{z}_0 = [\mathbf{x}_{\text{cls}};\; \mathbf{z}_0^{(1)};\; \mathbf{z}_0^{(2)};\; \dots;\; \mathbf{z}_0^{(N)}] + \mathbf{E}_{\text{pos}}$$
$$\hat{y} = \text{MLP}(\text{LayerNorm}(\mathbf{z}_L^{(0)}))$$
Full Forward Pass
$$\mathbf{z}'_l = \text{MSA}(\text{LN}(\mathbf{z}_{l-1})) + \mathbf{z}_{l-1}$$
$$\mathbf{z}_l = \text{FFN}(\text{LN}(\mathbf{z}'_l)) + \mathbf{z}'_l$$
Variants
| Model | Patch Size | Layers | Hidden | Heads | Params |
|---|---|---|---|---|---|
| ViT-B/16 | 16 | 12 | 768 | 12 | 86M |
| ViT-L/16 | 16 | 24 | 1024 | 16 | 307M |
| ViT-H/14 | 14 | 32 | 1280 | 16 | 632M |
19
RWKV (Receptance Weighted Key Value)
Linear-complexity RNN that matches transformer quality — trainable like a transformer, runs like an RNN.
Self-Supervised
Language Modeling
2023 — Peng et al.
Time Mixing (Attention Replacement)
$$\mathbf{r}_t = \mathbf{W}_r(\mu_r \odot \mathbf{x}_t + (1-\mu_r)\odot\mathbf{x}_{t-1})$$
$$\mathbf{k}_t = \mathbf{W}_k(\mu_k \odot \mathbf{x}_t + (1-\mu_k)\odot\mathbf{x}_{t-1})$$
$$\mathbf{v}_t = \mathbf{W}_v(\mu_v \odot \mathbf{x}_t + (1-\mu_v)\odot\mathbf{x}_{t-1})$$
WKV Mechanism (Linear Attention)
$$\text{wkv}_t = \frac{\sum_{i=1}^{t-1}e^{-(t-1-i)w+k_i}\mathbf{v}_i + e^{u+k_t}\mathbf{v}_t}{\sum_{i=1}^{t-1}e^{-(t-1-i)w+k_i} + e^{u+k_t}}$$
$$\mathbf{o}_t = \sigma(\mathbf{r}_t) \odot \text{wkv}_t$$
Where $w$ is a learned decay vector and $u$ is a learned bonus for the current token. This can be computed recurrently in $O(1)$ per step.
Channel Mixing (FFN Replacement)
$$\mathbf{r}_t' = \sigma(\mathbf{W}_{r'}(\mu_{r'}\odot\mathbf{x}_t + (1-\mu_{r'})\odot\mathbf{x}_{t-1}))$$
$$\mathbf{k}_t' = \mathbf{W}_{k'}(\mu_{k'}\odot\mathbf{x}_t + (1-\mu_{k'})\odot\mathbf{x}_{t-1})$$
$$\mathbf{o}_t = \mathbf{r}_t' \odot (\mathbf{W}_v'\,\max(\mathbf{k}_t', 0)^2)$$
Complexity
$$\text{Training:}\quad O(Td) \quad\text{(parallelizable like transformer)}$$
$$\text{Inference:}\quad O(d) \text{ per token} \quad\text{(constant, like RNN)}$$
20
Autoencoder
Learning compressed representations via reconstruction.
Unsupervised
Representation Learning
Architecture
$$\text{Encoder:}\quad \mathbf{z} = f_\phi(\mathbf{x}) = \sigma(\mathbf{W}_e\mathbf{x} + \mathbf{b}_e)$$
$$\text{Decoder:}\quad \hat{\mathbf{x}} = g_\theta(\mathbf{z}) = \sigma(\mathbf{W}_d\mathbf{z} + \mathbf{b}_d)$$
$$\mathcal{L} = \|\mathbf{x} - \hat{\mathbf{x}}\|^2$$
Denoising Autoencoder
$$\tilde{\mathbf{x}} = \mathbf{x} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2\mathbf{I})$$
$$\mathcal{L}_{\text{DAE}} = \|\mathbf{x} - g_\theta(f_\phi(\tilde{\mathbf{x}}))\|^2$$
Sparse Autoencoder
$$\mathcal{L}_{\text{sparse}} = \|\mathbf{x} - \hat{\mathbf{x}}\|^2 + \lambda \sum_j \text{KL}(\rho \,\|\, \hat{\rho}_j)$$
$$\text{KL}(\rho\,\|\,\hat{\rho}_j) = \rho\log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}$$
21
Variational Autoencoder (VAE)
Probabilistic generative model with a learned latent space.
Generative
Latent Variable Model
2013 — Kingma & Welling
Generative Model
$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z}, \quad p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$$
Evidence Lower Bound (ELBO)
$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\!\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - \text{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right) = \text{ELBO}$$
Reparameterization Trick
$$q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}\!\left(\boldsymbol{\mu}_\phi(\mathbf{x}),\, \text{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x}))\right)$$
$$\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
KL Divergence (Closed Form for Gaussians)
$$\text{KL}(q\,\|\,p) = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
Loss Function
$$\mathcal{L}_{\text{VAE}} = -\mathbb{E}_{q_\phi}\!\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] + \text{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z})\right)$$
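The reparameterized sample and the closed-form KL term in NumPy; the encoder outputs here are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(size=(4, 8))                     # stand-in encoder means
log_var = rng.normal(scale=0.1, size=(4, 8))     # stand-in encoder log-variances
sigma = np.exp(0.5 * log_var)

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
eps = rng.standard_normal(mu.shape)
z = mu + sigma * eps

# Closed-form KL(q || N(0, I)), summed over latent dims, averaged over batch
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1).mean()
```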
22
Generative Adversarial Network (GAN)
Two networks competing in a minimax game to generate realistic data.
Generative
Image Synthesis
2014 — Goodfellow et al.
Minimax Objective
$$\min_G \max_D\; V(D, G) = \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\!\left[\log D(\mathbf{x})\right] + \mathbb{E}_{\mathbf{z}\sim p_z}\!\left[\log(1 - D(G(\mathbf{z})))\right]$$
Optimal Discriminator
$$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}$$
Global Optimum
At the Nash equilibrium, $p_g = p_{\text{data}}$ and $D^*(\mathbf{x}) = \frac{1}{2}$:
$$V(D^*, G^*) = -\log 4$$
$$C(G) = -\log 4 + 2 \cdot \text{JSD}(p_{\text{data}} \,\|\, p_g)$$
Wasserstein GAN (WGAN)
$$W(p_{\text{data}}, p_g) = \sup_{\|f\|_L \leq 1}\; \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}[f(\mathbf{x})] - \mathbb{E}_{\mathbf{x}\sim p_g}[f(\mathbf{x})]$$
The critic $f$ (replacing the discriminator) is enforced to be 1-Lipschitz via gradient penalty:
$$\mathcal{L}_{\text{GP}} = \lambda\,\mathbb{E}_{\hat{\mathbf{x}}}\!\left[\left(\|\nabla_{\hat{\mathbf{x}}} f(\hat{\mathbf{x}})\|_2 - 1\right)^2\right]$$
$$\hat{\mathbf{x}} = \alpha\,\mathbf{x}_{\text{real}} + (1-\alpha)\,\mathbf{x}_{\text{fake}},\quad \alpha\sim U[0,1]$$
23
Diffusion Models (DDPM)
Generating data by learning to reverse a gradual noising process.
Generative
Image / Audio / Video
2020 — Ho, Jain, Abbeel
Forward Process (Diffusion)
$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t\mathbf{I}\right)$$
$$q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\, (1-\bar{\alpha}_t)\mathbf{I}\right)$$
Where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$.
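The closed-form forward jump in NumPy with the linear DDPM schedule; by $t = T$ the marginal is effectively $\mathcal{N}(\mathbf{0}, \mathbf{I})$. The sample count is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
beta = np.linspace(1e-4, 0.02, T)        # linear noise schedule
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)            # abar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps (closed-form q(x_t | x_0))."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.normal(size=10_000)
x_T = q_sample(x0, T - 1, rng.standard_normal(x0.shape))
```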
Reverse Process
$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \sigma_t^2\mathbf{I}\right)$$
$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right)$$
Training Objective (Simplified)
$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\!\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\; t\right)\right\|^2\right]$$
Score-Based Formulation
$$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) = -\sqrt{1-\bar{\alpha}_t}\,\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t) = -\sqrt{1-\bar{\alpha}_t}\,\mathbf{s}_\theta(\mathbf{x}_t, t)$$
Classifier-Free Guidance
$$\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t, c) = (1+w)\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) - w\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing)$$
24
Normalizing Flows
Exact likelihood models using invertible transformations.
Generative
Exact Likelihood
Change of Variables
$$\mathbf{x} = f(\mathbf{z}), \quad \mathbf{z} = f^{-1}(\mathbf{x})$$
$$\log p_\mathbf{x}(\mathbf{x}) = \log p_\mathbf{z}(f^{-1}(\mathbf{x})) + \log\left|\det\frac{\partial f^{-1}}{\partial \mathbf{x}}\right|$$
Composition of Flows
$$\mathbf{x} = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z})$$
$$\log p(\mathbf{x}) = \log p(\mathbf{z}) - \sum_{k=1}^{K}\log\left|\det\frac{\partial f_k}{\partial \mathbf{h}_{k-1}}\right|$$
Coupling Layer (RealNVP)
$$\mathbf{x}_{1:d} = \mathbf{z}_{1:d}$$
$$\mathbf{x}_{d+1:D} = \mathbf{z}_{d+1:D} \odot \exp\!\left(s(\mathbf{z}_{1:d})\right) + t(\mathbf{z}_{1:d})$$
The Jacobian is triangular, so $\det = \prod \exp(s_i) = \exp(\sum s_i)$, computed in $O(D)$.
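A coupling layer and its analytic inverse in NumPy; the toy $s$ and $t$ functions stand in for small learned networks:

```python
import numpy as np

def coupling_forward(z, s_fn, t_fn, d):
    """x_{1:d} = z_{1:d}; x_{d+1:D} = z_{d+1:D} * exp(s) + t; log|det J| = sum(s)."""
    z1, z2 = z[:d], z[d:]
    s, t = s_fn(z1), t_fn(z1)
    return np.concatenate([z1, z2 * np.exp(s) + t]), s.sum()

def coupling_inverse(x, s_fn, t_fn, d):
    """Invert by recomputing s, t from the untouched first block."""
    x1, x2 = x[:d], x[d:]
    s, t = s_fn(x1), t_fn(x1)
    return np.concatenate([x1, (x2 - t) * np.exp(-s)])

# Toy scale/translate functions (stand-ins for small MLPs)
s_fn = lambda z1: np.tanh(z1)
t_fn = lambda z1: 2.0 * z1

z = np.random.default_rng(0).normal(size=6)
x, log_det = coupling_forward(z, s_fn, t_fn, d=3)
z_rec = coupling_inverse(x, s_fn, t_fn, d=3)
```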
25
Energy-Based Models
Defining probability distributions via scalar energy functions.
Generative
Unnormalized
Energy Function
$$p_\theta(\mathbf{x}) = \frac{\exp(-E_\theta(\mathbf{x}))}{Z_\theta}, \quad Z_\theta = \int \exp(-E_\theta(\mathbf{x}))\,d\mathbf{x}$$
Score Matching
$$\mathcal{L}_{\text{SM}} = \mathbb{E}_{p_{\text{data}}}\!\left[\frac{1}{2}\|\nabla_\mathbf{x} \log p_\theta(\mathbf{x})\|^2 + \text{tr}(\nabla^2_\mathbf{x} \log p_\theta(\mathbf{x}))\right]$$
Contrastive Divergence
$$\nabla_\theta \log p_\theta(\mathbf{x}) = -\nabla_\theta E_\theta(\mathbf{x}) + \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(\mathbf{x})]$$
$$\approx -\nabla_\theta E_\theta(\mathbf{x}_{\text{data}}) + \nabla_\theta E_\theta(\tilde{\mathbf{x}})$$
Where $\tilde{\mathbf{x}}$ is obtained from a few steps of MCMC starting from data.
26
Siamese Networks & Contrastive Learning
Learning representations by comparing pairs or groups of inputs — foundation of CLIP, SimCLR, and modern self-supervised vision.
Self-Supervised
Representation Learning
1993 — Bromley et al. / 2020 — Chen et al.
Siamese Network
Two identical networks sharing weights process two inputs and compare their embeddings:
$$\mathbf{z}_1 = f_\theta(\mathbf{x}_1), \quad \mathbf{z}_2 = f_\theta(\mathbf{x}_2)$$
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{z}_1 - \mathbf{z}_2\|_2$$
Contrastive Loss
$$\mathcal{L}_{\text{contrastive}} = (1-y)\frac{1}{2}d^2 + y\frac{1}{2}\max(0, m - d)^2$$
Where $y=0$ for similar pairs, $y=1$ for dissimilar, and $m$ is the margin.
Triplet Loss
$$\mathcal{L}_{\text{triplet}} = \max\!\left(0,\; \|f(\mathbf{x}_a) - f(\mathbf{x}_p)\|^2 - \|f(\mathbf{x}_a) - f(\mathbf{x}_n)\|^2 + \alpha\right)$$
NT-Xent Loss (SimCLR)
Normalized temperature-scaled cross-entropy over a batch of $2N$ augmented pairs:
$$\text{sim}(\mathbf{z}_i, \mathbf{z}_j) = \frac{\mathbf{z}_i^\top\mathbf{z}_j}{\|\mathbf{z}_i\|\,\|\mathbf{z}_j\|}$$
$$\ell_{i,j} = -\log\frac{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_j)/\tau)}{\sum_{k=1}^{2N}\mathbf{1}_{[k\neq i]}\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_k)/\tau)}$$
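The NT-Xent loss above can be sketched in a few lines of NumPy. The batch layout (rows $2k$ and $2k{+}1$ form a positive pair), the temperature $\tau = 0.5$, and the batch size are illustrative assumptions, not part of SimCLR's fixed recipe:

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N embeddings, where rows 2k and 2k+1 are a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors -> dot = cosine sim
    sim = z @ z.T / tau                                # pairwise similarities, scaled
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity (k != i)
    n = len(z)
    pos = np.arange(n) ^ 1                             # index of each row's positive partner
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_softmax[np.arange(n), pos].mean()      # -log p(positive | anchor), averaged

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                           # 2N = 8 toy embeddings
loss = nt_xent(z)
```

Making the two views of each pair identical drives the positive similarities to their maximum and lowers the loss, which is a quick sanity check on an implementation.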
CLIP (Contrastive Language-Image Pre-training)
Aligns image and text embeddings using a symmetric contrastive loss over a batch of $N$ image-text pairs:
$$\mathbf{z}_I = f_{\text{image}}(\mathbf{x}_I), \quad \mathbf{z}_T = f_{\text{text}}(\mathbf{x}_T)$$
$$\text{logits} = \mathbf{Z}_I\,\mathbf{Z}_T^\top \cdot e^\tau$$
$$\mathcal{L}_{\text{CLIP}} = \frac{1}{2}\left(\text{CE}(\text{logits},\, \mathbf{I}_N) + \text{CE}(\text{logits}^\top,\, \mathbf{I}_N)\right)$$
BYOL / SimSiam (No Negatives)
$$\mathcal{L}_{\text{BYOL}} = 2 - 2\cdot\frac{\langle q_\theta(\mathbf{z}_1),\, \text{sg}(\mathbf{z}_2')\rangle}{\|q_\theta(\mathbf{z}_1)\|\,\|\mathbf{z}_2'\|}$$
Where $\text{sg}(\cdot)$ is stop-gradient, $\mathbf{z}_2'$ comes from an EMA target encoder, and $q_\theta$ is a predictor MLP.
27
JEPA (Joint Embedding Predictive Architecture)
Yann LeCun's proposed path to human-level AI — predicting in latent space rather than pixel space.
Self-Supervised
Representation Learning / World Models
2022 — LeCun / 2023 — Assran et al. (I-JEPA)
Core Principle
Unlike generative models (which predict pixels) or contrastive models (which compare positive/negative pairs), JEPA predicts the representation of a target from a context — entirely in embedding space:
$$\text{Generative:}\quad \text{predict } \mathbf{x} \;\text{(pixel space)}$$
$$\text{Contrastive:}\quad \text{maximize } \text{sim}(f(\mathbf{x}), f(\mathbf{x}^+)) \text{ vs } f(\mathbf{x}^-)$$
$$\text{JEPA:}\quad \text{predict } \bar{f}(\mathbf{y}) \text{ from } f(\mathbf{x}) \;\text{(latent space)}$$
Architecture
$$\mathbf{s}_x = f_\theta(\mathbf{x}) \quad \text{(context encoder)}$$
$$\bar{\mathbf{s}}_y = \bar{f}_{\bar{\theta}}(\mathbf{y}) \quad \text{(target encoder — EMA of } f_\theta\text{)}$$
$$\hat{\mathbf{s}}_y = g_\phi(\mathbf{s}_x, \mathbf{m}) \quad \text{(predictor, conditioned on mask } \mathbf{m}\text{)}$$
Loss Function
$$\mathcal{L}_{\text{JEPA}} = \|\hat{\mathbf{s}}_y - \text{sg}(\bar{\mathbf{s}}_y)\|^2$$
Where $\text{sg}(\cdot)$ is stop-gradient. The target encoder $\bar{f}$ is updated via exponential moving average (EMA):
$$\bar{\theta} \leftarrow \alpha\,\bar{\theta} + (1-\alpha)\,\theta, \quad \alpha \in [0.996, 1)$$
I-JEPA (Image JEPA)
The context encoder sees a partial view of the image (with masked patches), and the predictor must predict target block representations in latent space:
$$\mathbf{x}_{\text{context}} = \text{ViT}_\theta(\text{visible patches})$$
$$\hat{\mathbf{s}}_{y_m} = g_\phi(\mathbf{x}_{\text{context}}, \text{pos}(y_m)) \quad \forall\, m \in \text{target blocks}$$
$$\mathcal{L}_{\text{I-JEPA}} = \frac{1}{M}\sum_{m=1}^M \|\hat{\mathbf{s}}_{y_m} - \text{sg}(\bar{\mathbf{s}}_{y_m})\|^2$$
V-JEPA (Video JEPA)
Extends to video by masking spacetime tubes and predicting their latent representations:
$$\mathbf{x} \in \mathbb{R}^{T \times H \times W \times C} \rightarrow \text{mask spacetime tubes} \rightarrow \text{predict in latent space}$$
JEPA vs Other Paradigms
| Method | Prediction Space | Negatives? | Collapse Prevention |
| Autoencoder | Pixel / Input | No | Bottleneck |
| Contrastive (SimCLR) | Latent (similarity) | Yes | Negative pairs |
| BYOL / SimSiam | Latent | No | EMA + stop-gradient |
| JEPA | Latent (prediction) | No | EMA + stop-gradient + masking |
28
Graph Neural Networks
Neural networks operating on graph-structured data.
Supervised / Semi-Supervised
Graphs & Networks
Message Passing Framework
$$\mathbf{m}_v^{(l)} = \bigoplus_{u \in \mathcal{N}(v)} M^{(l)}\!\left(\mathbf{h}_v^{(l)}, \mathbf{h}_u^{(l)}, \mathbf{e}_{vu}\right)$$
$$\mathbf{h}_v^{(l+1)} = U^{(l)}\!\left(\mathbf{h}_v^{(l)}, \mathbf{m}_v^{(l)}\right)$$
Graph Convolutional Network (GCN)
$$\mathbf{H}^{(l+1)} = \sigma\!\left(\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}\mathbf{H}^{(l)}\mathbf{W}^{(l)}\right)$$
Where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ (adjacency with self-loops), $\tilde{\mathbf{D}}_{ii} = \sum_j \tilde{A}_{ij}$.
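A minimal NumPy sketch of one GCN propagation step on a toy 3-node path graph; the one-hot features and all-ones weight matrix are illustrative assumptions:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_tilde = A + np.eye(len(A))                  # add self-loops
    d = A_tilde.sum(axis=1)                       # degrees of the self-looped graph
    D_inv_sqrt = np.diag(d ** -0.5)               # D^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)         # ReLU nonlinearity

# path graph 0 - 1 - 2
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.eye(3)                                     # one-hot node features
W = np.ones((3, 2))                               # toy weight matrix
out = gcn_layer(A, H, W)
```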
Graph Attention Network (GAT)
$$e_{ij} = \text{LeakyReLU}\!\left(\mathbf{a}^\top [\mathbf{W}\mathbf{h}_i \,\|\, \mathbf{W}\mathbf{h}_j]\right)$$
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k\in\mathcal{N}(i)}\exp(e_{ik})}$$
$$\mathbf{h}_i' = \sigma\!\left(\sum_{j\in\mathcal{N}(i)}\alpha_{ij}\,\mathbf{W}\mathbf{h}_j\right)$$
GraphSAGE
$$\mathbf{h}_{\mathcal{N}(v)}^{(l)} = \text{AGGREGATE}\!\left(\left\{\mathbf{h}_u^{(l)}: u \in \mathcal{N}(v)\right\}\right)$$
$$\mathbf{h}_v^{(l+1)} = \sigma\!\left(\mathbf{W}^{(l)}\cdot[\mathbf{h}_v^{(l)} \,\|\, \mathbf{h}_{\mathcal{N}(v)}^{(l)}]\right)$$
Graph Readout
$$\mathbf{h}_G = \text{READOUT}\!\left(\left\{\mathbf{h}_v^{(L)} : v \in G\right\}\right) = \frac{1}{|V|}\sum_{v\in V}\mathbf{h}_v^{(L)}$$
29
Capsule Networks
Encoding part-whole relationships with vector-valued capsules.
Supervised
Computer Vision
2017 — Sabour, Frosst, Hinton
Squash Function
$$\text{squash}(\mathbf{s}_j) = \frac{\|\mathbf{s}_j\|^2}{1 + \|\mathbf{s}_j\|^2}\cdot\frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}$$
Dynamic Routing
$$\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij}\,\mathbf{u}_i \quad \text{(prediction vectors)}$$
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \quad \text{(coupling coefficients)}$$
$$\mathbf{s}_j = \sum_i c_{ij}\,\hat{\mathbf{u}}_{j|i}, \quad \mathbf{v}_j = \text{squash}(\mathbf{s}_j)$$
$$b_{ij} \leftarrow b_{ij} + \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j \quad \text{(routing update)}$$
Margin Loss
$$\mathcal{L}_k = T_k \max(0, m^+ - \|\mathbf{v}_k\|)^2 + \lambda(1-T_k)\max(0, \|\mathbf{v}_k\| - m^-)^2$$
30
Hopfield Network
Associative memory via energy minimization in a fully connected network.
Unsupervised
Associative Memory
1982 — Hopfield
Energy Function
$$E = -\frac{1}{2}\sum_{i\neq j} w_{ij}\,s_i\,s_j - \sum_i \theta_i\,s_i = -\frac{1}{2}\mathbf{s}^\top\mathbf{W}\mathbf{s} - \boldsymbol{\theta}^\top\mathbf{s}$$
Hebbian Learning (Storage)
$$w_{ij} = \frac{1}{N}\sum_{\mu=1}^{P}\xi_i^\mu\,\xi_j^\mu, \quad w_{ii}=0$$
$$\mathbf{W} = \frac{1}{N}\sum_{\mu=1}^{P}\boldsymbol{\xi}^\mu (\boldsymbol{\xi}^\mu)^\top - \frac{P}{N}\mathbf{I}$$
Update Rule (Asynchronous)
$$s_i \leftarrow \text{sgn}\!\left(\sum_j w_{ij}\,s_j + \theta_i\right)$$
Storage Capacity
$$P_{\max} \approx \frac{N}{2\ln N}$$
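The Hebbian storage and sign-update rules above in NumPy, demonstrated on two orthogonal 8-bit patterns. The update here is synchronous for brevity, whereas the rule in the text is asynchronous; both minimize the same energy for these toy patterns:

```python
import numpy as np

def store(patterns):
    """Hebbian storage: W = (1/N) sum_mu xi xi^T, zero diagonal."""
    N = patterns.shape[1]
    W = patterns.T @ patterns / N
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, s, steps=5):
    """Repeated sign updates (thresholds theta = 0), synchronous for simplicity."""
    for _ in range(steps):
        s = np.where(W @ s >= 0, 1, -1)
    return s

xi = np.array([[1, -1, 1, -1, 1, -1, 1, -1],     # two orthogonal bipolar patterns
               [1, 1, -1, -1, 1, 1, -1, -1]])
W = store(xi)
probe = xi[0].copy()
probe[0] *= -1                                    # corrupt one bit
recovered = recall(W, probe)                      # converges back to the stored pattern
```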
Modern Hopfield Network (2020)
$$E = -\text{lse}(\beta\,\boldsymbol{\Xi}^\top\boldsymbol{\xi}) + \frac{1}{2}\boldsymbol{\xi}^\top\boldsymbol{\xi} + \text{const}$$
$$\text{Update:}\quad \boldsymbol{\xi}_{\text{new}} = \boldsymbol{\Xi}\,\text{softmax}(\beta\,\boldsymbol{\Xi}^\top\boldsymbol{\xi})$$
This update rule is equivalent to the attention mechanism in transformers.
31
Boltzmann Machine
Stochastic neural network based on statistical mechanics.
Generative
Stochastic
1985 — Hinton & Sejnowski
Energy Function
$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{v}^\top\mathbf{W}\mathbf{h} - \mathbf{b}^\top\mathbf{v} - \mathbf{c}^\top\mathbf{h} - \frac{1}{2}\mathbf{v}^\top\mathbf{L}\mathbf{v} - \frac{1}{2}\mathbf{h}^\top\mathbf{J}\mathbf{h}$$
Probability Distribution
$$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z}\exp(-E(\mathbf{v}, \mathbf{h})), \quad Z = \sum_{\mathbf{v},\mathbf{h}}\exp(-E(\mathbf{v},\mathbf{h}))$$
Stochastic Update
$$p(s_i = 1 | \mathbf{s}_{-i}) = \sigma\!\left(\sum_j w_{ij} s_j + b_i\right)$$
32
Restricted Boltzmann Machine (RBM)
A bipartite Boltzmann machine enabling efficient training via Gibbs sampling.
Generative
2006 — Hinton
Energy
$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{v}^\top\mathbf{W}\mathbf{h} - \mathbf{b}^\top\mathbf{v} - \mathbf{c}^\top\mathbf{h}$$
Conditional Distributions
$$p(h_j = 1|\mathbf{v}) = \sigma\!\left(\mathbf{W}_{:,j}^\top\mathbf{v} + c_j\right)$$
$$p(v_i = 1|\mathbf{h}) = \sigma\!\left(\mathbf{W}_{i,:}\mathbf{h} + b_i\right)$$
Contrastive Divergence (CD-k)
$$\Delta \mathbf{W} = \eta\left(\langle\mathbf{v}\mathbf{h}^\top\rangle_{\text{data}} - \langle\mathbf{v}\mathbf{h}^\top\rangle_{\text{recon}}\right)$$
Free Energy
$$F(\mathbf{v}) = -\mathbf{b}^\top\mathbf{v} - \sum_j \log\!\left(1 + \exp(\mathbf{W}_{:,j}^\top\mathbf{v} + c_j)\right)$$
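A CD-1 update in NumPy, following the conditional distributions and gradient estimate above. The toy dimensions, data, and learning rate are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, eta=0.1):
    """One CD-1 update: positive phase from data, negative phase from one Gibbs step."""
    ph0 = sigmoid(v0 @ W + c)                      # p(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0       # sample hidden states
    pv1 = sigmoid(h0 @ W.T + b)                    # p(v=1 | h0), reconstruction
    v1 = (rng.random(pv1.shape) < pv1) * 1.0
    ph1 = sigmoid(v1 @ W + c)
    dW = v0.T @ ph0 - v1.T @ ph1                   # <v h>_data - <v h>_recon
    return (W + eta * dW / len(v0),
            b + eta * (v0 - v1).mean(axis=0),
            c + eta * (ph0 - ph1).mean(axis=0))

v = rng.integers(0, 2, size=(16, 6)).astype(float)   # toy binary data batch
W = rng.normal(scale=0.1, size=(6, 4))
b, c = np.zeros(6), np.zeros(4)
W, b, c = cd1_step(v, W, b, c)
```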
33
Radial Basis Function Network
Using radial basis functions as activation in a single hidden layer.
Supervised
Function Approximation
Architecture
$$\phi_j(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right)$$
$$f(\mathbf{x}) = \sum_{j=1}^{K} w_j\,\phi_j(\mathbf{x}) + b = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}) + b$$
Training
Typically a two-phase process: (1) find centers $\boldsymbol{\mu}_j$ via k-means clustering; (2) solve for weights $\mathbf{w}$ via least squares:
$$\mathbf{w}^* = (\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{y}$$
Where $\Phi_{ij} = \phi_j(\mathbf{x}_i)$ is the interpolation matrix.
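The two-phase procedure in NumPy, with one simplification: centers are subsampled from the data rather than found by k-means. The 1-D sine target, center count, and width $\sigma = 1$ are illustrative assumptions:

```python
import numpy as np

X = np.linspace(-3, 3, 40)[:, None]                  # 40 one-dimensional inputs
y = np.sin(X).ravel()                                # target function

# Phase 1 (simplified): take every 8th point as a center instead of running k-means.
centers = X[::8]                                     # 5 evenly spaced centers
sigma = 1.0

# Phase 2: Gaussian design matrix, then ordinary least squares for the weights.
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
Phi = np.exp(-dists ** 2 / (2 * sigma ** 2))         # Phi_ij = phi_j(x_i)
Phi = np.hstack([Phi, np.ones((len(X), 1))])         # bias column
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # w* = (Phi^T Phi)^-1 Phi^T y
mse = np.mean((Phi @ w - y) ** 2)                    # training error of the fit
```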
34
Self-Organizing Map (SOM)
Unsupervised learning that maps high-dimensional data to a low-dimensional grid preserving topology.
Unsupervised
Dimensionality Reduction
1982 — Kohonen
Best Matching Unit (BMU)
$$c = \arg\min_j \|\mathbf{x} - \mathbf{w}_j\|$$
Weight Update
$$\mathbf{w}_j(t+1) = \mathbf{w}_j(t) + \eta(t)\,h_{cj}(t)\,\left(\mathbf{x}(t) - \mathbf{w}_j(t)\right)$$
Neighborhood Function
$$h_{cj}(t) = \exp\!\left(-\frac{\|r_c - r_j\|^2}{2\sigma(t)^2}\right)$$
Both $\eta(t)$ and $\sigma(t)$ decrease monotonically over training.
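The BMU search, neighborhood function, and decaying update above, sketched for a 1-D SOM learning to cover uniform data on $[0, 1]$. The grid size, schedules, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.random((10, 1))                 # 10 grid nodes, 1-D feature space
grid = np.arange(10, dtype=float)             # node positions on the map

def som_step(x, weights, eta, sigma):
    c = np.argmin(np.linalg.norm(weights - x, axis=1))        # best matching unit
    h = np.exp(-((grid - grid[c]) ** 2) / (2 * sigma ** 2))   # neighborhood function
    return weights + eta * h[:, None] * (x - weights)         # pull BMU and neighbors

T = 2000
for t in range(T):
    eta = 0.5 * (1 - t / T)                   # monotonically decaying learning rate
    sigma = 3.0 * (1 - t / T) + 0.1           # shrinking neighborhood radius
    weights = som_step(rng.random((1,)), weights, eta, sigma)
```

After training, the node weights spread out to quantize the data range while neighboring grid nodes hold nearby values.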
35
Residual Networks (ResNet)
Skip connections enabling training of very deep networks.
Supervised
Computer Vision
2015 — He et al.
Residual Block
$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$$
The network learns the residual $\mathcal{F}(\mathbf{x}) = \mathbf{y} - \mathbf{x}$ rather than the full mapping.
Bottleneck Block
$$\mathcal{F}(\mathbf{x}) = \mathbf{W}_3\,\text{ReLU}\!\left(\text{BN}\!\left(\mathbf{W}_2\,\text{ReLU}\!\left(\text{BN}(\mathbf{W}_1\mathbf{x})\right)\right)\right)$$
$\mathbf{W}_1$ reduces channels (1×1), $\mathbf{W}_2$ is 3×3 conv, $\mathbf{W}_3$ expands channels (1×1).
Gradient Flow
$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L}\left(1 + \frac{\partial}{\partial \mathbf{x}_l}\sum_{i=l}^{L-1}\mathcal{F}(\mathbf{x}_i)\right)$$
The "1 +" term ensures gradients can flow directly to any layer without attenuation.
Pre-Activation ResNet
$$\mathbf{y} = \mathbf{x} + \mathbf{W}_2\,\text{ReLU}\!\left(\text{BN}\!\left(\mathbf{W}_1\,\text{ReLU}(\text{BN}(\mathbf{x}))\right)\right)$$
36
Neural Ordinary Differential Equations
Continuous-depth networks defined by differential equations.
Architecture
2018 — Chen et al.
Continuous Dynamics
$$\frac{d\mathbf{h}(t)}{dt} = f_\theta(\mathbf{h}(t), t)$$
$$\mathbf{h}(T) = \mathbf{h}(0) + \int_0^T f_\theta(\mathbf{h}(t), t)\,dt$$
Adjoint Method (Memory-Efficient Backprop)
$$\mathbf{a}(t) = \frac{\partial \mathcal{L}}{\partial \mathbf{h}(t)}$$
$$\frac{d\mathbf{a}}{dt} = -\mathbf{a}(t)^\top \frac{\partial f_\theta}{\partial \mathbf{h}}$$
$$\frac{d\mathcal{L}}{d\theta} = -\int_T^0 \mathbf{a}(t)^\top \frac{\partial f_\theta(\mathbf{h}(t), t)}{\partial \theta}\,dt$$
Memory cost is $O(1)$ regardless of depth, since states are recomputed during the backward ODE solve.
Connection to ResNets
$$\text{ResNet:}\quad \mathbf{h}_{t+1} = \mathbf{h}_t + f_\theta(\mathbf{h}_t) \quad\longleftrightarrow\quad \text{Neural ODE:}\quad \frac{d\mathbf{h}}{dt} = f_\theta(\mathbf{h}, t)$$
37
Echo State Network (Reservoir Computing)
A fixed random recurrent reservoir with only output weights trained.
Supervised
Time Series
2001 — Jaeger
Reservoir Dynamics
$$\mathbf{h}(t) = (1-\alpha)\mathbf{h}(t-1) + \alpha\,\tanh\!\left(\mathbf{W}_{\text{res}}\mathbf{h}(t-1) + \mathbf{W}_{\text{in}}\mathbf{x}(t) + \mathbf{b}\right)$$
$\mathbf{W}_{\text{res}}$ and $\mathbf{W}_{\text{in}}$ are random and fixed. $\alpha$ is the leaking rate.
Output (Readout)
$$\mathbf{y}(t) = \mathbf{W}_{\text{out}}\,[\mathbf{h}(t);\, \mathbf{x}(t)]$$
$$\mathbf{W}_{\text{out}} = \mathbf{Y}\mathbf{H}^\top(\mathbf{H}\mathbf{H}^\top + \lambda\mathbf{I})^{-1}$$
Echo State Property
The reservoir must satisfy the echo state property: the influence of initial conditions must fade over time. In practice this is enforced by rescaling $\mathbf{W}_{\text{res}}$ so that its spectral radius satisfies $\rho(\mathbf{W}_{\text{res}}) < 1$.
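The whole pipeline fits in a short NumPy sketch: fixed random reservoir, ridge-regression readout as in the equation above. One simplification is assumed — the readout sees only $\mathbf{h}(t)$, not $[\mathbf{h}(t); \mathbf{x}(t)]$ — and the task (one-step-ahead prediction of a sine wave), reservoir size, and $\lambda$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 500                                        # reservoir size, sequence length

W_res = rng.normal(size=(N, N))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))  # rescale to spectral radius 0.9
W_in = rng.normal(scale=0.5, size=(N,))

x = np.sin(0.1 * np.arange(T + 1))                     # signal; predict x(t+1) from x(t)
H = np.zeros((N, T))
h = np.zeros(N)
for t in range(T):                                     # leaking rate alpha = 1 for brevity
    h = np.tanh(W_res @ h + W_in * x[t])
    H[:, t] = h

Y = x[1:T + 1][None, :]                                # targets: next value
lam = 1e-6                                             # ridge regularizer
W_out = Y @ H.T @ np.linalg.inv(H @ H.T + lam * np.eye(N))
mse = np.mean((W_out @ H - Y) ** 2)                    # readout fit error
```

Only `W_out` is trained; the reservoir and input weights stay frozen throughout.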
38
Spiking Neural Network
Biologically plausible networks where neurons communicate via discrete spikes.
Neuromorphic
Event-Driven
Leaky Integrate-and-Fire (LIF) Model
$$\tau_m \frac{dV(t)}{dt} = -[V(t) - V_{\text{rest}}] + R\,I(t)$$
$$\text{If } V(t) \geq V_{\text{th}}:\quad \text{emit spike, } V(t) \leftarrow V_{\text{reset}}$$
Discrete LIF
$$V[t] = \beta\,V[t-1] + \sum_j w_j\,S_j[t] - V_{\text{th}}\,S_{\text{out}}[t-1]$$
$$S_{\text{out}}[t] = \Theta(V[t] - V_{\text{th}})$$
Where $\beta = \exp(-\Delta t / \tau_m)$ is the decay factor and $\Theta$ is the Heaviside step function.
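The discrete LIF recurrence above, simulated in NumPy under a constant input current. The drive strength, time constant, and threshold are illustrative assumptions:

```python
import numpy as np

def simulate_lif(I, tau_m=10.0, dt=1.0, v_th=1.0):
    """Discrete LIF with soft reset: V[t] = beta*V[t-1] + I[t] - v_th*S[t-1]."""
    beta = np.exp(-dt / tau_m)                # membrane decay per step
    V, s_prev, spikes = 0.0, 0, []
    for i_t in I:
        V = beta * V + i_t - v_th * s_prev    # leak, integrate, subtract last spike
        s_prev = int(V >= v_th)               # Heaviside threshold
        spikes.append(s_prev)
    return np.array(spikes)

spikes = simulate_lif(np.full(100, 0.2))      # constant drive strong enough to spike
rate = spikes.mean()                          # firing rate over the window
```

With this drive the membrane charges toward $V_\infty = 0.2/(1-\beta) \approx 2.1 > V_{\text{th}}$, so the neuron fires periodically.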
Surrogate Gradient
Since $\Theta'(x) = \delta(x)$ is not useful for backprop, replace with a smooth surrogate:
$$\tilde{\Theta}'(x) = \frac{1}{\pi}\cdot\frac{1}{1 + (\pi x)^2} \quad\text{(arctan surrogate)}$$
Spike-Timing-Dependent Plasticity (STDP)
$$\Delta w = \begin{cases} A_+ \exp\!\left(-\frac{\Delta t}{\tau_+}\right) & \text{if } \Delta t > 0 \text{ (pre before post)} \\ -A_- \exp\!\left(\frac{\Delta t}{\tau_-}\right) & \text{if } \Delta t < 0 \text{ (post before pre)} \end{cases}$$
39
Kolmogorov-Arnold Network (KAN)
Learnable activation functions on edges, based on the Kolmogorov-Arnold representation theorem.
Architecture
2024 — Liu et al.
Kolmogorov-Arnold Representation Theorem
$$f(\mathbf{x}) = f(x_1, \dots, x_n) = \sum_{q=0}^{2n}\Phi_q\!\left(\sum_{p=1}^n \phi_{q,p}(x_p)\right)$$
KAN Layer
Each edge $(i, j)$ has a learnable univariate function $\phi_{ij}$, parameterized by B-splines:
$$\phi_{ij}(x) = w_b\,\text{SiLU}(x) + w_s\,\text{Spline}(x)$$
$$\text{Spline}(x) = \sum_k c_k\,B_k(x)$$
Layer Computation
$$x_j^{(l+1)} = \sum_{i=1}^{n_l} \phi_{ij}^{(l)}(x_i^{(l)})$$
Compared to MLPs, which place fixed activations on nodes and learnable linear weights on edges, KANs place learnable nonlinear functions on edges and use simple summation on nodes.
40
State Space Models (S4 / Mamba)
Sequence models based on continuous-time state space representations with efficient linear-time computation.
Sequence Modeling
2021 — Gu et al.
Continuous State Space
$$\frac{d\mathbf{h}(t)}{dt} = \mathbf{A}\,\mathbf{h}(t) + \mathbf{B}\,x(t)$$
$$y(t) = \mathbf{C}\,\mathbf{h}(t) + D\,x(t)$$
Discretization (Zero-Order Hold)
$$\bar{\mathbf{A}} = \exp(\Delta\mathbf{A}) \approx (\mathbf{I} - \Delta\mathbf{A}/2)^{-1}(\mathbf{I} + \Delta\mathbf{A}/2)$$
$$\bar{\mathbf{B}} = (\Delta\mathbf{A})^{-1}(\bar{\mathbf{A}} - \mathbf{I})\cdot\Delta\mathbf{B}$$
Discrete Recurrence
$$\mathbf{h}_k = \bar{\mathbf{A}}\,\mathbf{h}_{k-1} + \bar{\mathbf{B}}\,x_k$$
$$y_k = \mathbf{C}\,\mathbf{h}_k + D\,x_k$$
Convolution Form
$$\bar{\mathbf{K}} = (\mathbf{C}\bar{\mathbf{B}},\; \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\; \dots,\; \mathbf{C}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}})$$
$$\mathbf{y} = \bar{\mathbf{K}} * \mathbf{x}$$
Computed in $O(L \log L)$ via FFT during training.
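The recurrent and convolutional views compute the same outputs, which a few lines of NumPy can verify. A diagonal, already-discretized $\bar{\mathbf{A}}$ and $D = 0$ are assumed for simplicity, and the convolution is done naively rather than via FFT:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 32                                    # state size, sequence length
A = np.diag(rng.uniform(0.5, 0.9, N))           # stable diagonal A-bar (toy choice)
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=L)

# Recurrent form: h_k = A h_{k-1} + B x_k,  y_k = C h_k
h = np.zeros((N, 1))
y_rec = np.zeros(L)
for k in range(L):
    h = A @ h + B * x[k]
    y_rec[k] = (C @ h).item()

# Convolution form: y = K * x with kernel K_j = C A^j B
K = np.array([(C @ np.linalg.matrix_power(A, j) @ B).item() for j in range(L)])
y_conv = np.array([sum(K[j] * x[k - j] for j in range(k + 1)) for k in range(L)])
```

The recurrence is used for $O(1)$-per-step autoregressive inference; the convolution (with FFT) for parallel training.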
HiPPO Initialization
$$A_{nk} = -\begin{cases} (2n+1)^{1/2}(2k+1)^{1/2} & \text{if } n > k \\ n+1 & \text{if } n = k \\ 0 & \text{if } n < k \end{cases}$$
Selective SSM (Mamba)
Makes parameters input-dependent for content-aware reasoning:
$$\mathbf{B}_k = s_B(\mathbf{x}_k), \quad \mathbf{C}_k = s_C(\mathbf{x}_k), \quad \Delta_k = \text{softplus}(s_\Delta(\mathbf{x}_k))$$
41
Hypernetworks
Networks that generate the weights of another network.
Meta-Learning
2016 — Ha, Dai, Le
Formulation
$$\boldsymbol{\theta} = h_\psi(\mathbf{z})$$
$$\hat{\mathbf{y}} = f_{\boldsymbol{\theta}}(\mathbf{x}) = f_{h_\psi(\mathbf{z})}(\mathbf{x})$$
The hypernetwork $h_\psi$ maps an embedding $\mathbf{z}$ (which can be task-specific, layer-specific, or input-dependent) to the parameters of the main network $f$.
Training
$$\mathcal{L}(\psi) = \mathbb{E}\!\left[\ell\!\left(f_{h_\psi(\mathbf{z})}(\mathbf{x}),\, \mathbf{y}\right)\right]$$
$$\nabla_\psi\mathcal{L} = \nabla_\theta\ell \cdot \frac{\partial h_\psi(\mathbf{z})}{\partial \psi}$$
42
Neural Cellular Automata
Learned local update rules that produce global emergent behavior.
Self-Organizing
Morphogenesis
2020 — Mordvintsev et al.
Cell State Update
$$\text{Perception:}\quad \mathbf{p}_i = [\text{Sobel}_x * \mathbf{s}_i;\; \text{Sobel}_y * \mathbf{s}_i;\; \mathbf{s}_i]$$
$$\text{Update:}\quad \Delta\mathbf{s}_i = f_\theta(\mathbf{p}_i)$$
$$\text{Stochastic mask:}\quad m_i \sim \text{Bernoulli}(p)$$
$$\mathbf{s}_i^{t+1} = \mathbf{s}_i^t + m_i \cdot \Delta\mathbf{s}_i$$
All cells share the same neural network $f_\theta$, and the stochastic update mask enforces asynchrony for robustness.
Training via Differentiable Simulation
$$\mathcal{L} = \mathbb{E}_{t\sim[t_{\min}, t_{\max}]}\!\left[\|\mathbf{S}^{(t)} - \mathbf{S}_{\text{target}}\|^2\right]$$
Gradients are backpropagated through time across the simulation steps.
43
Neural Turing Machine / Differentiable Neural Computer
Neural networks augmented with external differentiable memory — capable of learning algorithms.
Supervised
Algorithmic Reasoning
2014 — Graves et al.
Architecture
A controller network (LSTM or MLP) interacts with an external memory matrix $\mathbf{M} \in \mathbb{R}^{N \times M}$ via differentiable read/write heads:
Addressing — Content-Based
$$w_t^c(i) = \frac{\exp(\beta_t\, K[\mathbf{k}_t, \mathbf{M}_t(i)])}{\sum_j \exp(\beta_t\, K[\mathbf{k}_t, \mathbf{M}_t(j)])}$$
$$K[\mathbf{u}, \mathbf{v}] = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|} \quad \text{(cosine similarity)}$$
Addressing — Location-Based
$$\mathbf{w}_t^g = g_t\,\mathbf{w}_t^c + (1-g_t)\,\mathbf{w}_{t-1} \quad \text{(interpolation)}$$
$$\tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j) \quad \text{(convolutional shift)}$$
$$w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}} \quad \text{(sharpening)}$$
Read & Write
$$\text{Read:}\quad \mathbf{r}_t = \sum_i w_t^r(i)\,\mathbf{M}_t(i)$$
$$\text{Write:}\quad \mathbf{M}_t = \mathbf{M}_{t-1}\odot(\mathbf{1} - \mathbf{w}_t^w\,\mathbf{e}_t^\top) + \mathbf{w}_t^w\,\mathbf{a}_t^\top$$
$\mathbf{e}_t$ is the erase vector and $\mathbf{a}_t$ is the add vector.
44
Bayesian Neural Network
Placing probability distributions over weights for principled uncertainty quantification.
Probabilistic
Uncertainty Estimation
Bayesian Inference over Weights
$$p(\boldsymbol{\theta}|\mathcal{D}) = \frac{p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{p(\mathcal{D})} = \frac{p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{\int p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}}$$
Predictive Distribution
$$p(\mathbf{y}^*|\mathbf{x}^*, \mathcal{D}) = \int p(\mathbf{y}^*|\mathbf{x}^*, \boldsymbol{\theta})\,p(\boldsymbol{\theta}|\mathcal{D})\,d\boldsymbol{\theta}$$
$$\approx \frac{1}{S}\sum_{s=1}^{S} p(\mathbf{y}^*|\mathbf{x}^*, \boldsymbol{\theta}^{(s)}), \quad \boldsymbol{\theta}^{(s)} \sim p(\boldsymbol{\theta}|\mathcal{D})$$
Variational Inference (Bayes by Backprop)
Approximate the intractable posterior with $q_\phi(\boldsymbol{\theta})$:
$$\mathcal{L}_{\text{VI}} = \text{KL}(q_\phi(\boldsymbol{\theta})\,\|\,p(\boldsymbol{\theta})) - \mathbb{E}_{q_\phi}[\log p(\mathcal{D}|\boldsymbol{\theta})]$$
With the reparameterization trick: $\theta_i = \mu_i + \sigma_i\,\epsilon$, $\epsilon\sim\mathcal{N}(0,1)$.
MC Dropout as Approximate BNN
$$\text{Var}[\mathbf{y}^*] \approx \frac{1}{T}\sum_{t=1}^{T}\hat{\mathbf{y}}_t^2 - \left(\frac{1}{T}\sum_{t=1}^T \hat{\mathbf{y}}_t\right)^2$$
Running $T$ forward passes with dropout enabled at test time provides uncertainty estimates.
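A minimal MC-dropout sketch in NumPy: a toy random regression net with the dropout mask resampled on every forward pass, exactly as at test time. The layer sizes, dropout rate, and $T = 200$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(1, 32))          # toy untrained weights
W2 = rng.normal(scale=0.5, size=(32, 1))

def forward(x, p=0.5):
    h = np.maximum(x @ W1, 0)                     # ReLU hidden layer
    mask = (rng.random(h.shape) > p) / (1 - p)    # inverted dropout, resampled per call
    return (h * mask) @ W2

x = np.array([[1.0]])
preds = np.array([forward(x).item() for _ in range(200)])   # T stochastic passes
mean, var = preds.mean(), preds.var()             # predictive mean and uncertainty
```

The spread of `preds` is the (approximate) epistemic uncertainty from the variance formula above.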
45
Liquid Neural Network
Continuous-time neural networks with input-dependent dynamics — inspired by C. elegans neuroscience.
Neuromorphic
Time Series / Robotics
2021 — Hasani et al. (MIT)
Liquid Time-Constant (LTC) Neuron
$$\frac{d\mathbf{h}(t)}{dt} = -\left[\frac{1}{\tau} + f_\theta(\mathbf{h}(t), \mathbf{x}(t))\right]\odot\mathbf{h}(t) + f_\theta(\mathbf{h}(t), \mathbf{x}(t))\odot A$$
The key insight: the time constant $\tau$ is modulated by the input, making dynamics input-dependent.
Neural Circuit Policy
$$f_\theta(\mathbf{h}, \mathbf{x}) = \sigma\!\left(\mathbf{W}\,[\mathbf{h};\,\mathbf{x}] + \mathbf{b}\right)$$
$$\tau_{\text{eff}}(t) = \frac{\tau}{1 + \tau\,f_\theta(\mathbf{h}(t), \mathbf{x}(t))}$$
Closed-Form Continuous-Depth (CfC)
An analytical solution avoiding ODE solvers:
$$\mathbf{h}(t) = \left(\mathbf{h}_0 - f_\infty\right)\odot\exp\!\left(-\frac{t}{\tau_{\text{eff}}}\right) + f_\infty$$
Where $f_\infty = A\,\sigma(\mathbf{W}[\mathbf{h}_0;\,\mathbf{x}] + \mathbf{b})$ is the steady-state.
Properties
Liquid networks are remarkably compact (a control circuit of 19 neurons has steered a car in lane-keeping experiments) and comparatively interpretable thanks to their neuroscience-inspired wiring.
46
Mixture Density Network
Predicting full conditional probability distributions using a mixture of Gaussians.
Supervised
Multi-Modal Regression
1994 — Bishop
Output Parameterization
A neural network outputs the parameters of a Gaussian mixture model:
$$p(\mathbf{y}|\mathbf{x}) = \sum_{k=1}^{K}\pi_k(\mathbf{x})\,\mathcal{N}\!\left(\mathbf{y};\, \boldsymbol{\mu}_k(\mathbf{x}),\, \sigma_k^2(\mathbf{x})\mathbf{I}\right)$$
Network Outputs
$$\boldsymbol{\pi}(\mathbf{x}) = \text{softmax}(\mathbf{z}_\pi), \quad \sum_k\pi_k = 1$$
$$\boldsymbol{\mu}_k(\mathbf{x}) = \mathbf{z}_{\mu_k} \quad \text{(unconstrained)}$$
$$\sigma_k(\mathbf{x}) = \exp(\mathbf{z}_{\sigma_k}) \quad \text{(positive)}$$
Loss (Negative Log-Likelihood)
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \log\sum_{k=1}^K \pi_k(\mathbf{x}_i)\,\mathcal{N}(\mathbf{y}_i;\, \boldsymbol{\mu}_k(\mathbf{x}_i), \sigma_k^2(\mathbf{x}_i))$$
MDNs can model one-to-many mappings (e.g., inverse kinematics, handwriting generation) where a single input maps to multiple valid outputs.
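The negative log-likelihood above for scalar targets, with the output constraints (softmax for $\pi$, exp for $\sigma$) applied inside and a log-sum-exp over components for numerical stability. The toy batch and $K = 3$ are illustrative assumptions:

```python
import numpy as np

def mdn_nll(z_pi, z_mu, z_sigma, y):
    """NLL of scalar targets y under a K-component Gaussian mixture."""
    log_pi = z_pi - np.log(np.exp(z_pi).sum(axis=1, keepdims=True))  # log softmax
    sigma = np.exp(z_sigma)                                          # positivity
    # log N(y; mu_k, sigma_k^2) per component: -0.5 log 2pi - log sigma - (y-mu)^2/(2 sigma^2)
    log_norm = (-0.5 * np.log(2 * np.pi) - z_sigma
                - 0.5 * ((y[:, None] - z_mu) / sigma) ** 2)
    a = log_pi + log_norm
    m = a.max(axis=1, keepdims=True)                                 # log-sum-exp trick
    return -np.mean(m.ravel() + np.log(np.exp(a - m).sum(axis=1)))

rng = np.random.default_rng(0)
N, K = 10, 3
nll = mdn_nll(rng.normal(size=(N, K)),   # unconstrained mixture-weight logits
              rng.normal(size=(N, K)),   # component means
              np.zeros((N, K)),          # log sigma = 0  ->  sigma = 1
              rng.normal(size=N))        # targets
```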
47
WaveNet
Autoregressive generative model with dilated causal convolutions for raw audio synthesis.
Generative
Audio / Speech
2016 — van den Oord et al. (DeepMind)
Autoregressive Formulation
$$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t | x_1, \dots, x_{t-1})$$
Dilated Causal Convolutions
Stack convolutions with exponentially increasing dilation rates to grow the receptive field efficiently:
$$(f *_d x)_t = \sum_{k=0}^{K-1} f_k \cdot x_{t - d \cdot k}$$
$$\text{Dilations:}\quad d = 1, 2, 4, 8, \dots, 512 \quad \text{(repeated)}$$
$$\text{Receptive field} = \text{blocks} \times \sum_{l=0}^{L-1} 2^l \times (K-1) + 1$$
Gated Activation
$$\mathbf{z} = \tanh(\mathbf{W}_{f,k} * \mathbf{x}) \odot \sigma(\mathbf{W}_{g,k} * \mathbf{x})$$
Conditional WaveNet
$$\mathbf{z} = \tanh(\mathbf{W}_f * \mathbf{x} + \mathbf{V}_f * \mathbf{c}) \odot \sigma(\mathbf{W}_g * \mathbf{x} + \mathbf{V}_g * \mathbf{c})$$
Where $\mathbf{c}$ is a conditioning signal (e.g., mel spectrogram, speaker ID, linguistic features).
μ-Law Quantization
$$f(x_t) = \text{sign}(x_t)\frac{\ln(1 + \mu|x_t|)}{\ln(1+\mu)}, \quad \mu = 255$$
Compresses the 16-bit audio range into 256 values for categorical output via softmax.
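The companding equation and its inverse in NumPy, mapping $[-1, 1]$ audio to the 256 discrete classes WaveNet predicts with its softmax:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand [-1, 1] audio and quantize to integers in {0, ..., mu}."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # companding, still in [-1, 1]
    return ((y + 1) / 2 * mu + 0.5).astype(int)                # round to 256 levels

def mu_law_decode(q, mu=255):
    """Invert the quantized code back to an approximate waveform value."""
    y = 2 * (q / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1, 1, 101)
q = mu_law_encode(x)
x_hat = mu_law_decode(q)            # small round-trip error, largest near |x| = 1
```

The logarithmic spacing allocates most of the 256 levels to quiet samples, where human hearing is most sensitive.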
48
Large Language Model (LLM) Architecture
How modern LLMs like GPT-4, Claude, LLaMA, Gemini, and Mistral combine the neural network building blocks documented above into a single coherent system.
LLM Core
Self-Supervised + RLHF
Language / Multimodal
2017–present
Component Map — What LLMs Use
Every building block below is documented in detail in the sections above. An LLM is fundamentally a decoder-only Transformer composed of these pieces:
Transformer (Decoder-Only)
The core architecture. Stacked blocks with causal self-attention, preventing tokens from attending to future positions.
→ Section 13: Transformer
Scaled Dot-Product Attention
The fundamental operation: $\text{softmax}(\mathbf{QK}^\top/\sqrt{d_k})\mathbf{V}$. Every token attends to all previous tokens.
→ Section 12: Attention Mechanism
Multi-Head / GQA
Parallel attention heads capture different relationship types. GQA shares KV heads to reduce memory 4–8×.
→ Section 13: Multi-Head Attention
RoPE Positional Encoding
Rotary embeddings encode relative position directly into Q/K vectors. Used by LLaMA, Mistral, Claude, Gemma.
→ Section 13: RoPE
Feed-Forward Network (MLP)
Two-layer MLP at each position: project up 4×, apply SiLU/GELU, project back down. The "memory" of the model.
→ Section 2: MLP, Section 13: FFN
SiLU / GELU Activation
Smooth activations used inside transformer FFNs. SiLU (Swish) in LLaMA/Mistral; GELU in GPT/BERT.
→ Section 3: Activation Functions
RMSNorm / LayerNorm
Normalizes activations for training stability. Modern LLMs prefer RMSNorm (no mean subtraction, faster).
→ Section 6: Regularization
Residual Connections
Skip connections around every attention and FFN sublayer. Essential for training 100+ layer models.
→ Section 35: Residual Networks
AdamW Optimizer
Adam with decoupled weight decay. The standard optimizer for LLM pretraining.
→ Section 5: Optimizers
Backpropagation
Gradient computation through billions of parameters. Combined with gradient checkpointing for memory efficiency.
→ Section 4: Backpropagation
Dropout (Optional)
Used in GPT-2/3 training. Many modern LLMs (LLaMA, PaLM) omit dropout entirely, relying on data scale.
→ Section 6: Regularization
SSM / Mamba (Hybrid)
Some architectures (Jamba, Zamba) combine transformer blocks with Mamba SSM layers for linear-time long sequences.
→ Section 40: State Space Models
Tokenization
LLMs do not operate on raw characters. Text is first split into subword tokens using algorithms like BPE (Byte Pair Encoding), which iteratively merges the most frequent byte pairs:
$$\text{BPE: merge}(a, b) = ab \quad\text{where}\quad (a,b) = \arg\max_{(x,y)} \text{count}(xy)$$
$$|\mathcal{V}| \;\text{typically}\; 32{,}000 \;\text{to}\; 128{,}000 \;\text{tokens}$$
Each token is mapped to an integer ID, which is then looked up in the embedding table.
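One BPE merge iteration can be sketched directly from the argmax rule above. The toy corpus (words as tuples of symbols with frequencies) is an illustrative assumption; real tokenizers start from bytes and run thousands of merges:

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE iteration: find the most frequent adjacent pair, merge it everywhere."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq                      # count(xy) weighted by word frequency
    (a, b), _ = pairs.most_common(1)[0]                # argmax pair
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):                           # rewrite the word with ab fused
            if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged, (a, b)

corpus = {("l", "o", "w"): 5, ("s", "l", "o", "w"): 3, ("l", "o", "g"): 2}
corpus, pair = bpe_merge_step(corpus)   # ("l", "o") occurs 10 times -> merged to "lo"
```

Repeating this until the vocabulary reaches its budget yields the subword inventory.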
Token & Positional Embeddings
$$\mathbf{h}_0 = \mathbf{E}_{\text{tok}}[x_1, x_2, \dots, x_n] + \mathbf{E}_{\text{pos}}$$
$$\mathbf{E}_{\text{tok}} \in \mathbb{R}^{|\mathcal{V}| \times d}, \quad d \;\text{typically}\; 4096\;\text{to}\;16384$$
Modern LLMs typically use RoPE instead of learned positional embeddings, applied directly to Q and K vectors inside each attention layer rather than added to the input.
Weight Tying
Many LLMs share the token embedding matrix with the output projection (language model head):
$$\text{logits} = \mathbf{h}_L \cdot \mathbf{E}_{\text{tok}}^\top \in \mathbb{R}^{n \times |\mathcal{V}|}$$
The LLM Transformer Block (Full Equations)
A modern LLM (e.g. LLaMA-style) stacks $L$ identical blocks. Each block performs:
Step 1: Pre-Norm + Causal Multi-Head Attention + Residual
$$\mathbf{x}' = \text{RMSNorm}(\mathbf{h}^{(l)})$$
$$\mathbf{Q} = \mathbf{x}'\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{x}'\mathbf{W}_K, \quad \mathbf{V} = \mathbf{x}'\mathbf{W}_V$$
$$\mathbf{Q} = \text{RoPE}(\mathbf{Q}), \quad \mathbf{K} = \text{RoPE}(\mathbf{K})$$
$$\text{Attn} = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}} + \mathbf{M}_{\text{causal}}\right)\mathbf{V}$$
$$\mathbf{h}^{(l)}_{\text{mid}} = \mathbf{h}^{(l)} + \text{MultiHead}(\text{Attn})$$
Step 2: Pre-Norm + SwiGLU FFN + Residual
$$\mathbf{x}'' = \text{RMSNorm}(\mathbf{h}^{(l)}_{\text{mid}})$$
$$\text{SwiGLU}(\mathbf{x}'') = (\mathbf{x}''\mathbf{W}_1 \odot \text{SiLU}(\mathbf{x}''\mathbf{W}_{\text{gate}}))\,\mathbf{W}_2$$
$$\mathbf{h}^{(l+1)} = \mathbf{h}^{(l)}_{\text{mid}} + \text{SwiGLU}(\mathbf{x}'')$$
Where $\mathbf{W}_1, \mathbf{W}_{\text{gate}} \in \mathbb{R}^{d \times d_{\text{ff}}}$ and $\mathbf{W}_2 \in \mathbb{R}^{d_{\text{ff}} \times d}$, with $d_{\text{ff}} \approx \frac{8}{3}d$ (SwiGLU adjustment).
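The SwiGLU FFN in Step 2 is a one-liner in NumPy. The tiny dimensions and random weights are illustrative assumptions (real models use $d$ in the thousands):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 16, 40                                   # d_ff roughly (8/3) d, rounded

W1 = rng.normal(scale=0.1, size=(d, d_ff))         # up projection
W_gate = rng.normal(scale=0.1, size=(d, d_ff))     # gate projection
W2 = rng.normal(scale=0.1, size=(d_ff, d))         # down projection

def silu(x):
    return x / (1.0 + np.exp(-x))                  # SiLU(x) = x * sigmoid(x)

def swiglu_ffn(x):
    """(x W1 ⊙ SiLU(x W_gate)) W2, applied independently at each position."""
    return (x @ W1 * silu(x @ W_gate)) @ W2

x = rng.normal(size=(3, d))                        # 3 token positions
out = swiglu_ffn(x)                                # same shape as the input
```

The gate halves the effective FFN width, which is why $d_{\text{ff}}$ is shrunk from $4d$ to about $\frac{8}{3}d$ to keep the parameter count matched.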
Final Output
$$\mathbf{h}_{\text{final}} = \text{RMSNorm}(\mathbf{h}^{(L)})$$
$$P(x_{n+1} | x_1, \dots, x_n) = \text{softmax}(\mathbf{h}_{\text{final}}[n]\,\mathbf{W}_{\text{head}})$$
┌─────────────────────────────────────────────────────┐
│ FULL LLM FORWARD PASS │
├─────────────────────────────────────────────────────┤
│ │
│ Tokens: [x₁, x₂, ..., xₙ] │
│ │ │
│ ▼ │
│ ┌──────────┐ │
│ │ Embedding │ h₀ = E_tok[tokens] │
│ └────┬─────┘ │
│ │ │
│ ▼ ×L layers │
│ ╔═══════════════════════════════════╗ │
│ ║ ┌──────────┐ ║ │
│ ║ │ RMSNorm │ ║ │
│ ║ └────┬─────┘ ║ │
│ ║ ▼ ║ │
│ ║ ┌──────────────────────┐ ║ │
│ ║ │ Causal Multi-Head │ ║ │
│ ║ │ Attention + RoPE │◄─ KV Cache │
│ ║ │ (with GQA) │ ║ │
│ ║ └────┬─────────────────┘ ║ │
│ ║ │ + residual ║ │
│ ║ ▼ ║ │
│ ║ ┌──────────┐ ║ │
│ ║ │ RMSNorm │ ║ │
│ ║ └────┬─────┘ ║ │
│ ║ ▼ ║ │
│ ║ ┌──────────────────────┐ ║ │
│ ║ │ SwiGLU FFN │ ║ │
│ ║ │ (W₁ ⊙ SiLU(W_gate)) │ ║ │
│ ║ │ × W₂ │ ║ │
│ ║ └────┬─────────────────┘ ║ │
│ ║ │ + residual ║ │
│ ╚═══════╪═══════════════════════════╝ │
│ ▼ │
│ ┌──────────┐ │
│ │ RMSNorm │ (final) │
│ └────┬─────┘ │
│ ▼ │
│ ┌──────────┐ │
│ │ LM Head │ logits = h · W_head │
│ └────┬─────┘ │
│ ▼ │
│ ┌──────────┐ │
│ │ Softmax │ → P(next token) │
│ └──────────┘ │
└─────────────────────────────────────────────────────┘
KV Cache (Inference Optimization)
During autoregressive generation, previously computed key and value vectors are cached to avoid redundant computation:
$$\text{At step } t: \quad \mathbf{K}_{\text{cache}} = [\mathbf{k}_1, \mathbf{k}_2, \dots, \mathbf{k}_t], \quad \mathbf{V}_{\text{cache}} = [\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_t]$$
$$\text{Only compute:}\quad \mathbf{q}_t = \mathbf{x}_t\mathbf{W}_Q, \quad \mathbf{k}_t = \mathbf{x}_t\mathbf{W}_K, \quad \mathbf{v}_t = \mathbf{x}_t\mathbf{W}_V$$
$$\text{Attend:}\quad \mathbf{o}_t = \text{softmax}\!\left(\frac{\mathbf{q}_t \mathbf{K}_{\text{cache}}^\top}{\sqrt{d_k}}\right)\mathbf{V}_{\text{cache}}$$
KV Cache Memory
$$\text{Memory} = 2 \times L \times n_{\text{kv\_heads}} \times d_k \times n_{\text{seq}} \times \text{bytes per param}$$
For a 70B model with 8K context in FP16: ~2–4 GB of KV cache per sequence.
PagedAttention (vLLM)
Manages KV cache as virtual memory pages to eliminate fragmentation and enable efficient batching of variable-length sequences.
LLM Training Pipeline
Phase 1: Pre-Training (Next Token Prediction)
$$\mathcal{L}_{\text{pretrain}} = -\sum_{t=1}^{T} \log P_\theta(x_t | x_1, \dots, x_{t-1})$$
$$= -\sum_{t=1}^{T}\log\frac{\exp(\mathbf{h}_t^\top \mathbf{e}_{x_t})}{\sum_{v\in\mathcal{V}}\exp(\mathbf{h}_t^\top \mathbf{e}_v)}$$
Trained on trillions of tokens from web text, books, code, etc.
Phase 2: Supervised Fine-Tuning (SFT)
$$\mathcal{L}_{\text{SFT}} = -\sum_{t \in \text{response}} \log P_\theta(x_t | \text{prompt}, x_1, \dots, x_{t-1})$$
Only the response tokens contribute to the loss; prompt tokens are masked.
Phase 3: RLHF (Reinforcement Learning from Human Feedback)
Step 3a: Train a reward model $r_\phi$ on human preference pairs $(y_w \succ y_l)$:
$$\mathcal{L}_{\text{reward}} = -\mathbb{E}_{(x, y_w, y_l)}\!\left[\log\sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$
Step 3b: Optimize the policy with PPO, constrained by a KL penalty from the reference model $\pi_{\text{ref}}$:
$$\max_{\pi_\theta}\; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(y|x)}\!\left[r_\phi(x,y)\right] - \beta\,\text{KL}\!\left(\pi_\theta(y|x)\,\|\,\pi_{\text{ref}}(y|x)\right)$$
DPO (Direct Preference Optimization)
Bypasses the reward model entirely by reparameterizing the RLHF objective:
$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$
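Given summed log-probabilities of the chosen and rejected responses under the policy and reference models, the DPO loss is a few lines. The toy log-prob values and $\beta = 0.1$ are illustrative assumptions:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from per-response summed log-probs (w = chosen, l = rejected)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-margin))))   # -log sigmoid(margin)

# toy batch of 2 preference pairs where the policy slightly prefers the chosen response
loss = dpo_loss(np.array([-10.0, -8.0]), np.array([-12.0, -9.0]),
                np.array([-11.0, -8.5]), np.array([-11.0, -8.5]))
```

When the policy matches the reference exactly, the margin is zero and the loss equals $\log 2$; preferring the chosen responses pushes it below that.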
Scaling Laws
Kaplan Scaling (OpenAI, 2020)
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$
Loss follows power laws in model parameters $N$, dataset size $D$, and compute $C$.
Chinchilla Optimal (Hoffmann et al., 2022)
$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$
$$\text{Rule of thumb:}\quad D \approx 20 \times N$$
For a given compute budget, model size and data should be scaled equally — a 10B model needs ~200B tokens.
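Using the common $C \approx 6ND$ FLOPs estimate for dense transformer training (an assumption not stated above), the $D \approx 20N$ rule pins down both quantities from a compute budget:

```python
def chinchilla_allocation(compute_flops, ratio=20.0):
    """Split a FLOPs budget into params N and tokens D under the
    assumed C = 6*N*D training-cost model and D = ratio*N."""
    n_params = (compute_flops / (6.0 * ratio)) ** 0.5  # N = sqrt(C / (6*ratio))
    n_tokens = ratio * n_params                        # D = ratio * N
    return n_params, n_tokens

# the 10B-params / 200B-tokens example above corresponds to C ~ 1.2e22 FLOPs
n, d = chinchilla_allocation(1.2e22)
```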
LLM Parameter Count
$$N \approx 12\,L\,d^2 \quad\text{(for standard transformer with } d_{\text{ff}} = 4d\text{)}$$
| Model | Layers $L$ | Dim $d$ | Heads $h$ | Params |
| GPT-2 | 48 | 1600 | 25 | 1.5B |
| LLaMA-2 7B | 32 | 4096 | 32 | 6.7B |
| LLaMA-2 70B | 80 | 8192 | 64 | 70B |
| GPT-4 (est.) | 120 | ~12288 | 96 | ~1.8T (MoE) |
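A quick sanity check of the $12Ld^2$ estimate against the table (embeddings and attention variants such as GQA are ignored, so the numbers are rough):

```python
def approx_params(n_layers, d_model):
    # per layer: attention 4*d^2 (Q, K, V, output) + FFN 8*d^2 (with d_ff = 4d)
    return 12 * n_layers * d_model ** 2

llama2_7b = approx_params(32, 4096)   # ~6.4B, close to the listed 6.7B
gpt2_xl = approx_params(48, 1600)     # ~1.47B, close to the listed 1.5B
```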
Mixture of Experts (MoE)
Replace the dense FFN with a sparse set of expert FFNs, routing each token to the top-$k$ experts:
$$\mathbf{g}(\mathbf{x}) = \text{softmax}(\mathbf{W}_g\,\mathbf{x}) \in \mathbb{R}^{E}$$
$$\text{TopK}(\mathbf{g}, k): \quad \mathcal{S} = \{i : g_i \text{ is in top-}k\}$$
$$\text{MoE}(\mathbf{x}) = \sum_{i \in \mathcal{S}} \frac{g_i(\mathbf{x})}{\sum_{j\in\mathcal{S}} g_j(\mathbf{x})}\cdot \text{FFN}_i(\mathbf{x})$$
Load Balancing Loss
$$\mathcal{L}_{\text{balance}} = \alpha\,E \sum_{i=1}^{E} f_i \cdot P_i$$
$$f_i = \frac{\text{tokens routed to expert } i}{\text{total tokens}}, \quad P_i = \frac{1}{T}\sum_{t=1}^T g_i(\mathbf{x}_t)$$
Encourages equal load across experts. Mixtral 8×7B uses $E=8$ experts with $k=2$, giving 47B total params but only ~13B active per token.
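The routing equations above can be sketched for a single token; the expert FFNs are stood in for by arbitrary callables (names and shapes illustrative):

```python
import numpy as np

def moe_forward(x, W_g, experts, k=2):
    """Top-k MoE layer for one token x: softmax-gate, pick the
    top-k experts S, renormalize their gates, mix their outputs."""
    logits = W_g @ x
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()                      # g(x), softmax over E experts
    top = np.argsort(gate)[-k:]             # S: indices of the top-k experts
    weights = gate[top] / gate[top].sum()   # renormalize over S
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, E = 4, 8
W_g = rng.normal(size=(E, d))
experts = [(lambda W: (lambda t: W @ t))(rng.normal(size=(d, d))) for _ in range(E)]
x = rng.normal(size=d)
y = moe_forward(x, W_g, experts, k=2)
```

Because the selected gates are renormalized to sum to 1, a layer whose experts are all identical reduces exactly to a dense layer, regardless of routing.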
Sampling & Decoding Strategies
Temperature Scaling
$$P(x_t = v) = \frac{\exp(z_v / \tau)}{\sum_{v'}\exp(z_{v'} / \tau)}$$
$\tau \to 0$: approaches greedy decoding (argmax). $\tau = 1$: the unmodified softmax. $\tau > 1$: flatter distribution, more diverse but less reliable samples.
Top-$k$ Sampling
$$P'(v) = \begin{cases} P(v) / \sum_{v' \in V_k} P(v') & \text{if } v \in V_k \\ 0 & \text{otherwise} \end{cases}$$
Sampling is restricted to $V_k$, the $k$ highest-probability tokens, and renormalized.
Top-$p$ (Nucleus) Sampling
$$V_p = \min\left\{V' \subseteq \mathcal{V} : \sum_{v \in V'} P(v) \geq p\right\}$$
The smallest candidate set whose cumulative probability reaches $p$; sampling then renormalizes over $V_p$, so the cutoff adapts to how peaked the distribution is.
Min-$p$ Sampling
$$V_{\min p} = \{v : P(v) \geq p_{\min} \cdot \max_{v'}P(v')\}$$
Keeps only tokens at least $p_{\min}$ times as likely as the single most probable token.
Beam Search
$$\text{score}(\mathbf{y}_{1:t}) = \frac{1}{t^\alpha}\sum_{i=1}^t \log P(y_i | y_1, \dots, y_{i-1})$$
Maintains top-$B$ candidates at each step, with length normalization exponent $\alpha$.
Repetition Penalty
$$z'_v = \begin{cases} z_v / \theta & \text{if } v \in \text{generated tokens and } z_v > 0 \\ z_v \cdot \theta & \text{if } v \in \text{generated tokens and } z_v \leq 0 \end{cases}$$
With penalty $\theta > 1$, both cases push the logits of already-generated tokens down, discouraging repetition.
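The filters above compose naturally: scale by temperature, cut to top-$k$, then apply the nucleus cut and renormalize. A sketch of one decoding step (the `sample_filtered` helper is illustrative, not a library API):

```python
import numpy as np

def sample_filtered(logits, temperature=0.8, top_k=50, top_p=0.9, rng=None):
    """Apply temperature, then top-k, then nucleus (top-p) filtering
    to one step's logits, and sample a token id."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    keep = order[:top_k]                              # top-k cut
    cum = np.cumsum(probs[keep])
    keep = keep[: np.searchsorted(cum, top_p) + 1]    # smallest set with mass >= p
    p = probs[keep] / probs[keep].sum()               # renormalize survivors
    rng = rng or np.random.default_rng()
    return int(rng.choice(keep, p=p))
```

With a sharply peaked distribution both cuts collapse to the argmax, so the sampler degenerates to greedy decoding.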
LoRA & Parameter-Efficient Fine-Tuning
LoRA (Low-Rank Adaptation)
Freeze the pretrained weights and inject trainable low-rank decompositions:
$$\mathbf{W}' = \mathbf{W}_0 + \Delta\mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}$$
$$\mathbf{B} \in \mathbb{R}^{d \times r}, \quad \mathbf{A} \in \mathbb{R}^{r \times d}, \quad r \ll d$$
$$h = \mathbf{W}_0\mathbf{x} + \frac{\alpha}{r}\mathbf{B}\mathbf{A}\mathbf{x}$$
Typical $r = 8\text{–}64$, reducing trainable parameters by 1000× (e.g., 70B model → ~100M trainable params).
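The forward pass above is a one-liner once the shapes are fixed; a sketch using the standard initialization (B zero, A small random), so the adapter starts as a no-op:

```python
import numpy as np

def lora_forward(x, W0, A, B, alpha=16):
    """h = W0 x + (alpha/r) * B A x, with W0 frozen; only A and B train."""
    r = A.shape[0]
    return W0 @ x + (alpha / r) * (B @ (A @ x))

d, r = 64, 8
rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, d))       # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01 # A: small random init
B = np.zeros((d, r))               # B: zero init, so delta-W starts at 0
x = rng.normal(size=d)
```

Training only A and B means $2dr$ parameters per adapted matrix instead of $d^2$, which is where the large reduction in trainable parameters comes from.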
QLoRA (Quantized LoRA)
Combines LoRA with 4-bit quantized base weights using the NormalFloat4 (NF4) data type:
$$\mathbf{W}_{\text{NF4}} = \text{quantize}_{4\text{bit}}(\mathbf{W}_0)$$
$$h = \text{dequant}(\mathbf{W}_{\text{NF4}})\,\mathbf{x} + \frac{\alpha}{r}\mathbf{B}\mathbf{A}\mathbf{x}$$
This makes it feasible to fine-tune a 70B model on a single 48GB GPU.
Other PEFT Methods
| Method | Approach | Trainable Params |
| Prefix Tuning | Learnable "virtual tokens" prepended to keys/values | ~0.1% |
| Prompt Tuning | Learnable soft prompt embeddings | ~0.01% |
| Adapters | Small bottleneck layers inserted between transformer sublayers | ~1–3% |
| IA³ | Learned vectors that rescale keys, values, and FFN activations | ~0.01% |
Long Context Techniques
RoPE Frequency Scaling
$$\theta_i' = \theta_i \cdot s^{-1} = \frac{10000^{-2i/d}}{s} \quad\text{(linear scaling, factor } s \text{)}$$
Extending a 4K model to 32K context uses $s = 8$.
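The per-dimension frequencies and their scaled versions are easy to compute directly; a sketch (function name illustrative):

```python
import numpy as np

def rope_frequencies(d, scale=1.0, base=10000.0):
    """Per-pair RoPE frequencies theta_i = base^(-2i/d); linear
    position-interpolation scaling divides every frequency by `scale`."""
    i = np.arange(d // 2)
    return base ** (-2 * i / d) / scale

theta_4k = rope_frequencies(128)            # original model
theta_32k = rope_frequencies(128, scale=8)  # 4K -> 32K extension, s = 8
```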
YaRN (Yet another RoPE extensioN)
$$\theta_i' = \begin{cases} \theta_i & \text{if } \lambda_i < \lambda_{\min} \;\text{(high freq, no change)} \\ \theta_i / s & \text{if } \lambda_i > \lambda_{\max} \;\text{(low freq, full scale)} \\ (1-\gamma)\theta_i + \gamma\,\theta_i/s & \text{otherwise (interpolate)} \end{cases}$$
Here $\lambda_i = 2\pi/\theta_i$ is the wavelength of dimension $i$, and $\gamma \in [0,1]$ ramps smoothly between the unscaled and fully scaled regimes.
Flash Attention
IO-aware exact attention that avoids materializing the $n \times n$ attention matrix:
$$\text{Standard:}\quad O(n^2) \text{ memory}, \quad \text{Flash:}\quad O(n) \text{ memory}$$
$$\text{Compute:}\quad O(n^2 d) \;\text{(same)}, \quad\text{but}\;\sim 2\text{–}4\times \text{ faster in practice due to reduced HBM traffic}$$
Uses online softmax and tiling to keep intermediate results in SRAM, avoiding slow HBM reads/writes.
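The online softmax trick can be shown on a single query in isolation: process the scores in blocks, keeping only a running max $m$, normalizer $\ell$, and accumulator. A sketch (blocked over a 1-D score vector; the full kernel also tiles queries and fuses the matmuls):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, block=4):
    """One-pass blocked softmax-weighted sum of `values`, rescaling
    the accumulator whenever a new running max is found, as in
    Flash Attention's online softmax."""
    m, l = -np.inf, 0.0
    acc = np.zeros(values.shape[1])
    for start in range(0, len(scores), block):
        s = scores[start:start + block]
        v = values[start:start + block]
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)   # rescale old accumulator and normalizer
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ v
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
s = rng.normal(size=10)
V = rng.normal(size=(10, 3))
out = online_softmax_weighted_sum(s, V)
# reference: naive two-pass softmax over the full score vector
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
```

The blocked result is exact, not approximate, which is the point: Flash Attention changes memory traffic, not the computed output.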
Ring Attention
Distributes sequence across devices in a ring topology, overlapping communication with computation for near-infinite context:
$$\text{Effective context} = n_{\text{devices}} \times n_{\text{per\_device}}$$
Sliding Window Attention
$$\text{Attn}(i, j) = \begin{cases} \text{softmax}(\mathbf{q}_i\mathbf{k}_j^\top/\sqrt{d_k}) & \text{if } |i-j| \leq w \\ 0 & \text{otherwise} \end{cases}$$
Used in Mistral. With $L$ layers and window $w$, effective receptive field is $L \times w$ tokens.
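The formula above is symmetric in $|i - j|$; a decoder like Mistral additionally applies the causal constraint $j \le i$. A sketch of that causal variant of the mask:

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean attention mask: position i may attend to j iff
    j <= i (causal) and i - j <= w (window of size w)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j <= w)

mask = sliding_window_mask(6, 2)
# row 5 attends only to positions 3, 4, 5
```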
Neural Network Encyclopedia — 48 architectures — Generated March 2026
Covers: Perceptron, MLP, CNN, U-Net, RNN, LSTM, GRU, xLSTM, Bidirectional RNN, Attention, Transformer, BERT, Seq2Seq, ViT, RWKV, Autoencoder, VAE, GAN, Diffusion, Normalizing Flows, Energy-Based, Siamese/Contrastive (SimCLR, CLIP), JEPA, GNN, Capsule, Hopfield, Boltzmann, RBM, RBF, SOM, ResNet, Neural ODE, Echo State, Spiking NN, KAN, SSM/Mamba, Hypernetworks, Neural Cellular Automata, Neural Turing Machine, Bayesian NN, Liquid NN, Mixture Density, WaveNet + LLM Architecture Deep Dive
Glossary of Neural Network Terms
13 key technical terms used throughout this guide.
A
| Term | Definition |
| Activation Function | A non-linear function applied to neuron outputs (ReLU, sigmoid, tanh, GELU, SwiGLU). Without non-linearity, stacked layers would collapse to a single linear transformation. |
B
| Term | Definition |
| Backpropagation | The algorithm for computing gradients by applying the chain rule backwards through the computation graph. The fundamental mechanism for training neural networks. |
| Batch Normalization | Normalizing activations within a mini-batch to stabilize training. Originally motivated as reducing internal covariate shift. Standard in CNNs; largely replaced by LayerNorm in Transformers. |
C
| Term | Definition |
| Convolutional Neural Network (CNN) | A neural network using convolutional filters for spatial pattern recognition. Dominant in computer vision. Key components: convolution, pooling, fully connected layers. |
D
| Term | Definition |
| Dropout | A regularization technique that randomly sets neuron outputs to zero during training with probability p. Prevents overfitting by reducing co-adaptation between neurons. |
G
| Term | Definition |
| Gradient Descent | An optimization algorithm that iteratively updates parameters in the direction of steepest loss decrease. Variants: SGD, Adam, AdamW. Learning rate controls step size. |
L
| Term | Definition |
| Learning Rate | The step size for parameter updates during optimization. Too high causes divergence; too low causes slow convergence. Scheduling (warmup, cosine decay) is critical for training stability. |
| Loss Function | A function measuring the difference between model predictions and target values. Cross-entropy for classification, MSE for regression, KL divergence for distribution matching. |
O
| Term | Definition |
| Overfitting | When a model memorizes training data patterns including noise, performing well on training data but poorly on unseen data. Prevented by regularization, dropout, early stopping, and data augmentation. |
R
| Term | Definition |
| Recurrent Neural Network (RNN) | A network with feedback connections that processes sequential data by maintaining hidden state across time steps. Variants: LSTM, GRU. Largely replaced by Transformers. |
| Regularization | Techniques preventing overfitting: L1/L2 weight penalty, dropout, data augmentation, early stopping, weight decay. Controls model complexity. |
T
| Term | Definition |
| Tensor | A multi-dimensional array — the fundamental data structure in deep learning. Scalars (0D), vectors (1D), matrices (2D), and higher-order tensors are all processed on GPUs. |
| Transfer Learning | Using a model pre-trained on one task as a starting point for another. Fine-tuning a pre-trained LLM is transfer learning. Dramatically reduces training data and compute requirements. |