01
Perceptron
The simplest neural network — a single linear classifier.
Supervised
Classification
1958 — Rosenblatt
Forward Pass
Given input vector $\mathbf{x} \in \mathbb{R}^n$, weight vector $\mathbf{w} \in \mathbb{R}^n$, and bias $b$:
$$z = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b$$
$$\hat{y} = \sigma(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$
Learning Rule
For a training sample $(\mathbf{x}, y)$ with learning rate $\eta$:
$$\mathbf{w} \leftarrow \mathbf{w} + \eta \,(y - \hat{y})\,\mathbf{x}$$
$$b \leftarrow b + \eta \,(y - \hat{y})$$
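The update rule above fits in a few lines of NumPy. A minimal sketch, with the AND dataset, learning rate, and epoch count chosen for illustration:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=20):
    """Perceptron learning: w += eta*(y - y_hat)*x, b += eta*(y - y_hat)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w + b >= 0 else 0   # step activation
            w += eta * (yi - y_hat) * xi
            b += eta * (yi - y_hat)
    return w, b

# AND is linearly separable, so the convergence theorem below applies.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
preds = (X @ w + b >= 0).astype(int)
```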
Convergence Theorem
If the training data is linearly separable with margin $\gamma = \min_i \frac{y_i(\mathbf{w}^{*\top}\mathbf{x}_i)}{\|\mathbf{w}^*\|}$ (labels taken as $y_i \in \{-1,+1\}$ for this bound), the perceptron converges in at most $\left(\frac{R}{\gamma}\right)^2$ updates, where $R = \max_i \|\mathbf{x}_i\|$.
x₁ ──w₁──╮
x₂ ──w₂──┤→ Σ + b → step(·) → ŷ
x₃ ──w₃──╯
02
Multi-Layer Perceptron (MLP)
Feedforward network with one or more hidden layers — a universal function approximator.
Supervised
Classification / Regression
Universal Approximation
Architecture
An MLP with $L$ layers maps input $\mathbf{x}$ through a series of affine transformations and nonlinearities:
$$\mathbf{h}^{(0)} = \mathbf{x}$$
$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}, \quad l = 1, \dots, L$$
$$\mathbf{h}^{(l)} = f\!\left(\mathbf{z}^{(l)}\right), \quad l = 1, \dots, L-1$$
$$\hat{\mathbf{y}} = g\!\left(\mathbf{z}^{(L)}\right)$$
Where $f$ is a hidden activation (e.g. ReLU) and $g$ is the output activation (e.g. softmax for classification, identity for regression).
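A minimal NumPy sketch of this forward pass for classification; the layer sizes and random weights are illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for stability
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, weights, biases):
    """h^(l) = f(W^(l) h^(l-1) + b^(l)); softmax output activation g."""
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ h + b
        h = softmax(z) if l == len(weights) - 1 else relu(z)
    return h

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]            # input 4 -> two hidden layers of 8 -> 3 classes
Ws = [rng.normal(0, 0.5, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
y_hat = mlp_forward(rng.normal(size=4), Ws, bs)
```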
Universal Approximation Theorem
A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of $\mathbb{R}^n$, given a non-polynomial activation function.
$$\forall\, \varepsilon > 0,\;\exists\, N,\; \mathbf{W}, \mathbf{b}:\quad \sup_{\mathbf{x} \in K} \left| f(\mathbf{x}) - \sum_{i=1}^{N} v_i \,\sigma\!\left(\mathbf{w}_i^\top \mathbf{x} + b_i\right) \right| < \varepsilon$$
Loss Functions
Mean Squared Error (Regression)
$$\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\|\mathbf{y}_i - \hat{\mathbf{y}}_i\|^2$$
Cross-Entropy (Classification)
$$\mathcal{L}_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log\hat{y}_{i,c}$$
Input Hidden 1 Hidden 2 Output
○─────╲
○──────●──────●──────●
○──────●──────●──────●──────○ ŷ
○──────●──────●──────●
○─────╱
03
Activation Functions
Nonlinearities that give neural networks their expressive power.
| Name | Formula $f(z)$ | Derivative $f'(z)$ |
|---|---|---|
| Sigmoid | $\frac{1}{1+e^{-z}}$ | $f(z)(1-f(z))$ |
| Tanh | $\frac{e^z - e^{-z}}{e^z + e^{-z}}$ | $1 - f(z)^2$ |
| ReLU | $\max(0, z)$ | $\begin{cases}1 & z>0\\0 & z\leq 0\end{cases}$ |
| Leaky ReLU | $\max(\alpha z, z)$ | $\begin{cases}1 & z>0\\\alpha & z\leq 0\end{cases}$ |
| ELU | $\begin{cases}z & z>0\\\alpha(e^z-1) & z\leq 0\end{cases}$ | $\begin{cases}1 & z>0\\f(z)+\alpha & z\leq 0\end{cases}$ |
| GELU | $z \cdot \Phi(z)$ | $\Phi(z) + z\,\phi(z)$ |
| Swish / SiLU | $z \cdot \sigma(z)$ | $f(z) + \sigma(z)(1 - f(z))$ |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $f_i(\delta_{ij} - f_j)$ |
| Mish | $z \cdot \tanh(\ln(1+e^z))$ | See chain rule expansion |
GELU (Gaussian Error Linear Unit)
$$\text{GELU}(z) = z \cdot \Phi(z) = z \cdot \frac{1}{2}\left[1 + \text{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right]$$
$$\approx 0.5\,z\left(1 + \tanh\!\left[\sqrt{\frac{2}{\pi}}\left(z + 0.044715\,z^3\right)\right]\right)$$
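A quick NumPy check that the tanh form tracks the exact erf form; the grid and tolerance are illustrative:

```python
import numpy as np
from math import erf

def gelu_exact(z):
    """GELU(z) = z * Phi(z), with Phi the standard normal CDF via erf."""
    return z * 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))

def gelu_tanh(z):
    """The tanh approximation quoted above."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.linspace(-4.0, 4.0, 401)
max_err = np.abs(gelu_exact(z) - gelu_tanh(z)).max()
```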
04
Backpropagation
The chain rule applied layer-by-layer to compute gradients efficiently.
Core Algorithm
1986 — Rumelhart, Hinton, Williams
Chain Rule (Vector Form)
For loss $\mathcal{L}$ with respect to parameters in layer $l$:
$$\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{z}^{(L)}}\mathcal{L} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}}$$
$$\boldsymbol{\delta}^{(l)} = \left(\mathbf{W}^{(l+1)\top}\boldsymbol{\delta}^{(l+1)}\right) \odot f'\!\left(\mathbf{z}^{(l)}\right)$$
Parameter Gradients
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} \mathbf{h}^{(l-1)\top}$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$
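The two recursions above, sketched in NumPy for a two-layer tanh/softmax network, with a finite-difference check on one weight. Shapes, seed, and target are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)
y = np.array([0.0, 1.0])                      # one-hot target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1
    h1 = np.tanh(z1)
    z2 = W2 @ h1 + b2
    e = np.exp(z2 - z2.max())
    p = e / e.sum()                           # softmax output
    return -np.sum(y * np.log(p)), (h1, p)    # cross-entropy loss

loss, (h1, p) = forward(W1, b1, W2, b2)

# delta^(L) = p - y for softmax + cross-entropy
d2 = p - y
gW2, gb2 = np.outer(d2, h1), d2
# delta^(l) = (W^(l+1)T delta^(l+1)) * f'(z^(l)), with tanh' = 1 - h^2
d1 = (W2.T @ d2) * (1.0 - h1**2)
gW1, gb1 = np.outer(d1, x), d1

# Finite-difference check on a single entry of W1
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
num = (forward(W1p, b1, W2, b2)[0] - loss) / eps
```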
Computational Complexity
For a network with $L$ layers and $n$ neurons per layer, backpropagation has $O(Ln^2)$ time complexity — the same as the forward pass — making it highly efficient compared to numerical differentiation.
05
Optimization Algorithms
Methods for traversing the loss landscape to find good minima.
Stochastic Gradient Descent (SGD)
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t)$$
SGD with Momentum
$$\mathbf{v}_{t+1} = \mu\, \mathbf{v}_t - \eta\, \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t)$$
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \mathbf{v}_{t+1}$$
Nesterov Accelerated Gradient
$$\mathbf{v}_{t+1} = \mu\, \mathbf{v}_t - \eta\, \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t + \mu\,\mathbf{v}_t)$$
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \mathbf{v}_{t+1}$$
AdaGrad
$$\mathbf{G}_{t} = \mathbf{G}_{t-1} + \mathbf{g}_t^2$$
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{G}_t + \epsilon}}\, \mathbf{g}_t$$
RMSProp
$$\mathbf{v}_t = \rho\,\mathbf{v}_{t-1} + (1-\rho)\,\mathbf{g}_t^2$$
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{v}_t + \epsilon}}\, \mathbf{g}_t$$
Adam
$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t$$
$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2$$
$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}$$
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}\,\hat{\mathbf{m}}_t$$
AdamW (Decoupled Weight Decay)
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\left(\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} + \lambda\,\boldsymbol{\theta}_t\right)$$
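A minimal sketch of the Adam update on a toy quadratic; the step count, learning rate, and objective are illustrative:

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(theta) = ||theta||^2, whose gradient is 2*theta.
theta, m, v = np.array([3.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
```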
06
Regularization Techniques
Methods to prevent overfitting and improve generalization.
L1 Regularization (Lasso)
$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda \sum_{l}\|\mathbf{W}^{(l)}\|_1 = \mathcal{L} + \lambda \sum_{l}\sum_{i,j}|W^{(l)}_{ij}|$$
L2 Regularization (Ridge / Weight Decay)
$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \frac{\lambda}{2}\sum_{l}\|\mathbf{W}^{(l)}\|_F^2 = \mathcal{L} + \frac{\lambda}{2}\sum_{l}\sum_{i,j}(W^{(l)}_{ij})^2$$
Dropout
During training, each neuron is independently set to zero with probability $p$:
$$\mathbf{m} \sim \text{Bernoulli}(1-p)$$
$$\tilde{\mathbf{h}}^{(l)} = \mathbf{m} \odot \mathbf{h}^{(l)}$$
$$\text{At test time:}\quad \mathbf{h}^{(l)}_{\text{test}} = (1-p)\,\mathbf{h}^{(l)}$$
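A sketch of (non-inverted) dropout matching the convention above, where the mask keeps a unit with probability $1-p$; sizes and seed are illustrative:

```python
import numpy as np

def dropout(h, p, train, rng):
    """Zero each unit with prob p at train time; scale by (1-p) at test time."""
    if train:
        m = rng.random(h.shape) >= p     # keep mask ~ Bernoulli(1-p)
        return m * h
    return (1 - p) * h

rng = np.random.default_rng(0)
h = np.ones(100_000)
p = 0.3
train_out = dropout(h, p, True, rng)     # random mask applied
test_out = dropout(h, p, False, rng)     # deterministic rescaling
```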
Batch Normalization
$$\mu_B = \frac{1}{m}\sum_{i=1}^m z_i, \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^m(z_i - \mu_B)^2$$
$$\hat{z}_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
$$y_i = \gamma\,\hat{z}_i + \beta$$
Layer Normalization
$$\mu = \frac{1}{H}\sum_{i=1}^{H}h_i, \quad \sigma^2 = \frac{1}{H}\sum_{i=1}^{H}(h_i - \mu)^2$$
$$\hat{h}_i = \frac{h_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y_i = \gamma\,\hat{h}_i + \beta$$
RMSNorm
$$\text{RMS}(\mathbf{h}) = \sqrt{\frac{1}{H}\sum_{i=1}^H h_i^2}$$
$$\hat{h}_i = \frac{h_i}{\text{RMS}(\mathbf{h})}\,\gamma_i$$
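Both normalizers in NumPy. An eps term is added inside the RMS for numerical stability, as implementations typically do; the data and shapes are illustrative:

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-5):
    """Normalize each row to zero mean, unit variance, then scale and shift."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

def rms_norm(h, gamma, eps=1e-5):
    """Rescale each row by its root-mean-square; no mean subtraction."""
    rms = np.sqrt(np.mean(h**2, axis=-1, keepdims=True) + eps)
    return gamma * h / rms

rng = np.random.default_rng(0)
h = rng.normal(3.0, 2.0, size=(4, 16))   # shifted, scaled activations
ln = layer_norm(h, 1.0, 0.0)
rn = rms_norm(h, 1.0)
```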
07
Convolutional Neural Network (CNN)
Networks exploiting spatial structure through shared local filters.
Supervised
Computer Vision
1989 — LeCun
2D Convolution
For input $\mathbf{X} \in \mathbb{R}^{C_{in} \times H \times W}$ and filter $\mathbf{K} \in \mathbb{R}^{C_{in} \times k \times k}$:
$$(\mathbf{X} * \mathbf{K})[i,j] = \sum_{c=1}^{C_{in}}\sum_{m=0}^{k-1}\sum_{n=0}^{k-1} X[c,\, i+m,\, j+n] \cdot K[c,\, m,\, n]$$
Output Dimensions
$$H_{\text{out}} = \left\lfloor\frac{H + 2p - k}{s}\right\rfloor + 1, \quad W_{\text{out}} = \left\lfloor\frac{W + 2p - k}{s}\right\rfloor + 1$$
Where $p$ is padding, $s$ is stride, $k$ is kernel size.
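The output-size formula and the convolution sum, sketched naively in NumPy. The averaging kernel and input sizes are illustrative; like most deep-learning frameworks, this computes cross-correlation:

```python
import numpy as np

def conv_out(n, k, p, s):
    """floor((n + 2p - k) / s) + 1, as in the formula above."""
    return (n + 2 * p - k) // s + 1

def conv2d(X, K, p=0, s=1):
    """Naive single-output-channel convolution over a C_in x H x W input."""
    C, H, W = X.shape
    _, k, _ = K.shape
    Xp = np.pad(X, ((0, 0), (p, p), (p, p)))
    Ho, Wo = conv_out(H, k, p, s), conv_out(W, k, p, s)
    Y = np.zeros((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            Y[i, j] = np.sum(Xp[:, i*s:i*s+k, j*s:j*s+k] * K)
    return Y

X = np.random.default_rng(0).normal(size=(3, 32, 32))
K = np.ones((3, 3, 3)) / 27.0        # 3x3 averaging filter over 3 channels
Y = conv2d(X, K, p=1, s=2)
```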
Depthwise Separable Convolution
Factorizes a standard convolution into a depthwise and pointwise step:
$$\text{Standard cost:}\quad C_{in} \cdot k^2 \cdot C_{out} \cdot H' \cdot W'$$
$$\text{Separable cost:}\quad C_{in} \cdot k^2 \cdot H' \cdot W' + C_{in} \cdot C_{out} \cdot H' \cdot W'$$
$$\text{Reduction ratio:}\quad \frac{1}{C_{out}} + \frac{1}{k^2}$$
Dilated (Atrous) Convolution
$$(\mathbf{X} *_d \mathbf{K})[i,j] = \sum_{m}\sum_{n} X[i + d \cdot m,\; j + d \cdot n] \cdot K[m, n]$$
Effective receptive field: $k + (k-1)(d-1)$, where $d$ is the dilation rate.
Pooling Operations
$$\text{Max Pool:}\quad y_{ij} = \max_{(m,n) \in \mathcal{R}_{ij}} x_{mn}$$
$$\text{Avg Pool:}\quad y_{ij} = \frac{1}{|\mathcal{R}_{ij}|}\sum_{(m,n) \in \mathcal{R}_{ij}} x_{mn}$$
Transposed Convolution
Used for upsampling. Equivalent to a fractionally strided convolution: zeros are inserted between (and around) input elements before a standard convolution:
$$H_{\text{out}} = (H_{\text{in}} - 1) \cdot s - 2p + k + p_{\text{out}}$$
08
U-Net
Encoder-decoder with skip connections for dense prediction tasks. The classic backbone of image diffusion models.
Supervised
Segmentation / Diffusion
2015 — Ronneberger et al.
Architecture
Symmetric encoder (contracting) and decoder (expanding) path with skip connections concatenating encoder features to decoder features at each resolution:
$$\text{Encoder:}\quad \mathbf{e}^{(l)} = \text{MaxPool}\!\left(\text{ConvBlock}(\mathbf{e}^{(l-1)})\right)$$
$$\text{Decoder:}\quad \mathbf{d}^{(l)} = \text{ConvBlock}\!\left([\text{UpConv}(\mathbf{d}^{(l+1)});\; \mathbf{e}^{(l)}]\right)$$
Where $[\cdot;\cdot]$ denotes channel-wise concatenation (skip connection).
ConvBlock
$$\text{ConvBlock}(\mathbf{x}) = \text{ReLU}(\text{BN}(\text{Conv}_{3\times3}(\text{ReLU}(\text{BN}(\text{Conv}_{3\times3}(\mathbf{x}))))))$$
U-Net in Diffusion Models
In DDPM / Stable Diffusion, the U-Net is conditioned on timestep $t$ and optional conditioning $c$:
$$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) = \text{U-Net}(\mathbf{x}_t,\; \text{SinEmb}(t),\; \text{CrossAttn}(c))$$
Time embeddings are injected via addition/FiLM layers; text conditioning via cross-attention at each resolution level.
Encoder Decoder
x ──[Conv]──┐ ┌──[Conv]── ŷ
64 │ skip │ 64
[Pool] ├───────────────→┤ [UpConv]
──[Conv]──┐ │ │ ┌──[Conv]──
128 │ │ skip │ │ 128
[Pool] ├─┼───────────────→┼─┤ [UpConv]
─[Conv]─┐ │ │ │ │ ┌─[Conv]─
256 │ │ │ skip │ │ │ 256
[Pool] ├─┼─┼──────────────→┼─┼─┤[UpConv]
─[Conv]─╯ │ │ Bottleneck │ │ ╰─[Conv]─
512 │ │ ────[Conv]── │ │ 512
09
Recurrent Neural Network (Vanilla RNN)
Networks with temporal memory via recurrent connections.
Supervised
Sequence Modeling
Temporal
Hidden State Dynamics
$$\mathbf{h}_t = \tanh\!\left(\mathbf{W}_{hh}\,\mathbf{h}_{t-1} + \mathbf{W}_{xh}\,\mathbf{x}_t + \mathbf{b}_h\right)$$
$$\mathbf{y}_t = \mathbf{W}_{hy}\,\mathbf{h}_t + \mathbf{b}_y$$
Backpropagation Through Time (BPTT)
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T}\sum_{k=1}^{t} \frac{\partial \mathcal{L}_t}{\partial \mathbf{h}_t}\left(\prod_{j=k+1}^{t}\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}\right)\frac{\partial \mathbf{h}_k}{\partial \mathbf{W}_{hh}}$$
Vanishing/Exploding Gradient Problem
The product of Jacobians $\prod_j \frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}$ can shrink or grow exponentially:
$$\left\|\prod_{j=k+1}^{t}\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}\right\| \leq \left(\|\mathbf{W}_{hh}\| \cdot \gamma\right)^{t-k}$$
Where $\gamma = \max |f'(z)|$. If $\|\mathbf{W}_{hh}\| \cdot \gamma < 1$, gradients vanish; if $> 1$, they explode.
10
Long Short-Term Memory (LSTM)
Gated RNN architecture solving the vanishing gradient problem.
Supervised
Sequence Modeling
1997 — Hochreiter & Schmidhuber
Gate Equations
$$\mathbf{f}_t = \sigma\!\left(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f\right) \quad \text{(forget gate)}$$
$$\mathbf{i}_t = \sigma\!\left(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i\right) \quad \text{(input gate)}$$
$$\tilde{\mathbf{c}}_t = \tanh\!\left(\mathbf{W}_c[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c\right) \quad \text{(candidate)}$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \quad \text{(cell state)}$$
$$\mathbf{o}_t = \sigma\!\left(\mathbf{W}_o[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o\right) \quad \text{(output gate)}$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$
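One step of the gate equations in NumPy, with the four gate blocks (forget, input, candidate, output) packed into a single weight matrix; sizes and seed are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b, d_h):
    """One LSTM step; W stacks the f, i, c-tilde, o blocks row-wise."""
    z = W @ np.concatenate([h, x]) + b
    f = sigmoid(z[:d_h])                 # forget gate
    i = sigmoid(z[d_h:2*d_h])            # input gate
    c_tilde = np.tanh(z[2*d_h:3*d_h])    # candidate
    o = sigmoid(z[3*d_h:])               # output gate
    c = f * c + i * c_tilde              # cell state
    h = o * np.tanh(c)                   # hidden state
    return h, c

d_x, d_h = 8, 16
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (4 * d_h, d_h + d_x))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(10):                      # run a short random sequence
    h, c = lstm_step(rng.normal(size=d_x), h, c, W, b, d_h)

n_params = W.size + b.size               # 4[(d_h + d_x) d_h + d_h]
```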
Gradient Flow through Cell State
The cell state provides a highway for gradients:
$$\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \text{diag}(\mathbf{f}_t)$$
$$\frac{\partial \mathbf{c}_T}{\partial \mathbf{c}_k} = \prod_{j=k+1}^{T}\text{diag}(\mathbf{f}_j)$$
When $\mathbf{f}_t \approx 1$, gradients flow unattenuated over many timesteps.
Parameter Count
$$\text{Params} = 4\left[(d_h + d_x)\cdot d_h + d_h\right]$$
11
Gated Recurrent Unit (GRU)
A simplified gating mechanism merging cell and hidden state.
Supervised
Sequence Modeling
2014 — Cho et al.
Gate Equations
$$\mathbf{r}_t = \sigma\!\left(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_r\right) \quad \text{(reset gate)}$$
$$\mathbf{z}_t = \sigma\!\left(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_z\right) \quad \text{(update gate)}$$
$$\tilde{\mathbf{h}}_t = \tanh\!\left(\mathbf{W}_h[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_h\right)$$
$$\mathbf{h}_t = (1 - \mathbf{z}_t)\odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$$
Parameter Count
$$\text{Params} = 3\left[(d_h + d_x)\cdot d_h + d_h\right]$$
GRU uses 25% fewer parameters than LSTM: three weight blocks (reset gate, update gate, candidate) versus LSTM's four.
12
Extended LSTM (xLSTM)
Modernized LSTM with exponential gating and matrix memory for LLM-scale performance.
Supervised
Sequence Modeling
2024 — Beck et al.
sLSTM (Scalar Memory)
Extends LSTM with exponential gating and a normalizer state for numerical stability:
$$\mathbf{f}_t = \exp\!\left(\mathbf{w}_f^\top \mathbf{x}_t + b_f\right) \quad \text{(exponential forget gate)}$$
$$\mathbf{i}_t = \exp\!\left(\mathbf{w}_i^\top \mathbf{x}_t + b_i\right) \quad \text{(exponential input gate)}$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$
$$\mathbf{n}_t = \mathbf{f}_t \odot \mathbf{n}_{t-1} + \mathbf{i}_t \quad \text{(normalizer state)}$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \frac{\mathbf{c}_t}{\mathbf{n}_t}$$
mLSTM (Matrix Memory)
Replaces the scalar cell state with a matrix $\mathbf{C}_t \in \mathbb{R}^{d \times d}$, enabling key-value storage:
$$\mathbf{k}_t = \mathbf{W}_k\mathbf{x}_t, \quad \mathbf{v}_t = \mathbf{W}_v\mathbf{x}_t, \quad \mathbf{q}_t = \mathbf{W}_q\mathbf{x}_t$$
$$\mathbf{C}_t = f_t\,\mathbf{C}_{t-1} + i_t\,\mathbf{v}_t\mathbf{k}_t^\top$$
$$\mathbf{n}_t = f_t\,\mathbf{n}_{t-1} + i_t\,\mathbf{k}_t$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \frac{\mathbf{C}_t\,\mathbf{q}_t}{\max(|\mathbf{n}_t^\top\mathbf{q}_t|, 1)}$$
mLSTM is fully parallelizable (no hidden-to-hidden recurrence) and can be viewed as a linearized self-attention with a decay factor.
13
Bidirectional RNN
Processing sequences in both forward and backward directions.
Sequence Modeling
1997 — Schuster & Paliwal
Architecture
$$\overrightarrow{\mathbf{h}}_t = f\!\left(\mathbf{W}_{\overrightarrow{h}}\,\overrightarrow{\mathbf{h}}_{t-1} + \mathbf{W}_{x\overrightarrow{h}}\,\mathbf{x}_t + \mathbf{b}_{\overrightarrow{h}}\right)$$
$$\overleftarrow{\mathbf{h}}_t = f\!\left(\mathbf{W}_{\overleftarrow{h}}\,\overleftarrow{\mathbf{h}}_{t+1} + \mathbf{W}_{x\overleftarrow{h}}\,\mathbf{x}_t + \mathbf{b}_{\overleftarrow{h}}\right)$$
$$\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t;\, \overleftarrow{\mathbf{h}}_t] \in \mathbb{R}^{2d_h}$$
$$\mathbf{y}_t = \mathbf{W}_y\,\mathbf{h}_t + \mathbf{b}_y$$
14
Attention Mechanism
Learning to focus on relevant parts of the input.
Core Mechanism
2014 — Bahdanau et al.
Additive (Bahdanau) Attention
$$e_{ij} = \mathbf{v}^\top \tanh\!\left(\mathbf{W}_1\mathbf{h}_i + \mathbf{W}_2\mathbf{s}_j\right)$$
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k}\exp(e_{kj})}$$
$$\mathbf{c}_j = \sum_i \alpha_{ij}\,\mathbf{h}_i$$
Multiplicative (Luong) Attention
$$e_{ij} = \mathbf{s}_j^\top \mathbf{W} \mathbf{h}_i \quad\text{(general)}$$
$$e_{ij} = \mathbf{s}_j^\top \mathbf{h}_i \quad\text{(dot)}$$
Scaled Dot-Product Attention
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}$$
The $\sqrt{d_k}$ scaling prevents softmax saturation when dot products grow large.
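Scaled dot-product attention is a few lines of NumPy; the shapes and seed are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V; also return the attention weights."""
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))
    return A @ V, A

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 8, 4
Q, K = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out, A = attention(Q, K, V)
```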
15
Transformer
Attention-only architecture that revolutionized NLP and beyond.
Supervised / Self-Supervised
NLP / Vision / Multimodal
2017 — Vaswani et al.
Multi-Head Attention
$$\mathbf{Q}_i = \mathbf{X}\mathbf{W}_i^Q,\quad \mathbf{K}_i = \mathbf{X}\mathbf{W}_i^K,\quad \mathbf{V}_i = \mathbf{X}\mathbf{W}_i^V$$
$$\text{head}_i = \text{Attention}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i)$$
$$\text{MultiHead}(\mathbf{X}) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,\mathbf{W}^O$$
Where $\mathbf{W}_i^Q, \mathbf{W}_i^K \in \mathbb{R}^{d \times d_k}$, $\mathbf{W}_i^V \in \mathbb{R}^{d \times d_v}$, $\mathbf{W}^O \in \mathbb{R}^{hd_v \times d}$.
Sinusoidal Positional Encoding
$$\text{PE}_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right)$$
$$\text{PE}_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
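The sinusoidal table in NumPy; sequence length and model width are illustrative:

```python
import numpy as np

def sinusoidal_pe(n_pos, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d // 2)[None, :]
    angle = pos / 10000 ** (2 * i / d)
    pe = np.zeros((n_pos, d))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_pe(128, 64)
```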
Rotary Positional Embedding (RoPE)
$$\mathbf{R}_\theta^{(m)} = \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 \\ \sin m\theta_1 & \cos m\theta_1 \\ & & \cos m\theta_2 & -\sin m\theta_2 \\ & & \sin m\theta_2 & \cos m\theta_2 \\ & & & & \ddots \end{pmatrix}$$
$$\mathbf{q}_m^\top \mathbf{k}_n = (\mathbf{R}_\theta^{(m)}\mathbf{W}_q \mathbf{x}_m)^\top (\mathbf{R}_\theta^{(n)}\mathbf{W}_k \mathbf{x}_n)$$
Feed-Forward Network (per position)
$$\text{FFN}(\mathbf{x}) = \mathbf{W}_2\,\text{GELU}(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$
Encoder Block
$$\mathbf{x}' = \text{LayerNorm}(\mathbf{x} + \text{MultiHead}(\mathbf{x}))$$
$$\mathbf{x}'' = \text{LayerNorm}(\mathbf{x}' + \text{FFN}(\mathbf{x}'))$$
Decoder Block (with causal mask)
The causal mask $\mathbf{M}$ sets future positions to $-\infty$ before softmax:
$$M_{ij} = \begin{cases} 0 & \text{if } i \geq j \\ -\infty & \text{if } i < j \end{cases}$$
$$\text{CausalAttn}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}} + \mathbf{M}\right)\mathbf{V}$$
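The causal mask in NumPy: adding $-\infty$ above the diagonal zeroes those attention weights after the softmax. Shapes and seed are illustrative:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with M = 0 on/below the diagonal, -inf above."""
    n, d_k = Q.shape
    M = np.where(np.tril(np.ones((n, n))) == 1, 0.0, -np.inf)
    S = Q @ K.T / np.sqrt(d_k) + M
    e = np.exp(S - S.max(axis=-1, keepdims=True))   # exp(-inf) -> 0
    A = e / e.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
V = rng.normal(size=(4, 3))
out, A = causal_attention(Q, K, V)
```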
Grouped-Query Attention (GQA)
Shares key-value heads across groups of query heads to reduce memory:
$$\text{KV heads} = \frac{h}{G}, \quad \text{Each KV head serves } G \text{ query heads}$$
Computational Complexity
$$\text{Self-Attention:}\quad O(n^2 \cdot d)$$
$$\text{FFN:}\quad O(n \cdot d^2)$$
$$\text{Total per layer:}\quad O(n^2 d + n d^2)$$
16
BERT (Encoder-Only Transformer)
Bidirectional encoder pre-trained with masked language modeling — the foundation for NLU tasks.
Self-Supervised → Supervised
NLU / Classification / NER
2018 — Devlin et al.
Masked Language Modeling (MLM)
Randomly mask 15% of input tokens and predict the originals:
$$\mathcal{L}_{\text{MLM}} = -\mathbb{E}\!\left[\sum_{i \in \mathcal{M}} \log P_\theta(x_i | \mathbf{x}_{\backslash\mathcal{M}})\right]$$
Of the 15% selected: 80% are replaced with [MASK], 10% with a random token, 10% unchanged.
Next Sentence Prediction (NSP)
$$P(\text{IsNext} | [\text{CLS}]\, A\, [\text{SEP}]\, B) = \sigma(\mathbf{w}^\top \mathbf{h}_{[\text{CLS}]} + b)$$
Input Representation
$$\mathbf{h}_0 = \mathbf{E}_{\text{tok}}(\mathbf{x}) + \mathbf{E}_{\text{pos}} + \mathbf{E}_{\text{seg}}$$
Segment embeddings distinguish sentence A from B. The [CLS] token representation is used for classification tasks.
Fine-Tuning
$$\text{Classification:}\quad \hat{y} = \text{softmax}(\mathbf{W}\,\mathbf{h}_{[\text{CLS}]} + \mathbf{b})$$
$$\text{Token-level (NER):}\quad \hat{y}_i = \text{softmax}(\mathbf{W}\,\mathbf{h}_i + \mathbf{b})$$
| Model | Layers | Hidden | Heads | Params |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
| RoBERTa | 24 | 1024 | 16 | 355M |
17
Sequence-to-Sequence (Encoder-Decoder)
Mapping variable-length input sequences to variable-length output sequences.
Supervised
Translation / Summarization
2014 — Sutskever et al.
RNN-Based Seq2Seq
$$\text{Encoder:}\quad \mathbf{h}_t^{\text{enc}} = f_{\text{enc}}(\mathbf{x}_t, \mathbf{h}_{t-1}^{\text{enc}})$$
$$\mathbf{c} = \mathbf{h}_T^{\text{enc}} \quad \text{(context vector = final encoder state)}$$
$$\text{Decoder:}\quad \mathbf{h}_t^{\text{dec}} = f_{\text{dec}}(y_{t-1}, \mathbf{h}_{t-1}^{\text{dec}}, \mathbf{c})$$
$$P(y_t | y_{<t}, \mathbf{x}) = \text{softmax}(\mathbf{W}_o\,\mathbf{h}_t^{\text{dec}})$$
Seq2Seq with Attention
$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_j\exp(e_{tj})}, \quad e_{ti} = \text{score}(\mathbf{h}_t^{\text{dec}}, \mathbf{h}_i^{\text{enc}})$$
$$\mathbf{c}_t = \sum_i \alpha_{ti}\,\mathbf{h}_i^{\text{enc}}$$
$$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_c[\mathbf{c}_t;\,\mathbf{h}_t^{\text{dec}}])$$
Transformer Encoder-Decoder (T5 / BART)
$$\text{Encoder:}\quad \mathbf{H}^{\text{enc}} = \text{TransformerEncoder}(\mathbf{x})$$
$$\text{Decoder layer:}\quad \mathbf{h}' = \text{CausalSelfAttn}(\mathbf{h}) + \mathbf{h}$$
$$\mathbf{h}'' = \text{CrossAttn}(\mathbf{h}', \mathbf{H}^{\text{enc}}) + \mathbf{h}'$$
$$\mathbf{h}''' = \text{FFN}(\mathbf{h}'') + \mathbf{h}''$$
Teacher Forcing
$$\text{Training:}\quad \hat{y}_t = f(y_{t-1}^{\text{gold}}, \mathbf{h}_{t-1}) \quad \text{(use ground truth as input)}$$
$$\text{Inference:}\quad \hat{y}_t = f(\hat{y}_{t-1}, \mathbf{h}_{t-1}) \quad \text{(use model's own prediction)}$$
18
Vision Transformer (ViT)
Applying the transformer architecture directly to image patches.
Supervised / Self-Supervised
Computer Vision
2020 — Dosovitskiy et al.
Patch Embedding
An image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ is split into $N$ patches of size $P \times P$:
$$N = \frac{H \cdot W}{P^2}$$
$$\mathbf{x}_p^{(i)} \in \mathbb{R}^{P^2 \cdot C} \quad \text{(flattened patch } i\text{)}$$
$$\mathbf{z}_0^{(i)} = \mathbf{x}_p^{(i)}\,\mathbf{E} + \mathbf{e}_{\text{pos}}^{(i)}, \quad \mathbf{E} \in \mathbb{R}^{(P^2 C) \times d}$$
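Patch extraction is a reshape/transpose; with ViT-B/16 geometry the flattened patch dimension ($16 \cdot 16 \cdot 3 = 768$) happens to equal the hidden size. The random projection below stands in for the learned $\mathbf{E}$:

```python
import numpy as np

def patchify(x, P):
    """Split an H x W x C image into N = HW/P^2 flattened P*P*C patches."""
    H, W, C = x.shape
    n_h, n_w = H // P, W // P
    patches = x.reshape(n_h, P, n_w, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(n_h * n_w, P * P * C)

rng = np.random.default_rng(0)
x = rng.normal(size=(224, 224, 3))
patches = patchify(x, 16)                      # ViT-B/16 geometry: 14 x 14 patches
E = rng.normal(0, 0.02, (16 * 16 * 3, 768))    # stand-in for the learned projection
z0 = patches @ E
```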
CLS Token
$$\mathbf{z}_0 = [\mathbf{x}_{\text{cls}};\; \mathbf{z}_0^{(1)};\; \mathbf{z}_0^{(2)};\; \dots;\; \mathbf{z}_0^{(N)}] + \mathbf{E}_{\text{pos}}$$
$$\hat{y} = \text{MLP}(\text{LayerNorm}(\mathbf{z}_L^{(0)}))$$
Full Forward Pass
$$\mathbf{z}'_l = \text{MSA}(\text{LN}(\mathbf{z}_{l-1})) + \mathbf{z}_{l-1}$$
$$\mathbf{z}_l = \text{FFN}(\text{LN}(\mathbf{z}'_l)) + \mathbf{z}'_l$$
Variants
| Model | Patch Size | Layers | Hidden | Heads | Params |
|---|---|---|---|---|---|
| ViT-B/16 | 16 | 12 | 768 | 12 | 86M |
| ViT-L/16 | 16 | 24 | 1024 | 16 | 307M |
| ViT-H/14 | 14 | 32 | 1280 | 16 | 632M |
19
RWKV (Receptance Weighted Key Value)
Linear-complexity RNN that matches transformer quality — trainable like a transformer, runs like an RNN.
Self-Supervised
Language Modeling
2023 — Peng et al.
Time Mixing (Attention Replacement)
$$\mathbf{r}_t = \mathbf{W}_r(\mu_r \odot \mathbf{x}_t + (1-\mu_r)\odot\mathbf{x}_{t-1})$$
$$\mathbf{k}_t = \mathbf{W}_k(\mu_k \odot \mathbf{x}_t + (1-\mu_k)\odot\mathbf{x}_{t-1})$$
$$\mathbf{v}_t = \mathbf{W}_v(\mu_v \odot \mathbf{x}_t + (1-\mu_v)\odot\mathbf{x}_{t-1})$$
WKV Mechanism (Linear Attention)
$$\text{wkv}_t = \frac{\sum_{i=1}^{t-1}e^{-(t-1-i)w+k_i}\mathbf{v}_i + e^{u+k_t}\mathbf{v}_t}{\sum_{i=1}^{t-1}e^{-(t-1-i)w+k_i} + e^{u+k_t}}$$
$$\mathbf{o}_t = \sigma(\mathbf{r}_t) \odot \text{wkv}_t$$
Where $w$ is a learned decay vector and $u$ is a learned bonus for the current token. This can be computed recurrently in $O(1)$ per step.
Channel Mixing (FFN Replacement)
$$\mathbf{r}_t' = \sigma(\mathbf{W}_{r'}(\mu_{r'}\odot\mathbf{x}_t + (1-\mu_{r'})\odot\mathbf{x}_{t-1}))$$
$$\mathbf{k}_t' = \mathbf{W}_{k'}(\mu_{k'}\odot\mathbf{x}_t + (1-\mu_{k'})\odot\mathbf{x}_{t-1})$$
$$\mathbf{o}_t = \mathbf{r}_t' \odot (\mathbf{W}_v'\,\max(\mathbf{k}_t', 0)^2)$$
Complexity
$$\text{Training:}\quad O(Td) \quad\text{(parallelizable like transformer)}$$
$$\text{Inference:}\quad O(d) \text{ per token} \quad\text{(constant, like RNN)}$$
20
Autoencoder
Learning compressed representations via reconstruction.
Unsupervised
Representation Learning
Architecture
$$\text{Encoder:}\quad \mathbf{z} = f_\phi(\mathbf{x}) = \sigma(\mathbf{W}_e\mathbf{x} + \mathbf{b}_e)$$
$$\text{Decoder:}\quad \hat{\mathbf{x}} = g_\theta(\mathbf{z}) = \sigma(\mathbf{W}_d\mathbf{z} + \mathbf{b}_d)$$
$$\mathcal{L} = \|\mathbf{x} - \hat{\mathbf{x}}\|^2$$
Denoising Autoencoder
$$\tilde{\mathbf{x}} = \mathbf{x} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2\mathbf{I})$$
$$\mathcal{L}_{\text{DAE}} = \|\mathbf{x} - g_\theta(f_\phi(\tilde{\mathbf{x}}))\|^2$$
Sparse Autoencoder
$$\mathcal{L}_{\text{sparse}} = \|\mathbf{x} - \hat{\mathbf{x}}\|^2 + \lambda \sum_j \text{KL}(\rho \,\|\, \hat{\rho}_j)$$
$$\text{KL}(\rho\,\|\,\hat{\rho}_j) = \rho\log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}$$
21
Variational Autoencoder (VAE)
Probabilistic generative model with a learned latent space.
Generative
Latent Variable Model
2013 — Kingma & Welling
Generative Model
$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z}, \quad p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$$
Evidence Lower Bound (ELBO)
$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\!\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - \text{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right) = \text{ELBO}$$
Reparameterization Trick
$$q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}\!\left(\boldsymbol{\mu}_\phi(\mathbf{x}),\, \text{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x}))\right)$$
$$\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
KL Divergence (Closed Form for Gaussians)
$$\text{KL}(q\,\|\,p) = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
Loss Function
$$\mathcal{L}_{\text{VAE}} = -\mathbb{E}_{q_\phi}\!\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] + \text{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z})\right)$$
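The reparameterized sample and the closed-form KL term in NumPy; the encoder outputs here are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(size=(4, 8))                     # stand-in encoder means
log_var = rng.normal(scale=0.1, size=(4, 8))     # stand-in encoder log-variances
sigma = np.exp(0.5 * log_var)

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
eps = rng.standard_normal(mu.shape)
z = mu + sigma * eps

# Closed-form KL(q || N(0, I)), summed over latent dims, averaged over batch
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1).mean()
```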
22
Generative Adversarial Network (GAN)
Two networks competing in a minimax game to generate realistic data.
Generative
Image Synthesis
2014 — Goodfellow et al.
Minimax Objective
$$\min_G \max_D\; V(D, G) = \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\!\left[\log D(\mathbf{x})\right] + \mathbb{E}_{\mathbf{z}\sim p_z}\!\left[\log(1 - D(G(\mathbf{z})))\right]$$
Optimal Discriminator
$$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}$$
Global Optimum
At the Nash equilibrium, $p_g = p_{\text{data}}$ and $D^*(\mathbf{x}) = \frac{1}{2}$:
$$V(D^*, G^*) = -\log 4$$
$$C(G) = -\log 4 + 2 \cdot \text{JSD}(p_{\text{data}} \,\|\, p_g)$$
Wasserstein GAN (WGAN)
$$W(p_{\text{data}}, p_g) = \sup_{\|f\|_L \leq 1}\; \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}[f(\mathbf{x})] - \mathbb{E}_{\mathbf{x}\sim p_g}[f(\mathbf{x})]$$
The critic $f$ (replacing the discriminator) is enforced to be 1-Lipschitz via gradient penalty:
$$\mathcal{L}_{\text{GP}} = \lambda\,\mathbb{E}_{\hat{\mathbf{x}}}\!\left[\left(\|\nabla_{\hat{\mathbf{x}}} f(\hat{\mathbf{x}})\|_2 - 1\right)^2\right]$$
$$\hat{\mathbf{x}} = \alpha\,\mathbf{x}_{\text{real}} + (1-\alpha)\,\mathbf{x}_{\text{fake}},\quad \alpha\sim U[0,1]$$
23
Diffusion Models (DDPM)
Generating data by learning to reverse a gradual noising process.
Generative
Image / Audio / Video
2020 — Ho, Jain, Abbeel
Forward Process (Diffusion)
$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t\mathbf{I}\right)$$
$$q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\, (1-\bar{\alpha}_t)\mathbf{I}\right)$$
Where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$.
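The closed-form forward jump in NumPy with the linear DDPM schedule; by $t = T$ the marginal is effectively $\mathcal{N}(\mathbf{0}, \mathbf{I})$. The sample count is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
beta = np.linspace(1e-4, 0.02, T)        # linear noise schedule
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)            # abar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps (closed-form q(x_t | x_0))."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.normal(size=10_000)
x_T = q_sample(x0, T - 1, rng.standard_normal(x0.shape))
```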
Reverse Process
$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \sigma_t^2\mathbf{I}\right)$$
$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right)$$
Training Objective (Simplified)
$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\!\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\; t\right)\right\|^2\right]$$
Score-Based Formulation
$$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) = -\sqrt{1-\bar{\alpha}_t}\,\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t) = -\sqrt{1-\bar{\alpha}_t}\,\mathbf{s}_\theta(\mathbf{x}_t, t)$$
Classifier-Free Guidance
$$\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t, c) = (1+w)\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) - w\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing)$$
24
Normalizing Flows
Exact likelihood models using invertible transformations.
Generative
Exact Likelihood
Change of Variables
$$\mathbf{x} = f(\mathbf{z}), \quad \mathbf{z} = f^{-1}(\mathbf{x})$$
$$\log p_\mathbf{x}(\mathbf{x}) = \log p_\mathbf{z}(f^{-1}(\mathbf{x})) + \log\left|\det\frac{\partial f^{-1}}{\partial \mathbf{x}}\right|$$
Composition of Flows
$$\mathbf{x} = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z})$$
$$\log p(\mathbf{x}) = \log p(\mathbf{z}) - \sum_{k=1}^{K}\log\left|\det\frac{\partial f_k}{\partial \mathbf{h}_{k-1}}\right|$$
Coupling Layer (RealNVP)
$$\mathbf{x}_{1:d} = \mathbf{z}_{1:d}$$
$$\mathbf{x}_{d+1:D} = \mathbf{z}_{d+1:D} \odot \exp\!\left(s(\mathbf{z}_{1:d})\right) + t(\mathbf{z}_{1:d})$$
The Jacobian is triangular, so $\det = \prod \exp(s_i) = \exp(\sum s_i)$, computed in $O(D)$.
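A coupling layer and its analytic inverse in NumPy; the toy $s$ and $t$ functions stand in for small learned networks:

```python
import numpy as np

def coupling_forward(z, s_fn, t_fn, d):
    """x_{1:d} = z_{1:d}; x_{d+1:D} = z_{d+1:D} * exp(s) + t; log|det J| = sum(s)."""
    z1, z2 = z[:d], z[d:]
    s, t = s_fn(z1), t_fn(z1)
    return np.concatenate([z1, z2 * np.exp(s) + t]), s.sum()

def coupling_inverse(x, s_fn, t_fn, d):
    """Invert by recomputing s, t from the untouched first block."""
    x1, x2 = x[:d], x[d:]
    s, t = s_fn(x1), t_fn(x1)
    return np.concatenate([x1, (x2 - t) * np.exp(-s)])

# Toy scale/translate functions (stand-ins for small MLPs)
s_fn = lambda z1: np.tanh(z1)
t_fn = lambda z1: 2.0 * z1

z = np.random.default_rng(0).normal(size=6)
x, log_det = coupling_forward(z, s_fn, t_fn, d=3)
z_rec = coupling_inverse(x, s_fn, t_fn, d=3)
```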
25
Energy-Based Models
Defining probability distributions via scalar energy functions.
Generative
Unnormalized
Energy Function
$$p_\theta(\mathbf{x}) = \frac{\exp(-E_\theta(\mathbf{x}))}{Z_\theta}, \quad Z_\theta = \int \exp(-E_\theta(\mathbf{x}))\,d\mathbf{x}$$
Score Matching
$$\mathcal{L}_{\text{SM}} = \mathbb{E}_{p_{\text{data}}}\!\left[\frac{1}{2}\|\nabla_\mathbf{x} \log p_\theta(\mathbf{x})\|^2 + \text{tr}(\nabla^2_\mathbf{x} \log p_\theta(\mathbf{x}))\right]$$
Contrastive Divergence
$$\nabla_\theta \log p_\theta(\mathbf{x}) = -\nabla_\theta E_\theta(\mathbf{x}) + \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(\mathbf{x})]$$
$$\approx -\nabla_\theta E_\theta(\mathbf{x}_{\text{data}}) + \nabla_\theta E_\theta(\tilde{\mathbf{x}})$$
Where $\tilde{\mathbf{x}}$ is obtained from a few steps of MCMC starting from data.
26
Siamese Networks & Contrastive Learning
Learning representations by comparing pairs or groups of inputs — foundation of CLIP, SimCLR, and modern self-supervised vision.
Self-Supervised
Representation Learning
1993 — Bromley et al. / 2020 — Chen et al.
Siamese Network
Two identical networks sharing weights process two inputs and compare their embeddings:
$$\mathbf{z}_1 = f_\theta(\mathbf{x}_1), \quad \mathbf{z}_2 = f_\theta(\mathbf{x}_2)$$
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{z}_1 - \mathbf{z}_2\|_2$$
Contrastive Loss
$$\mathcal{L}_{\text{contrastive}} = (1-y)\frac{1}{2}d^2 + y\frac{1}{2}\max(0, m - d)^2$$
Where $y=0$ for similar pairs, $y=1$ for dissimilar, and $m$ is the margin.
Triplet Loss
$$\mathcal{L}_{\text{triplet}} = \max\!\left(0,\; \|f(\mathbf{x}_a) - f(\mathbf{x}_p)\|^2 - \|f(\mathbf{x}_a) - f(\mathbf{x}_n)\|^2 + \alpha\right)$$
NT-Xent Loss (SimCLR)
Normalized temperature-scaled cross-entropy over a batch of $2N$ augmented pairs:
$$\text{sim}(\mathbf{z}_i, \mathbf{z}_j) = \frac{\mathbf{z}_i^\top\mathbf{z}_j}{\|\mathbf{z}_i\|\,\|\mathbf{z}_j\|}$$
$$\ell_{i,j} = -\log\frac{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_j)/\tau)}{\sum_{k=1}^{2N}\mathbf{1}_{[k\neq i]}\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_k)/\tau)}$$
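The NT-Xent loss above can be sketched in a few lines of NumPy. The batch layout (rows $2k$ and $2k{+}1$ form a positive pair), the temperature $\tau = 0.5$, and the batch size are illustrative assumptions, not part of SimCLR's fixed recipe:

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N embeddings, where rows 2k and 2k+1 are a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors -> dot = cosine sim
    sim = z @ z.T / tau                                # pairwise similarities, scaled
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity (k != i)
    n = len(z)
    pos = np.arange(n) ^ 1                             # index of each row's positive partner
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_softmax[np.arange(n), pos].mean()      # -log p(positive | anchor), averaged

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                           # 2N = 8 toy embeddings
loss = nt_xent(z)
```

Making the two views of each pair identical drives the positive similarities to their maximum and lowers the loss, which is a quick sanity check on an implementation.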
CLIP (Contrastive Language-Image Pre-training)
Aligns image and text embeddings using a symmetric contrastive loss over a batch of $N$ image-text pairs:
$$\mathbf{z}_I = f_{\text{image}}(\mathbf{x}_I), \quad \mathbf{z}_T = f_{\text{text}}(\mathbf{x}_T)$$
$$\text{logits} = \mathbf{Z}_I\,\mathbf{Z}_T^\top \cdot e^\tau$$
$$\mathcal{L}_{\text{CLIP}} = \frac{1}{2}\left(\text{CE}(\text{logits},\, \mathbf{I}_N) + \text{CE}(\text{logits}^\top,\, \mathbf{I}_N)\right)$$
BYOL / SimSiam (No Negatives)
$$\mathcal{L}_{\text{BYOL}} = 2 - 2\cdot\frac{\langle q_\theta(\mathbf{z}_1),\, \text{sg}(\mathbf{z}_2')\rangle}{\|q_\theta(\mathbf{z}_1)\|\,\|\mathbf{z}_2'\|}$$
Where $\text{sg}(\cdot)$ is stop-gradient, $\mathbf{z}_2'$ comes from an EMA target encoder, and $q_\theta$ is a predictor MLP.
27
JEPA (Joint Embedding Predictive Architecture)
Yann LeCun's proposed path to human-level AI — predicting in latent space rather than pixel space.
Self-Supervised
Representation Learning / World Models
2022 — LeCun / 2023 — Assran et al. (I-JEPA)
Core Principle
Unlike generative models (which predict pixels) or contrastive models (which compare positive/negative pairs), JEPA predicts the representation of a target from a context — entirely in embedding space:
$$\text{Generative:}\quad \text{predict } \mathbf{x} \;\text{(pixel space)}$$
$$\text{Contrastive:}\quad \text{maximize } \text{sim}(f(\mathbf{x}), f(\mathbf{x}^+)) \text{ vs } f(\mathbf{x}^-)$$
$$\text{JEPA:}\quad \text{predict } \bar{f}(\mathbf{y}) \text{ from } f(\mathbf{x}) \;\text{(latent space)}$$
Architecture
$$\mathbf{s}_x = f_\theta(\mathbf{x}) \quad \text{(context encoder)}$$
$$\bar{\mathbf{s}}_y = \bar{f}_{\bar{\theta}}(\mathbf{y}) \quad \text{(target encoder — EMA of } f_\theta\text{)}$$
$$\hat{\mathbf{s}}_y = g_\phi(\mathbf{s}_x, \mathbf{m}) \quad \text{(predictor, conditioned on mask } \mathbf{m}\text{)}$$
Loss Function
$$\mathcal{L}_{\text{JEPA}} = \|\hat{\mathbf{s}}_y - \text{sg}(\bar{\mathbf{s}}_y)\|^2$$
Where $\text{sg}(\cdot)$ is stop-gradient. The target encoder $\bar{f}$ is updated via exponential moving average (EMA):
$$\bar{\theta} \leftarrow \alpha\,\bar{\theta} + (1-\alpha)\,\theta, \quad \alpha \in [0.996, 1)$$
I-JEPA (Image JEPA)
The context encoder sees a partial view of the image (with masked patches), and the predictor must predict target block representations in latent space:
$$\mathbf{x}_{\text{context}} = \text{ViT}_\theta(\text{visible patches})$$
$$\hat{\mathbf{s}}_{y_m} = g_\phi(\mathbf{x}_{\text{context}}, \text{pos}(y_m)) \quad \forall\, m \in \text{target blocks}$$
$$\mathcal{L}_{\text{I-JEPA}} = \frac{1}{M}\sum_{m=1}^M \|\hat{\mathbf{s}}_{y_m} - \text{sg}(\bar{\mathbf{s}}_{y_m})\|^2$$
V-JEPA (Video JEPA)
Extends to video by masking spacetime tubes and predicting their latent representations:
$$\mathbf{x} \in \mathbb{R}^{T \times H \times W \times C} \rightarrow \text{mask spacetime tubes} \rightarrow \text{predict in latent space}$$
JEPA vs Other Paradigms
| Method | Prediction Space | Negatives? | Collapse Prevention |
| Autoencoder | Pixel / Input | No | Bottleneck |
| Contrastive (SimCLR) | Latent (similarity) | Yes | Negative pairs |
| BYOL / SimSiam | Latent | No | EMA + stop-gradient |
| JEPA | Latent (prediction) | No | EMA + stop-gradient + masking |
28
Graph Neural Networks
Neural networks operating on graph-structured data.
Supervised / Semi-Supervised
Graphs & Networks
Message Passing Framework
$$\mathbf{m}_v^{(l)} = \bigoplus_{u \in \mathcal{N}(v)} M^{(l)}\!\left(\mathbf{h}_v^{(l)}, \mathbf{h}_u^{(l)}, \mathbf{e}_{vu}\right)$$
$$\mathbf{h}_v^{(l+1)} = U^{(l)}\!\left(\mathbf{h}_v^{(l)}, \mathbf{m}_v^{(l)}\right)$$
Graph Convolutional Network (GCN)
$$\mathbf{H}^{(l+1)} = \sigma\!\left(\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}\mathbf{H}^{(l)}\mathbf{W}^{(l)}\right)$$
Where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ (adjacency with self-loops), $\tilde{\mathbf{D}}_{ii} = \sum_j \tilde{A}_{ij}$.
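A minimal NumPy sketch of one GCN propagation step on a toy 3-node path graph; the one-hot features and all-ones weight matrix are illustrative assumptions:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_tilde = A + np.eye(len(A))                  # add self-loops
    d = A_tilde.sum(axis=1)                       # degrees of the self-looped graph
    D_inv_sqrt = np.diag(d ** -0.5)               # D^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)         # ReLU nonlinearity

# path graph 0 - 1 - 2
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.eye(3)                                     # one-hot node features
W = np.ones((3, 2))                               # toy weight matrix
out = gcn_layer(A, H, W)
```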
Graph Attention Network (GAT)
$$e_{ij} = \text{LeakyReLU}\!\left(\mathbf{a}^\top [\mathbf{W}\mathbf{h}_i \,\|\, \mathbf{W}\mathbf{h}_j]\right)$$
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k\in\mathcal{N}(i)}\exp(e_{ik})}$$
$$\mathbf{h}_i' = \sigma\!\left(\sum_{j\in\mathcal{N}(i)}\alpha_{ij}\,\mathbf{W}\mathbf{h}_j\right)$$
GraphSAGE
$$\mathbf{h}_{\mathcal{N}(v)}^{(l)} = \text{AGGREGATE}\!\left(\left\{\mathbf{h}_u^{(l)}: u \in \mathcal{N}(v)\right\}\right)$$
$$\mathbf{h}_v^{(l+1)} = \sigma\!\left(\mathbf{W}^{(l)}\cdot[\mathbf{h}_v^{(l)} \,\|\, \mathbf{h}_{\mathcal{N}(v)}^{(l)}]\right)$$
Graph Readout
$$\mathbf{h}_G = \text{READOUT}\!\left(\left\{\mathbf{h}_v^{(L)} : v \in G\right\}\right) = \frac{1}{|V|}\sum_{v\in V}\mathbf{h}_v^{(L)}$$
29
Capsule Networks
Encoding part-whole relationships with vector-valued capsules.
Supervised
Computer Vision
2017 — Sabour, Frosst, Hinton
Squash Function
$$\text{squash}(\mathbf{s}_j) = \frac{\|\mathbf{s}_j\|^2}{1 + \|\mathbf{s}_j\|^2}\cdot\frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}$$
Dynamic Routing
$$\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij}\,\mathbf{u}_i \quad \text{(prediction vectors)}$$
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \quad \text{(coupling coefficients)}$$
$$\mathbf{s}_j = \sum_i c_{ij}\,\hat{\mathbf{u}}_{j|i}, \quad \mathbf{v}_j = \text{squash}(\mathbf{s}_j)$$
$$b_{ij} \leftarrow b_{ij} + \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j \quad \text{(routing update)}$$
Margin Loss
$$\mathcal{L}_k = T_k \max(0, m^+ - \|\mathbf{v}_k\|)^2 + \lambda(1-T_k)\max(0, \|\mathbf{v}_k\| - m^-)^2$$
30
Hopfield Network
Associative memory via energy minimization in a fully connected network.
Unsupervised
Associative Memory
1982 — Hopfield
Energy Function
$$E = -\frac{1}{2}\sum_{i\neq j} w_{ij}\,s_i\,s_j - \sum_i \theta_i\,s_i = -\frac{1}{2}\mathbf{s}^\top\mathbf{W}\mathbf{s} - \boldsymbol{\theta}^\top\mathbf{s}$$
Hebbian Learning (Storage)
$$w_{ij} = \frac{1}{N}\sum_{\mu=1}^{P}\xi_i^\mu\,\xi_j^\mu, \quad w_{ii}=0$$
$$\mathbf{W} = \frac{1}{N}\sum_{\mu=1}^{P}\boldsymbol{\xi}^\mu (\boldsymbol{\xi}^\mu)^\top - \frac{P}{N}\mathbf{I}$$
Update Rule (Asynchronous)
$$s_i \leftarrow \text{sgn}\!\left(\sum_j w_{ij}\,s_j + \theta_i\right)$$
Storage Capacity
$$P_{\max} \approx \frac{N}{2\ln N}$$
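The Hebbian storage and sign-update rules above in NumPy, demonstrated on two orthogonal 8-bit patterns. The update here is synchronous for brevity, whereas the rule in the text is asynchronous; both minimize the same energy for these toy patterns:

```python
import numpy as np

def store(patterns):
    """Hebbian storage: W = (1/N) sum_mu xi xi^T, zero diagonal."""
    N = patterns.shape[1]
    W = patterns.T @ patterns / N
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, s, steps=5):
    """Repeated sign updates (thresholds theta = 0), synchronous for simplicity."""
    for _ in range(steps):
        s = np.where(W @ s >= 0, 1, -1)
    return s

xi = np.array([[1, -1, 1, -1, 1, -1, 1, -1],     # two orthogonal bipolar patterns
               [1, 1, -1, -1, 1, 1, -1, -1]])
W = store(xi)
probe = xi[0].copy()
probe[0] *= -1                                    # corrupt one bit
recovered = recall(W, probe)                      # converges back to the stored pattern
```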
Modern Hopfield Network (2020)
$$E = -\text{lse}(\beta\,\boldsymbol{\Xi}^\top\boldsymbol{\xi}) + \frac{1}{2}\boldsymbol{\xi}^\top\boldsymbol{\xi} + \text{const}$$
$$\text{Update:}\quad \boldsymbol{\xi}_{\text{new}} = \boldsymbol{\Xi}\,\text{softmax}(\beta\,\boldsymbol{\Xi}^\top\boldsymbol{\xi})$$
This update rule is equivalent to the attention mechanism in transformers.
31
Boltzmann Machine
Stochastic neural network based on statistical mechanics.
Generative
Stochastic
1985 — Hinton & Sejnowski
Energy Function
$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{v}^\top\mathbf{W}\mathbf{h} - \mathbf{b}^\top\mathbf{v} - \mathbf{c}^\top\mathbf{h} - \frac{1}{2}\mathbf{v}^\top\mathbf{L}\mathbf{v} - \frac{1}{2}\mathbf{h}^\top\mathbf{J}\mathbf{h}$$
Probability Distribution
$$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z}\exp(-E(\mathbf{v}, \mathbf{h})), \quad Z = \sum_{\mathbf{v},\mathbf{h}}\exp(-E(\mathbf{v},\mathbf{h}))$$
Stochastic Update
$$p(s_i = 1 | \mathbf{s}_{-i}) = \sigma\!\left(\sum_j w_{ij} s_j + b_i\right)$$
32
Restricted Boltzmann Machine (RBM)
A bipartite Boltzmann machine enabling efficient training via Gibbs sampling.
Generative
2006 — Hinton
Energy
$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{v}^\top\mathbf{W}\mathbf{h} - \mathbf{b}^\top\mathbf{v} - \mathbf{c}^\top\mathbf{h}$$
Conditional Distributions
$$p(h_j = 1|\mathbf{v}) = \sigma\!\left(\mathbf{W}_{:,j}^\top\mathbf{v} + c_j\right)$$
$$p(v_i = 1|\mathbf{h}) = \sigma\!\left(\mathbf{W}_{i,:}\mathbf{h} + b_i\right)$$
Contrastive Divergence (CD-k)
$$\Delta \mathbf{W} = \eta\left(\langle\mathbf{v}\mathbf{h}^\top\rangle_{\text{data}} - \langle\mathbf{v}\mathbf{h}^\top\rangle_{\text{recon}}\right)$$
Free Energy
$$F(\mathbf{v}) = -\mathbf{b}^\top\mathbf{v} - \sum_j \log\!\left(1 + \exp(\mathbf{W}_{:,j}^\top\mathbf{v} + c_j)\right)$$
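A CD-1 update in NumPy, following the conditional distributions and gradient estimate above. The toy dimensions, data, and learning rate are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, eta=0.1):
    """One CD-1 update: positive phase from data, negative phase from one Gibbs step."""
    ph0 = sigmoid(v0 @ W + c)                      # p(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0       # sample hidden states
    pv1 = sigmoid(h0 @ W.T + b)                    # p(v=1 | h0), reconstruction
    v1 = (rng.random(pv1.shape) < pv1) * 1.0
    ph1 = sigmoid(v1 @ W + c)
    dW = v0.T @ ph0 - v1.T @ ph1                   # <v h>_data - <v h>_recon
    return (W + eta * dW / len(v0),
            b + eta * (v0 - v1).mean(axis=0),
            c + eta * (ph0 - ph1).mean(axis=0))

v = rng.integers(0, 2, size=(16, 6)).astype(float)   # toy binary data batch
W = rng.normal(scale=0.1, size=(6, 4))
b, c = np.zeros(6), np.zeros(4)
W, b, c = cd1_step(v, W, b, c)
```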
33
Radial Basis Function Network
Using radial basis functions as activation in a single hidden layer.
Supervised
Function Approximation
Architecture
$$\phi_j(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right)$$
$$f(\mathbf{x}) = \sum_{j=1}^{K} w_j\,\phi_j(\mathbf{x}) + b = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}) + b$$
Training
Typically a two-phase process: (1) find centers $\boldsymbol{\mu}_j$ via k-means clustering; (2) solve for weights $\mathbf{w}$ via least squares:
$$\mathbf{w}^* = (\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{y}$$
Where $\Phi_{ij} = \phi_j(\mathbf{x}_i)$ is the interpolation matrix.
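The two-phase procedure in NumPy, with one simplification: centers are subsampled from the data rather than found by k-means. The 1-D sine target, center count, and width $\sigma = 1$ are illustrative assumptions:

```python
import numpy as np

X = np.linspace(-3, 3, 40)[:, None]                  # 40 one-dimensional inputs
y = np.sin(X).ravel()                                # target function

# Phase 1 (simplified): take every 8th point as a center instead of running k-means.
centers = X[::8]                                     # 5 evenly spaced centers
sigma = 1.0

# Phase 2: Gaussian design matrix, then ordinary least squares for the weights.
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
Phi = np.exp(-dists ** 2 / (2 * sigma ** 2))         # Phi_ij = phi_j(x_i)
Phi = np.hstack([Phi, np.ones((len(X), 1))])         # bias column
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # w* = (Phi^T Phi)^-1 Phi^T y
mse = np.mean((Phi @ w - y) ** 2)                    # training error of the fit
```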
34
Self-Organizing Map (SOM)
Unsupervised learning that maps high-dimensional data to a low-dimensional grid preserving topology.
Unsupervised
Dimensionality Reduction
1982 — Kohonen
Best Matching Unit (BMU)
$$c = \arg\min_j \|\mathbf{x} - \mathbf{w}_j\|$$
Weight Update
$$\mathbf{w}_j(t+1) = \mathbf{w}_j(t) + \eta(t)\,h_{cj}(t)\,\left(\mathbf{x}(t) - \mathbf{w}_j(t)\right)$$
Neighborhood Function
$$h_{cj}(t) = \exp\!\left(-\frac{\|r_c - r_j\|^2}{2\sigma(t)^2}\right)$$
Both $\eta(t)$ and $\sigma(t)$ decrease monotonically over training.
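The BMU search, neighborhood function, and decaying update above, sketched for a 1-D SOM learning to cover uniform data on $[0, 1]$. The grid size, schedules, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.random((10, 1))                 # 10 grid nodes, 1-D feature space
grid = np.arange(10, dtype=float)             # node positions on the map

def som_step(x, weights, eta, sigma):
    c = np.argmin(np.linalg.norm(weights - x, axis=1))        # best matching unit
    h = np.exp(-((grid - grid[c]) ** 2) / (2 * sigma ** 2))   # neighborhood function
    return weights + eta * h[:, None] * (x - weights)         # pull BMU and neighbors

T = 2000
for t in range(T):
    eta = 0.5 * (1 - t / T)                   # monotonically decaying learning rate
    sigma = 3.0 * (1 - t / T) + 0.1           # shrinking neighborhood radius
    weights = som_step(rng.random((1,)), weights, eta, sigma)
```

After training, the node weights spread out to quantize the data range while neighboring grid nodes hold nearby values.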
35
Residual Networks (ResNet)
Skip connections enabling training of very deep networks.
Supervised
Computer Vision
2015 — He et al.
Residual Block
$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$$
The network learns the residual $\mathcal{F}(\mathbf{x}) = \mathbf{y} - \mathbf{x}$ rather than the full mapping.
Bottleneck Block
$$\mathcal{F}(\mathbf{x}) = \mathbf{W}_3\,\text{ReLU}\!\left(\text{BN}\!\left(\mathbf{W}_2\,\text{ReLU}\!\left(\text{BN}(\mathbf{W}_1\mathbf{x})\right)\right)\right)$$
$\mathbf{W}_1$ reduces channels (1×1), $\mathbf{W}_2$ is 3×3 conv, $\mathbf{W}_3$ expands channels (1×1).
Gradient Flow
$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L}\left(1 + \frac{\partial}{\partial \mathbf{x}_l}\sum_{i=l}^{L-1}\mathcal{F}(\mathbf{x}_i)\right)$$
The "1 +" term ensures gradients can flow directly to any layer without attenuation.
Pre-Activation ResNet
$$\mathbf{y} = \mathbf{x} + \mathbf{W}_2\,\text{ReLU}\!\left(\text{BN}\!\left(\mathbf{W}_1\,\text{ReLU}(\text{BN}(\mathbf{x}))\right)\right)$$
36
Neural Ordinary Differential Equations
Continuous-depth networks defined by differential equations.
Architecture
2018 — Chen et al.
Continuous Dynamics
$$\frac{d\mathbf{h}(t)}{dt} = f_\theta(\mathbf{h}(t), t)$$
$$\mathbf{h}(T) = \mathbf{h}(0) + \int_0^T f_\theta(\mathbf{h}(t), t)\,dt$$
Adjoint Method (Memory-Efficient Backprop)
$$\mathbf{a}(t) = \frac{\partial \mathcal{L}}{\partial \mathbf{h}(t)}$$
$$\frac{d\mathbf{a}}{dt} = -\mathbf{a}(t)^\top \frac{\partial f_\theta}{\partial \mathbf{h}}$$
$$\frac{d\mathcal{L}}{d\theta} = -\int_T^0 \mathbf{a}(t)^\top \frac{\partial f_\theta(\mathbf{h}(t), t)}{\partial \theta}\,dt$$
Memory cost is $O(1)$ regardless of depth, since states are recomputed during the backward ODE solve.
Connection to ResNets
$$\text{ResNet:}\quad \mathbf{h}_{t+1} = \mathbf{h}_t + f_\theta(\mathbf{h}_t) \quad\longleftrightarrow\quad \text{Neural ODE:}\quad \frac{d\mathbf{h}}{dt} = f_\theta(\mathbf{h}, t)$$
37
Echo State Network (Reservoir Computing)
A fixed random recurrent reservoir with only output weights trained.
Supervised
Time Series
2001 — Jaeger
Reservoir Dynamics
$$\mathbf{h}(t) = (1-\alpha)\mathbf{h}(t-1) + \alpha\,\tanh\!\left(\mathbf{W}_{\text{res}}\mathbf{h}(t-1) + \mathbf{W}_{\text{in}}\mathbf{x}(t) + \mathbf{b}\right)$$
$\mathbf{W}_{\text{res}}$ and $\mathbf{W}_{\text{in}}$ are random and fixed. $\alpha$ is the leaking rate.
Output (Readout)
$$\mathbf{y}(t) = \mathbf{W}_{\text{out}}\,[\mathbf{h}(t);\, \mathbf{x}(t)]$$
$$\mathbf{W}_{\text{out}} = \mathbf{Y}\mathbf{H}^\top(\mathbf{H}\mathbf{H}^\top + \lambda\mathbf{I})^{-1}$$
Echo State Property
The reservoir must satisfy the echo state property: the influence of initial conditions must fade over time. In practice this is enforced by rescaling $\mathbf{W}_{\text{res}}$ so that its spectral radius satisfies $\rho(\mathbf{W}_{\text{res}}) < 1$.
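The whole pipeline fits in a short NumPy sketch: fixed random reservoir, ridge-regression readout as in the equation above. One simplification is assumed — the readout sees only $\mathbf{h}(t)$, not $[\mathbf{h}(t); \mathbf{x}(t)]$ — and the task (one-step-ahead prediction of a sine wave), reservoir size, and $\lambda$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 500                                        # reservoir size, sequence length

W_res = rng.normal(size=(N, N))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))  # rescale to spectral radius 0.9
W_in = rng.normal(scale=0.5, size=(N,))

x = np.sin(0.1 * np.arange(T + 1))                     # signal; predict x(t+1) from x(t)
H = np.zeros((N, T))
h = np.zeros(N)
for t in range(T):                                     # leaking rate alpha = 1 for brevity
    h = np.tanh(W_res @ h + W_in * x[t])
    H[:, t] = h

Y = x[1:T + 1][None, :]                                # targets: next value
lam = 1e-6                                             # ridge regularizer
W_out = Y @ H.T @ np.linalg.inv(H @ H.T + lam * np.eye(N))
mse = np.mean((W_out @ H - Y) ** 2)                    # readout fit error
```

Only `W_out` is trained; the reservoir and input weights stay frozen throughout.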
38
Spiking Neural Network
Biologically plausible networks where neurons communicate via discrete spikes.
Neuromorphic
Event-Driven
Leaky Integrate-and-Fire (LIF) Model
$$\tau_m \frac{dV(t)}{dt} = -[V(t) - V_{\text{rest}}] + R\,I(t)$$
$$\text{If } V(t) \geq V_{\text{th}}:\quad \text{emit spike, } V(t) \leftarrow V_{\text{reset}}$$
Discrete LIF
$$V[t] = \beta\,V[t-1] + \sum_j w_j\,S_j[t] - V_{\text{th}}\,S_{\text{out}}[t-1]$$
$$S_{\text{out}}[t] = \Theta(V[t] - V_{\text{th}})$$
Where $\beta = \exp(-\Delta t / \tau_m)$ is the decay factor and $\Theta$ is the Heaviside step function.
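The discrete LIF recurrence above, simulated in NumPy under a constant input current. The drive strength, time constant, and threshold are illustrative assumptions:

```python
import numpy as np

def simulate_lif(I, tau_m=10.0, dt=1.0, v_th=1.0):
    """Discrete LIF with soft reset: V[t] = beta*V[t-1] + I[t] - v_th*S[t-1]."""
    beta = np.exp(-dt / tau_m)                # membrane decay per step
    V, s_prev, spikes = 0.0, 0, []
    for i_t in I:
        V = beta * V + i_t - v_th * s_prev    # leak, integrate, subtract last spike
        s_prev = int(V >= v_th)               # Heaviside threshold
        spikes.append(s_prev)
    return np.array(spikes)

spikes = simulate_lif(np.full(100, 0.2))      # constant drive strong enough to spike
rate = spikes.mean()                          # firing rate over the window
```

With this drive the membrane charges toward $V_\infty = 0.2/(1-\beta) \approx 2.1 > V_{\text{th}}$, so the neuron fires periodically.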
Surrogate Gradient
Since $\Theta'(x) = \delta(x)$ is not useful for backprop, replace with a smooth surrogate:
$$\tilde{\Theta}'(x) = \frac{1}{\pi}\cdot\frac{1}{1 + (\pi x)^2} \quad\text{(arctan surrogate)}$$
Spike-Timing-Dependent Plasticity (STDP)
$$\Delta w = \begin{cases} A_+ \exp\!\left(-\frac{\Delta t}{\tau_+}\right) & \text{if } \Delta t > 0 \text{ (pre before post)} \\ -A_- \exp\!\left(\frac{\Delta t}{\tau_-}\right) & \text{if } \Delta t < 0 \text{ (post before pre)} \end{cases}$$
39
Kolmogorov-Arnold Network (KAN)
Learnable activation functions on edges, based on the Kolmogorov-Arnold representation theorem.
Architecture
2024 — Liu et al.
Kolmogorov-Arnold Representation Theorem
$$f(\mathbf{x}) = f(x_1, \dots, x_n) = \sum_{q=0}^{2n}\Phi_q\!\left(\sum_{p=1}^n \phi_{q,p}(x_p)\right)$$
KAN Layer
Each edge $(i, j)$ has a learnable univariate function $\phi_{ij}$, parameterized by B-splines:
$$\phi_{ij}(x) = w_b\,\text{SiLU}(x) + w_s\,\text{Spline}(x)$$
$$\text{Spline}(x) = \sum_k c_k\,B_k(x)$$
Layer Computation
$$x_j^{(l+1)} = \sum_{i=1}^{n_l} \phi_{ij}^{(l)}(x_i^{(l)})$$
Compared to MLPs, which place fixed activations on nodes and learnable linear weights on edges, KANs place learnable nonlinear functions on edges and use simple summation on nodes.
40
State Space Models (S4 / Mamba)
Sequence models based on continuous-time state space representations with efficient linear-time computation.
Sequence Modeling
2021 — Gu et al.
Continuous State Space
$$\frac{d\mathbf{h}(t)}{dt} = \mathbf{A}\,\mathbf{h}(t) + \mathbf{B}\,x(t)$$
$$y(t) = \mathbf{C}\,\mathbf{h}(t) + D\,x(t)$$
Discretization (Zero-Order Hold)
$$\bar{\mathbf{A}} = \exp(\Delta\mathbf{A}) \approx (\mathbf{I} - \Delta\mathbf{A}/2)^{-1}(\mathbf{I} + \Delta\mathbf{A}/2)$$
$$\bar{\mathbf{B}} = (\Delta\mathbf{A})^{-1}(\bar{\mathbf{A}} - \mathbf{I})\cdot\Delta\mathbf{B}$$
Discrete Recurrence
$$\mathbf{h}_k = \bar{\mathbf{A}}\,\mathbf{h}_{k-1} + \bar{\mathbf{B}}\,x_k$$
$$y_k = \mathbf{C}\,\mathbf{h}_k + D\,x_k$$
Convolution Form
$$\bar{\mathbf{K}} = (\mathbf{C}\bar{\mathbf{B}},\; \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\; \dots,\; \mathbf{C}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}})$$
$$\mathbf{y} = \bar{\mathbf{K}} * \mathbf{x}$$
Computed in $O(L \log L)$ via FFT during training.
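The recurrent and convolutional views compute the same outputs, which a few lines of NumPy can verify. A diagonal, already-discretized $\bar{\mathbf{A}}$ and $D = 0$ are assumed for simplicity, and the convolution is done naively rather than via FFT:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 32                                    # state size, sequence length
A = np.diag(rng.uniform(0.5, 0.9, N))           # stable diagonal A-bar (toy choice)
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=L)

# Recurrent form: h_k = A h_{k-1} + B x_k,  y_k = C h_k
h = np.zeros((N, 1))
y_rec = np.zeros(L)
for k in range(L):
    h = A @ h + B * x[k]
    y_rec[k] = (C @ h).item()

# Convolution form: y = K * x with kernel K_j = C A^j B
K = np.array([(C @ np.linalg.matrix_power(A, j) @ B).item() for j in range(L)])
y_conv = np.array([sum(K[j] * x[k - j] for j in range(k + 1)) for k in range(L)])
```

The recurrence is used for $O(1)$-per-step autoregressive inference; the convolution (with FFT) for parallel training.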
HiPPO Initialization
$$A_{nk} = -\begin{cases} (2n+1)^{1/2}(2k+1)^{1/2} & \text{if } n > k \\ n+1 & \text{if } n = k \\ 0 & \text{if } n < k \end{cases}$$
Selective SSM (Mamba)
Makes parameters input-dependent for content-aware reasoning:
$$\mathbf{B}_k = s_B(\mathbf{x}_k), \quad \mathbf{C}_k = s_C(\mathbf{x}_k), \quad \Delta_k = \text{softplus}(s_\Delta(\mathbf{x}_k))$$
41
Hypernetworks
Networks that generate the weights of another network.
Meta-Learning
2016 — Ha, Dai, Le
Formulation
$$\boldsymbol{\theta} = h_\psi(\mathbf{z})$$
$$\hat{\mathbf{y}} = f_{\boldsymbol{\theta}}(\mathbf{x}) = f_{h_\psi(\mathbf{z})}(\mathbf{x})$$
The hypernetwork $h_\psi$ maps an embedding $\mathbf{z}$ (which can be task-specific, layer-specific, or input-dependent) to the parameters of the main network $f$.
Training
$$\mathcal{L}(\psi) = \mathbb{E}\!\left[\ell\!\left(f_{h_\psi(\mathbf{z})}(\mathbf{x}),\, \mathbf{y}\right)\right]$$
$$\nabla_\psi\mathcal{L} = \nabla_\theta\ell \cdot \frac{\partial h_\psi(\mathbf{z})}{\partial \psi}$$
42
Neural Cellular Automata
Learned local update rules that produce global emergent behavior.
Self-Organizing
Morphogenesis
2020 — Mordvintsev et al.
Cell State Update
$$\text{Perception:}\quad \mathbf{p}_i = [\text{Sobel}_x * \mathbf{s}_i;\; \text{Sobel}_y * \mathbf{s}_i;\; \mathbf{s}_i]$$
$$\text{Update:}\quad \Delta\mathbf{s}_i = f_\theta(\mathbf{p}_i)$$
$$\text{Stochastic mask:}\quad m_i \sim \text{Bernoulli}(p)$$
$$\mathbf{s}_i^{t+1} = \mathbf{s}_i^t + m_i \cdot \Delta\mathbf{s}_i$$
All cells share the same neural network $f_\theta$, and the stochastic update mask enforces asynchrony for robustness.
Training via Differentiable Simulation
$$\mathcal{L} = \mathbb{E}_{t\sim[t_{\min}, t_{\max}]}\!\left[\|\mathbf{S}^{(t)} - \mathbf{S}_{\text{target}}\|^2\right]$$
Gradients are backpropagated through time across the simulation steps.
43
Neural Turing Machine / Differentiable Neural Computer
Neural networks augmented with external differentiable memory — capable of learning algorithms.
Supervised
Algorithmic Reasoning
2014 — Graves et al.
Architecture
A controller network (LSTM or MLP) interacts with an external memory matrix $\mathbf{M} \in \mathbb{R}^{N \times M}$ via differentiable read/write heads:
Addressing — Content-Based
$$w_t^c(i) = \frac{\exp(\beta_t\, K[\mathbf{k}_t, \mathbf{M}_t(i)])}{\sum_j \exp(\beta_t\, K[\mathbf{k}_t, \mathbf{M}_t(j)])}$$
$$K[\mathbf{u}, \mathbf{v}] = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|} \quad \text{(cosine similarity)}$$
Addressing — Location-Based
$$\mathbf{w}_t^g = g_t\,\mathbf{w}_t^c + (1-g_t)\,\mathbf{w}_{t-1} \quad \text{(interpolation)}$$
$$\tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j) \quad \text{(convolutional shift)}$$
$$w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}} \quad \text{(sharpening)}$$
Read & Write
$$\text{Read:}\quad \mathbf{r}_t = \sum_i w_t^r(i)\,\mathbf{M}_t(i)$$
$$\text{Write:}\quad \mathbf{M}_t = \mathbf{M}_{t-1}\odot(\mathbf{1} - \mathbf{w}_t^w\,\mathbf{e}_t^\top) + \mathbf{w}_t^w\,\mathbf{a}_t^\top$$
$\mathbf{e}_t$ is the erase vector and $\mathbf{a}_t$ is the add vector.
44
Bayesian Neural Network
Placing probability distributions over weights for principled uncertainty quantification.
Probabilistic
Uncertainty Estimation
Bayesian Inference over Weights
$$p(\boldsymbol{\theta}|\mathcal{D}) = \frac{p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{p(\mathcal{D})} = \frac{p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{\int p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}}$$
Predictive Distribution
$$p(\mathbf{y}^*|\mathbf{x}^*, \mathcal{D}) = \int p(\mathbf{y}^*|\mathbf{x}^*, \boldsymbol{\theta})\,p(\boldsymbol{\theta}|\mathcal{D})\,d\boldsymbol{\theta}$$
$$\approx \frac{1}{S}\sum_{s=1}^{S} p(\mathbf{y}^*|\mathbf{x}^*, \boldsymbol{\theta}^{(s)}), \quad \boldsymbol{\theta}^{(s)} \sim p(\boldsymbol{\theta}|\mathcal{D})$$
Variational Inference (Bayes by Backprop)
Approximate the intractable posterior with $q_\phi(\boldsymbol{\theta})$:
$$\mathcal{L}_{\text{VI}} = \text{KL}(q_\phi(\boldsymbol{\theta})\,\|\,p(\boldsymbol{\theta})) - \mathbb{E}_{q_\phi}[\log p(\mathcal{D}|\boldsymbol{\theta})]$$
With the reparameterization trick: $\theta_i = \mu_i + \sigma_i\,\epsilon$, $\epsilon\sim\mathcal{N}(0,1)$.
MC Dropout as Approximate BNN
$$\text{Var}[\mathbf{y}^*] \approx \frac{1}{T}\sum_{t=1}^{T}\hat{\mathbf{y}}_t^2 - \left(\frac{1}{T}\sum_{t=1}^T \hat{\mathbf{y}}_t\right)^2$$
Running $T$ forward passes with dropout enabled at test time provides uncertainty estimates.
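A minimal MC-dropout sketch in NumPy: a toy random regression net with the dropout mask resampled on every forward pass, exactly as at test time. The layer sizes, dropout rate, and $T = 200$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(1, 32))          # toy untrained weights
W2 = rng.normal(scale=0.5, size=(32, 1))

def forward(x, p=0.5):
    h = np.maximum(x @ W1, 0)                     # ReLU hidden layer
    mask = (rng.random(h.shape) > p) / (1 - p)    # inverted dropout, resampled per call
    return (h * mask) @ W2

x = np.array([[1.0]])
preds = np.array([forward(x).item() for _ in range(200)])   # T stochastic passes
mean, var = preds.mean(), preds.var()             # predictive mean and uncertainty
```

The spread of `preds` is the (approximate) epistemic uncertainty from the variance formula above.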
45
Liquid Neural Network
Continuous-time neural networks with input-dependent dynamics — inspired by C. elegans neuroscience.
Neuromorphic
Time Series / Robotics
2021 — Hasani et al. (MIT)
Liquid Time-Constant (LTC) Neuron
$$\frac{d\mathbf{h}(t)}{dt} = -\left[\frac{1}{\tau} + f_\theta(\mathbf{h}(t), \mathbf{x}(t))\right]\odot\mathbf{h}(t) + f_\theta(\mathbf{h}(t), \mathbf{x}(t))\odot A$$
The key insight: the time constant $\tau$ is modulated by the input, making dynamics input-dependent.
Neural Circuit Policy
$$f_\theta(\mathbf{h}, \mathbf{x}) = \sigma\!\left(\mathbf{W}\,[\mathbf{h};\,\mathbf{x}] + \mathbf{b}\right)$$
$$\tau_{\text{eff}}(t) = \frac{\tau}{1 + \tau\,f_\theta(\mathbf{h}(t), \mathbf{x}(t))}$$
Closed-Form Continuous-Depth (CfC)
An analytical solution avoiding ODE solvers:
$$\mathbf{h}(t) = \left(\mathbf{h}_0 - f_\infty\right)\odot\exp\!\left(-\frac{t}{\tau_{\text{eff}}}\right) + f_\infty$$
Where $f_\infty = A\,\sigma(\mathbf{W}[\mathbf{h}_0;\,\mathbf{x}] + \mathbf{b})$ is the steady-state.
Properties
Liquid networks are remarkably compact (a control circuit of 19 neurons has steered a car in lane-keeping experiments) and comparatively interpretable thanks to their neuroscience-inspired wiring.
46
Mixture Density Network
Predicting full conditional probability distributions using a mixture of Gaussians.
Supervised
Multi-Modal Regression
1994 — Bishop
Output Parameterization
A neural network outputs the parameters of a Gaussian mixture model:
$$p(\mathbf{y}|\mathbf{x}) = \sum_{k=1}^{K}\pi_k(\mathbf{x})\,\mathcal{N}\!\left(\mathbf{y};\, \boldsymbol{\mu}_k(\mathbf{x}),\, \sigma_k^2(\mathbf{x})\mathbf{I}\right)$$
Network Outputs
$$\boldsymbol{\pi}(\mathbf{x}) = \text{softmax}(\mathbf{z}_\pi), \quad \sum_k\pi_k = 1$$
$$\boldsymbol{\mu}_k(\mathbf{x}) = \mathbf{z}_{\mu_k} \quad \text{(unconstrained)}$$
$$\sigma_k(\mathbf{x}) = \exp(\mathbf{z}_{\sigma_k}) \quad \text{(positive)}$$
Loss (Negative Log-Likelihood)
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \log\sum_{k=1}^K \pi_k(\mathbf{x}_i)\,\mathcal{N}(\mathbf{y}_i;\, \boldsymbol{\mu}_k(\mathbf{x}_i), \sigma_k^2(\mathbf{x}_i))$$
MDNs can model one-to-many mappings (e.g., inverse kinematics, handwriting generation) where a single input maps to multiple valid outputs.
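The negative log-likelihood above for scalar targets, with the output constraints (softmax for $\pi$, exp for $\sigma$) applied inside and a log-sum-exp over components for numerical stability. The toy batch and $K = 3$ are illustrative assumptions:

```python
import numpy as np

def mdn_nll(z_pi, z_mu, z_sigma, y):
    """NLL of scalar targets y under a K-component Gaussian mixture."""
    log_pi = z_pi - np.log(np.exp(z_pi).sum(axis=1, keepdims=True))  # log softmax
    sigma = np.exp(z_sigma)                                          # positivity
    # log N(y; mu_k, sigma_k^2) per component: -0.5 log 2pi - log sigma - (y-mu)^2/(2 sigma^2)
    log_norm = (-0.5 * np.log(2 * np.pi) - z_sigma
                - 0.5 * ((y[:, None] - z_mu) / sigma) ** 2)
    a = log_pi + log_norm
    m = a.max(axis=1, keepdims=True)                                 # log-sum-exp trick
    return -np.mean(m.ravel() + np.log(np.exp(a - m).sum(axis=1)))

rng = np.random.default_rng(0)
N, K = 10, 3
nll = mdn_nll(rng.normal(size=(N, K)),   # unconstrained mixture-weight logits
              rng.normal(size=(N, K)),   # component means
              np.zeros((N, K)),          # log sigma = 0  ->  sigma = 1
              rng.normal(size=N))        # targets
```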
47
WaveNet
Autoregressive generative model with dilated causal convolutions for raw audio synthesis.
Generative
Audio / Speech
2016 — van den Oord et al. (DeepMind)
Autoregressive Formulation
$$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t | x_1, \dots, x_{t-1})$$
Dilated Causal Convolutions
Stack convolutions with exponentially increasing dilation rates to grow the receptive field efficiently:
$$(f *_d x)_t = \sum_{k=0}^{K-1} f_k \cdot x_{t - d \cdot k}$$
$$\text{Dilations:}\quad d = 1, 2, 4, 8, \dots, 512 \quad \text{(repeated)}$$
$$\text{Receptive field} = \text{blocks} \times \sum_{l=0}^{L-1} 2^l \times (K-1) + 1$$
Gated Activation
$$\mathbf{z} = \tanh(\mathbf{W}_{f,k} * \mathbf{x}) \odot \sigma(\mathbf{W}_{g,k} * \mathbf{x})$$
Conditional WaveNet
$$\mathbf{z} = \tanh(\mathbf{W}_f * \mathbf{x} + \mathbf{V}_f * \mathbf{c}) \odot \sigma(\mathbf{W}_g * \mathbf{x} + \mathbf{V}_g * \mathbf{c})$$
Where $\mathbf{c}$ is a conditioning signal (e.g., mel spectrogram, speaker ID, linguistic features).
μ-Law Quantization
$$f(x_t) = \text{sign}(x_t)\frac{\ln(1 + \mu|x_t|)}{\ln(1+\mu)}, \quad \mu = 255$$
Compresses the 16-bit audio range into 256 values for categorical output via softmax.
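The companding equation and its inverse in NumPy, mapping $[-1, 1]$ audio to the 256 discrete classes WaveNet predicts with its softmax:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand [-1, 1] audio and quantize to integers in {0, ..., mu}."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # companding, still in [-1, 1]
    return ((y + 1) / 2 * mu + 0.5).astype(int)                # round to 256 levels

def mu_law_decode(q, mu=255):
    """Invert the quantized code back to an approximate waveform value."""
    y = 2 * (q / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1, 1, 101)
q = mu_law_encode(x)
x_hat = mu_law_decode(q)            # small round-trip error, largest near |x| = 1
```

The logarithmic spacing allocates most of the 256 levels to quiet samples, where human hearing is most sensitive.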
48
Large Language Model (LLM) Architecture
How modern LLMs like GPT-4, Claude, LLaMA, Gemini, and Mistral combine the neural network building blocks documented above into a single coherent system.
LLM Core
Self-Supervised + RLHF
Language / Multimodal
2017–present
Component Map — What LLMs Use
Every building block below is documented in detail in the sections above. An LLM is fundamentally a decoder-only Transformer composed of these pieces:
Transformer (Decoder-Only)
The core architecture. Stacked blocks with causal self-attention, preventing tokens from attending to future positions.
→ Section 13: Transformer
Scaled Dot-Product Attention
The fundamental operation: $\text{softmax}(\mathbf{QK}^\top/\sqrt{d_k})\mathbf{V}$. Every token attends to all previous tokens.
→ Section 12: Attention Mechanism
Multi-Head / GQA
Parallel attention heads capture different relationship types. GQA shares KV heads to reduce memory 4–8×.
→ Section 13: Multi-Head Attention
RoPE Positional Encoding
Rotary embeddings encode relative position directly into Q/K vectors. Used by LLaMA, Mistral, Claude, Gemma.
→ Section 13: RoPE
Feed-Forward Network (MLP)
Two-layer MLP at each position: project up 4×, apply SiLU/GELU, project back down. The "memory" of the model.
→ Section 2: MLP, Section 13: FFN
SiLU / GELU Activation
Smooth activations used inside transformer FFNs. SiLU (Swish) in LLaMA/Mistral; GELU in GPT/BERT.
→ Section 3: Activation Functions
RMSNorm / LayerNorm
Normalizes activations for training stability. Modern LLMs prefer RMSNorm (no mean subtraction, faster).
→ Section 6: Regularization
Residual Connections
Skip connections around every attention and FFN sublayer. Essential for training 100+ layer models.
→ Section 35: Residual Networks
AdamW Optimizer
Adam with decoupled weight decay. The standard optimizer for LLM pretraining.
→ Section 5: Optimizers
Backpropagation
Gradient computation through billions of parameters. Combined with gradient checkpointing for memory efficiency.
→ Section 4: Backpropagation
Dropout (Optional)
Used in GPT-2/3 training. Many modern LLMs (LLaMA, PaLM) omit dropout entirely, relying on data scale.
→ Section 6: Regularization
SSM / Mamba (Hybrid)
Some architectures (Jamba, Zamba) combine transformer blocks with Mamba SSM layers for linear-time long sequences.
→ Section 40: State Space Models
Tokenization
LLMs do not operate on raw characters. Text is first split into subword tokens using algorithms like BPE (Byte Pair Encoding), which iteratively merges the most frequent byte pairs:
$$\text{BPE: merge}(a, b) = ab \quad\text{where}\quad (a,b) = \arg\max_{(x,y)} \text{count}(xy)$$
$$|\mathcal{V}| \;\text{typically}\; 32{,}000 \;\text{to}\; 128{,}000 \;\text{tokens}$$
Each token is mapped to an integer ID, which is then looked up in the embedding table.
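One BPE merge iteration can be sketched directly from the argmax rule above. The toy corpus (words as tuples of symbols with frequencies) is an illustrative assumption; real tokenizers start from bytes and run thousands of merges:

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE iteration: find the most frequent adjacent pair, merge it everywhere."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq                      # count(xy) weighted by word frequency
    (a, b), _ = pairs.most_common(1)[0]                # argmax pair
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):                           # rewrite the word with ab fused
            if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged, (a, b)

corpus = {("l", "o", "w"): 5, ("s", "l", "o", "w"): 3, ("l", "o", "g"): 2}
corpus, pair = bpe_merge_step(corpus)   # ("l", "o") occurs 10 times -> merged to "lo"
```

Repeating this until the vocabulary reaches its budget yields the subword inventory.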
Token & Positional Embeddings
$$\mathbf{h}_0 = \mathbf{E}_{\text{tok}}[x_1, x_2, \dots, x_n] + \mathbf{E}_{\text{pos}}$$
$$\mathbf{E}_{\text{tok}} \in \mathbb{R}^{|\mathcal{V}| \times d}, \quad d \;\text{typically}\; 4096\;\text{to}\;16384$$
Modern LLMs typically use RoPE instead of learned positional embeddings, applied directly to Q and K vectors inside each attention layer rather than added to the input.
Weight Tying
Many LLMs share the token embedding matrix with the output projection (language model head):
$$\text{logits} = \mathbf{h}_L \cdot \mathbf{E}_{\text{tok}}^\top \in \mathbb{R}^{n \times |\mathcal{V}|}$$
The LLM Transformer Block (Full Equations)
A modern LLM (e.g. LLaMA-style) stacks $L$ identical blocks. Each block performs:
Step 1: Pre-Norm + Causal Multi-Head Attention + Residual
$$\mathbf{x}' = \text{RMSNorm}(\mathbf{h}^{(l)})$$
$$\mathbf{Q} = \mathbf{x}'\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{x}'\mathbf{W}_K, \quad \mathbf{V} = \mathbf{x}'\mathbf{W}_V$$
$$\mathbf{Q} = \text{RoPE}(\mathbf{Q}), \quad \mathbf{K} = \text{RoPE}(\mathbf{K})$$
$$\text{Attn} = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}} + \mathbf{M}_{\text{causal}}\right)\mathbf{V}$$
$$\mathbf{h}^{(l)}_{\text{mid}} = \mathbf{h}^{(l)} + \text{MultiHead}(\text{Attn})$$
Step 2: Pre-Norm + SwiGLU FFN + Residual
$$\mathbf{x}'' = \text{RMSNorm}(\mathbf{h}^{(l)}_{\text{mid}})$$
$$\text{SwiGLU}(\mathbf{x}'') = (\mathbf{x}''\mathbf{W}_1 \odot \text{SiLU}(\mathbf{x}''\mathbf{W}_{\text{gate}}))\,\mathbf{W}_2$$
$$\mathbf{h}^{(l+1)} = \mathbf{h}^{(l)}_{\text{mid}} + \text{SwiGLU}(\mathbf{x}'')$$
Where $\mathbf{W}_1, \mathbf{W}_{\text{gate}} \in \mathbb{R}^{d \times d_{\text{ff}}}$ and $\mathbf{W}_2 \in \mathbb{R}^{d_{\text{ff}} \times d}$, with $d_{\text{ff}} \approx \frac{8}{3}d$ (SwiGLU adjustment).
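The SwiGLU FFN in Step 2 is a one-liner in NumPy. The tiny dimensions and random weights are illustrative assumptions (real models use $d$ in the thousands):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 16, 40                                   # d_ff roughly (8/3) d, rounded

W1 = rng.normal(scale=0.1, size=(d, d_ff))         # up projection
W_gate = rng.normal(scale=0.1, size=(d, d_ff))     # gate projection
W2 = rng.normal(scale=0.1, size=(d_ff, d))         # down projection

def silu(x):
    return x / (1.0 + np.exp(-x))                  # SiLU(x) = x * sigmoid(x)

def swiglu_ffn(x):
    """(x W1 ⊙ SiLU(x W_gate)) W2, applied independently at each position."""
    return (x @ W1 * silu(x @ W_gate)) @ W2

x = rng.normal(size=(3, d))                        # 3 token positions
out = swiglu_ffn(x)                                # same shape as the input
```

The gate halves the effective FFN width, which is why $d_{\text{ff}}$ is shrunk from $4d$ to about $\frac{8}{3}d$ to keep the parameter count matched.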
Final Output
$$\mathbf{h}_{\text{final}} = \text{RMSNorm}(\mathbf{h}^{(L)})$$
$$P(x_{n+1} | x_1, \dots, x_n) = \text{softmax}(\mathbf{h}_{\text{final}}[n]\,\mathbf{W}_{\text{head}})$$
┌─────────────────────────────────────────────────────┐
│ FULL LLM FORWARD PASS │
├─────────────────────────────────────────────────────┤
│ │
│ Tokens: [x₁, x₂, ..., xₙ] │
│ │ │
│ ▼ │
│ ┌──────────┐ │
│ │ Embedding │ h₀ = E_tok[tokens] │
│ └────┬─────┘ │
│ │ │
│ ▼ ×L layers │
│ ╔═══════════════════════════════════╗ │
│ ║ ┌──────────┐ ║ │
│ ║ │ RMSNorm │ ║ │
│ ║ └────┬─────┘ ║ │
│ ║ ▼ ║ │
│ ║ ┌──────────────────────┐ ║ │
│ ║ │ Causal Multi-Head │ ║ │
│ ║ │ Attention + RoPE │◄─ KV Cache │
│ ║ │ (with GQA) │ ║ │
│ ║ └────┬─────────────────┘ ║ │
│ ║ │ + residual ║ │
│ ║ ▼ ║ │
│ ║ ┌──────────┐ ║ │
│ ║ │ RMSNorm │ ║ │
│ ║ └────┬─────┘ ║ │
│ ║ ▼ ║ │
│ ║ ┌──────────────────────┐ ║ │
│ ║ │ SwiGLU FFN │ ║ │
│ ║ │ (W₁ ⊙ SiLU(W_gate)) │ ║ │
│ ║ │ × W₂ │ ║ │
│ ║ └────┬─────────────────┘ ║ │
│ ║ │ + residual ║ │
│ ╚═══════╪═══════════════════════════╝ │
│ ▼ │
│ ┌──────────┐ │
│ │ RMSNorm │ (final) │
│ └────┬─────┘ │
│ ▼ │
│ ┌──────────┐ │
│ │ LM Head │ logits = h · W_head │
│ └────┬─────┘ │
│ ▼ │
│ ┌──────────┐ │
│ │ Softmax │ → P(next token) │
│ └──────────┘ │
└─────────────────────────────────────────────────────┘
KV Cache (Inference Optimization)
During autoregressive generation, previously computed key and value vectors are cached to avoid redundant computation:
$$\text{At step } t: \quad \mathbf{K}_{\text{cache}} = [\mathbf{k}_1, \mathbf{k}_2, \dots, \mathbf{k}_t], \quad \mathbf{V}_{\text{cache}} = [\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_t]$$
$$\text{Only compute:}\quad \mathbf{q}_t = \mathbf{x}_t\mathbf{W}_Q, \quad \mathbf{k}_t = \mathbf{x}_t\mathbf{W}_K, \quad \mathbf{v}_t = \mathbf{x}_t\mathbf{W}_V$$
$$\text{Attend:}\quad \mathbf{o}_t = \text{softmax}\!\left(\frac{\mathbf{q}_t \mathbf{K}_{\text{cache}}^\top}{\sqrt{d_k}}\right)\mathbf{V}_{\text{cache}}$$
KV Cache Memory
$$\text{Memory} = 2 \times L \times n_{\text{kv\_heads}} \times d_k \times n_{\text{seq}} \times \text{bytes per param}$$
For a 70B model with 8K context in FP16: ~2–4 GB of KV cache per sequence.
PagedAttention (vLLM)
Manages KV cache as virtual memory pages to eliminate fragmentation and enable efficient batching of variable-length sequences.
LLM Training Pipeline
Phase 1: Pre-Training (Next Token Prediction)
$$\mathcal{L}_{\text{pretrain}} = -\sum_{t=1}^{T} \log P_\theta(x_t | x_1, \dots, x_{t-1})$$
$$= -\sum_{t=1}^{T}\log\frac{\exp(\mathbf{h}_t^\top \mathbf{e}_{x_t})}{\sum_{v\in\mathcal{V}}\exp(\mathbf{h}_t^\top \mathbf{e}_v)}$$
Trained on trillions of tokens from web text, books, code, etc.
Phase 2: Supervised Fine-Tuning (SFT)
$$\mathcal{L}_{\text{SFT}} = -\sum_{t \in \text{response}} \log P_\theta(x_t | \text{prompt}, x_1, \dots, x_{t-1})$$
Only the response tokens contribute to the loss; prompt tokens are masked.
Phase 3: RLHF (Reinforcement Learning from Human Feedback)
Step 3a: Train a reward model $r_\phi$ on human preference pairs $(y_w \succ y_l)$:
$$\mathcal{L}_{\text{reward}} = -\mathbb{E}_{(x, y_w, y_l)}\!\left[\log\sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$
Step 3b: Optimize the policy with PPO, constrained by a KL penalty from the reference model $\pi_{\text{ref}}$:
$$\max_{\pi_\theta}\; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(y|x)}\!\left[r_\phi(x,y)\right] - \beta\,\text{KL}\!\left(\pi_\theta(y|x)\,\|\,\pi_{\text{ref}}(y|x)\right)$$
DPO (Direct Preference Optimization)
Bypasses the reward model entirely by reparameterizing the RLHF objective:
$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$
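Given summed log-probabilities of the chosen and rejected responses under the policy and reference models, the DPO loss is a few lines. The toy log-prob values and $\beta = 0.1$ are illustrative assumptions:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from per-response summed log-probs (w = chosen, l = rejected)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-margin))))   # -log sigmoid(margin)

# toy batch of 2 preference pairs where the policy slightly prefers the chosen response
loss = dpo_loss(np.array([-10.0, -8.0]), np.array([-12.0, -9.0]),
                np.array([-11.0, -8.5]), np.array([-11.0, -8.5]))
```

When the policy matches the reference exactly, the margin is zero and the loss equals $\log 2$; preferring the chosen responses pushes it below that.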
Scaling Laws
Kaplan Scaling (OpenAI, 2020)
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$
Loss follows power laws in model parameters $N$, dataset size $D$, and compute $C$.
Chinchilla Optimal (Hoffmann et al., 2022)
$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$
$$\text{Rule of thumb:}\quad D \approx 20 \times N$$
For a given compute budget, model size and data should be scaled equally — a 10B model needs ~200B tokens.
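Using the common $C \approx 6ND$ FLOPs estimate for dense transformer training (an assumption not stated above), the $D \approx 20N$ rule pins down both quantities from a compute budget:

```python
def chinchilla_allocation(compute_flops, ratio=20.0):
    """Split a FLOPs budget into params N and tokens D under the
    assumed C = 6*N*D training-cost model and D = ratio*N."""
    n_params = (compute_flops / (6.0 * ratio)) ** 0.5  # N = sqrt(C / (6*ratio))
    n_tokens = ratio * n_params                        # D = ratio * N
    return n_params, n_tokens

# the 10B-params / 200B-tokens example above corresponds to C ~ 1.2e22 FLOPs
n, d = chinchilla_allocation(1.2e22)
```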
LLM Parameter Count
$$N \approx 12\,L\,d^2 \quad\text{(for standard transformer with } d_{\text{ff}} = 4d\text{)}$$
| Model | Layers $L$ | Dim $d$ | Heads $h$ | Params |
| GPT-2 | 48 | 1600 | 25 | 1.5B |
| LLaMA-2 7B | 32 | 4096 | 32 | 6.7B |
| LLaMA-2 70B | 80 | 8192 | 64 | 70B |
| GPT-4 (est.) | 120 | ~12288 | 96 | ~1.8T (MoE) |
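A quick sanity check of the $12Ld^2$ estimate against the table (embeddings and attention variants such as GQA are ignored, so the numbers are rough):

```python
def approx_params(n_layers, d_model):
    # per layer: attention 4*d^2 (Q, K, V, output) + FFN 8*d^2 (with d_ff = 4d)
    return 12 * n_layers * d_model ** 2

llama2_7b = approx_params(32, 4096)   # ~6.4B, close to the listed 6.7B
gpt2_xl = approx_params(48, 1600)     # ~1.47B, close to the listed 1.5B
```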
Mixture of Experts (MoE)
Replace the dense FFN with a sparse set of expert FFNs, routing each token to the top-$k$ experts:
$$\mathbf{g}(\mathbf{x}) = \text{softmax}(\mathbf{W}_g\,\mathbf{x}) \in \mathbb{R}^{E}$$
$$\text{TopK}(\mathbf{g}, k): \quad \mathcal{S} = \{i : g_i \text{ is in top-}k\}$$
$$\text{MoE}(\mathbf{x}) = \sum_{i \in \mathcal{S}} \frac{g_i(\mathbf{x})}{\sum_{j\in\mathcal{S}} g_j(\mathbf{x})}\cdot \text{FFN}_i(\mathbf{x})$$
Load Balancing Loss
$$\mathcal{L}_{\text{balance}} = \alpha\,E \sum_{i=1}^{E} f_i \cdot P_i$$
$$f_i = \frac{\text{tokens routed to expert } i}{\text{total tokens}}, \quad P_i = \frac{1}{T}\sum_{t=1}^T g_i(\mathbf{x}_t)$$
Encourages equal load across experts. Mixtral 8×7B uses $E=8$ experts with $k=2$, giving 47B total params but only ~13B active per token.
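The routing equations above can be sketched for a single token; the expert FFNs are stood in for by arbitrary callables (names and shapes illustrative):

```python
import numpy as np

def moe_forward(x, W_g, experts, k=2):
    """Top-k MoE layer for one token x: softmax-gate, pick the
    top-k experts S, renormalize their gates, mix their outputs."""
    logits = W_g @ x
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()                      # g(x), softmax over E experts
    top = np.argsort(gate)[-k:]             # S: indices of the top-k experts
    weights = gate[top] / gate[top].sum()   # renormalize over S
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, E = 4, 8
W_g = rng.normal(size=(E, d))
experts = [(lambda W: (lambda t: W @ t))(rng.normal(size=(d, d))) for _ in range(E)]
x = rng.normal(size=d)
y = moe_forward(x, W_g, experts, k=2)
```

Because the selected gates are renormalized to sum to 1, a layer whose experts are all identical reduces exactly to a dense layer, regardless of routing.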
Sampling & Decoding Strategies
Temperature Scaling
$$P(x_t = v) = \frac{\exp(z_v / \tau)}{\sum_{v'}\exp(z_{v'} / \tau)}$$
$\tau \to 0$: approaches greedy decoding (argmax). $\tau = 1$: the unmodified softmax. $\tau > 1$: flatter distribution, more diverse but less reliable samples.
Top-$k$ Sampling
$$P'(v) = \begin{cases} P(v) / \sum_{v' \in V_k} P(v') & \text{if } v \in V_k \\ 0 & \text{otherwise} \end{cases}$$
Sampling is restricted to $V_k$, the $k$ highest-probability tokens, and renormalized.
Top-$p$ (Nucleus) Sampling
$$V_p = \min\left\{V' \subseteq \mathcal{V} : \sum_{v \in V'} P(v) \geq p\right\}$$
The smallest candidate set whose cumulative probability reaches $p$; sampling then renormalizes over $V_p$, so the cutoff adapts to how peaked the distribution is.
Min-$p$ Sampling
$$V_{\min p} = \{v : P(v) \geq p_{\min} \cdot \max_{v'}P(v')\}$$
Keeps only tokens at least $p_{\min}$ times as likely as the single most probable token.
Beam Search
$$\text{score}(\mathbf{y}_{1:t}) = \frac{1}{t^\alpha}\sum_{i=1}^t \log P(y_i | y_1, \dots, y_{i-1})$$
Maintains top-$B$ candidates at each step, with length normalization exponent $\alpha$.
Repetition Penalty
$$z'_v = \begin{cases} z_v / \theta & \text{if } v \in \text{generated tokens and } z_v > 0 \\ z_v \cdot \theta & \text{if } v \in \text{generated tokens and } z_v \leq 0 \end{cases}$$
With penalty $\theta > 1$, both cases push the logits of already-generated tokens down, discouraging repetition.
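The filters above compose naturally: scale by temperature, cut to top-$k$, then apply the nucleus cut and renormalize. A sketch of one decoding step (the `sample_filtered` helper is illustrative, not a library API):

```python
import numpy as np

def sample_filtered(logits, temperature=0.8, top_k=50, top_p=0.9, rng=None):
    """Apply temperature, then top-k, then nucleus (top-p) filtering
    to one step's logits, and sample a token id."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    keep = order[:top_k]                              # top-k cut
    cum = np.cumsum(probs[keep])
    keep = keep[: np.searchsorted(cum, top_p) + 1]    # smallest set with mass >= p
    p = probs[keep] / probs[keep].sum()               # renormalize survivors
    rng = rng or np.random.default_rng()
    return int(rng.choice(keep, p=p))
```

With a sharply peaked distribution both cuts collapse to the argmax, so the sampler degenerates to greedy decoding.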
LoRA & Parameter-Efficient Fine-Tuning
LoRA (Low-Rank Adaptation)
Freeze the pretrained weights and inject trainable low-rank decompositions:
$$\mathbf{W}' = \mathbf{W}_0 + \Delta\mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}$$
$$\mathbf{B} \in \mathbb{R}^{d \times r}, \quad \mathbf{A} \in \mathbb{R}^{r \times d}, \quad r \ll d$$
$$h = \mathbf{W}_0\mathbf{x} + \frac{\alpha}{r}\mathbf{B}\mathbf{A}\mathbf{x}$$
Typical $r = 8\text{–}64$, reducing trainable parameters by 1000× (e.g., 70B model → ~100M trainable params).
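The forward pass above is a one-liner once the shapes are fixed; a sketch using the standard initialization (B zero, A small random), so the adapter starts as a no-op:

```python
import numpy as np

def lora_forward(x, W0, A, B, alpha=16):
    """h = W0 x + (alpha/r) * B A x, with W0 frozen; only A and B train."""
    r = A.shape[0]
    return W0 @ x + (alpha / r) * (B @ (A @ x))

d, r = 64, 8
rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, d))       # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01 # A: small random init
B = np.zeros((d, r))               # B: zero init, so delta-W starts at 0
x = rng.normal(size=d)
```

Training only A and B means $2dr$ parameters per adapted matrix instead of $d^2$, which is where the large reduction in trainable parameters comes from.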
QLoRA (Quantized LoRA)
Combines LoRA with 4-bit quantized base weights using the NormalFloat4 (NF4) data type:
$$\mathbf{W}_{\text{NF4}} = \text{quantize}_{4\text{bit}}(\mathbf{W}_0)$$
$$h = \text{dequant}(\mathbf{W}_{\text{NF4}})\,\mathbf{x} + \frac{\alpha}{r}\mathbf{B}\mathbf{A}\mathbf{x}$$
This makes it feasible to fine-tune a 70B model on a single 48GB GPU.
Other PEFT Methods
| Method | Approach | Trainable Params |
| Prefix Tuning | Learnable "virtual tokens" prepended to keys/values | ~0.1% |
| Prompt Tuning | Learnable soft prompt embeddings | ~0.01% |
| Adapters | Small bottleneck layers inserted between transformer sublayers | ~1–3% |
| IA³ | Learned vectors that rescale keys, values, and FFN activations | ~0.01% |
Long Context Techniques
RoPE Frequency Scaling
$$\theta_i' = \theta_i \cdot s^{-1} = \frac{10000^{-2i/d}}{s} \quad\text{(linear scaling, factor } s \text{)}$$
Extending a 4K model to 32K context uses $s = 8$.
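The per-dimension frequencies and their scaled versions are easy to compute directly; a sketch (function name illustrative):

```python
import numpy as np

def rope_frequencies(d, scale=1.0, base=10000.0):
    """Per-pair RoPE frequencies theta_i = base^(-2i/d); linear
    position-interpolation scaling divides every frequency by `scale`."""
    i = np.arange(d // 2)
    return base ** (-2 * i / d) / scale

theta_4k = rope_frequencies(128)            # original model
theta_32k = rope_frequencies(128, scale=8)  # 4K -> 32K extension, s = 8
```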
YaRN (Yet another RoPE extensioN)
$$\theta_i' = \begin{cases} \theta_i & \text{if } \lambda_i < \lambda_{\min} \;\text{(high freq, no change)} \\ \theta_i / s & \text{if } \lambda_i > \lambda_{\max} \;\text{(low freq, full scale)} \\ (1-\gamma)\theta_i + \gamma\,\theta_i/s & \text{otherwise (interpolate)} \end{cases}$$
Here $\lambda_i = 2\pi/\theta_i$ is the wavelength of dimension $i$, and $\gamma \in [0,1]$ ramps smoothly between the unscaled and fully scaled regimes.
Flash Attention
IO-aware exact attention that avoids materializing the $n \times n$ attention matrix:
$$\text{Standard:}\quad O(n^2) \text{ memory}, \quad \text{Flash:}\quad O(n) \text{ memory}$$
$$\text{Compute:}\quad O(n^2 d) \;\text{(same)}, \quad\text{but}\;\sim 2\text{–}4\times \text{ faster in practice due to reduced HBM traffic}$$
Uses online softmax and tiling to keep intermediate results in SRAM, avoiding slow HBM reads/writes.
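The online softmax trick can be shown on a single query in isolation: process the scores in blocks, keeping only a running max $m$, normalizer $\ell$, and accumulator. A sketch (blocked over a 1-D score vector; the full kernel also tiles queries and fuses the matmuls):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, block=4):
    """One-pass blocked softmax-weighted sum of `values`, rescaling
    the accumulator whenever a new running max is found, as in
    Flash Attention's online softmax."""
    m, l = -np.inf, 0.0
    acc = np.zeros(values.shape[1])
    for start in range(0, len(scores), block):
        s = scores[start:start + block]
        v = values[start:start + block]
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)   # rescale old accumulator and normalizer
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ v
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
s = rng.normal(size=10)
V = rng.normal(size=(10, 3))
out = online_softmax_weighted_sum(s, V)
# reference: naive two-pass softmax over the full score vector
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
```

The blocked result is exact, not approximate, which is the point: Flash Attention changes memory traffic, not the computed output.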
Ring Attention
Distributes sequence across devices in a ring topology, overlapping communication with computation for near-infinite context:
$$\text{Effective context} = n_{\text{devices}} \times n_{\text{per\_device}}$$
Sliding Window Attention
$$\text{Attn}(i, j) = \begin{cases} \text{softmax}(\mathbf{q}_i\mathbf{k}_j^\top/\sqrt{d_k}) & \text{if } |i-j| \leq w \\ 0 & \text{otherwise} \end{cases}$$
Used in Mistral. With $L$ layers and window $w$, effective receptive field is $L \times w$ tokens.
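The formula above is symmetric in $|i - j|$; a decoder like Mistral additionally applies the causal constraint $j \le i$. A sketch of that causal variant of the mask:

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean attention mask: position i may attend to j iff
    j <= i (causal) and i - j <= w (window of size w)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j <= w)

mask = sliding_window_mask(6, 2)
# row 5 attends only to positions 3, 4, 5
```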
Neural Network Encyclopedia — 48 architectures — Generated March 2026
Covers: Perceptron, MLP, CNN, U-Net, RNN, LSTM, GRU, xLSTM, Bidirectional RNN, Attention, Transformer, BERT, Seq2Seq, ViT, RWKV, Autoencoder, VAE, GAN, Diffusion, Normalizing Flows, Energy-Based, Siamese/Contrastive (SimCLR, CLIP), JEPA, GNN, Capsule, Hopfield, Boltzmann, RBM, RBF, SOM, ResNet, Neural ODE, Echo State, Spiking NN, KAN, SSM/Mamba, Hypernetworks, Neural Cellular Automata, Neural Turing Machine, Bayesian NN, Liquid NN, Mixture Density, WaveNet + LLM Architecture Deep Dive
Glossary of Neural Network Terms
13 key technical terms used throughout this guide.
A
| Term | Definition |
| Activation Function | A non-linear function applied to neuron outputs (ReLU, sigmoid, tanh, GELU, SwiGLU). Without non-linearity, stacked layers would collapse to a single linear transformation. |
B
| Term | Definition |
| Backpropagation | The algorithm for computing gradients by applying the chain rule backwards through the computation graph. The fundamental mechanism for training neural networks. |
| Batch Normalization | Normalizing activations within a mini-batch to stabilize training. Originally motivated as reducing internal covariate shift. Standard in CNNs; largely replaced by LayerNorm in Transformers. |
C
| Term | Definition |
| Convolutional Neural Network (CNN) | A neural network using convolutional filters for spatial pattern recognition. Dominant in computer vision. Key components: convolution, pooling, fully connected layers. |
D
| Term | Definition |
| Dropout | A regularization technique that randomly sets neuron outputs to zero during training with probability p. Prevents overfitting by reducing co-adaptation between neurons. |
G
| Term | Definition |
| Gradient Descent | An optimization algorithm that iteratively updates parameters in the direction of steepest loss decrease. Variants: SGD, Adam, AdamW. Learning rate controls step size. |
L
| Term | Definition |
| Learning Rate | The step size for parameter updates during optimization. Too high causes divergence; too low causes slow convergence. Scheduling (warmup, cosine decay) is critical for training stability. |
| Loss Function | A function measuring the difference between model predictions and target values. Cross-entropy for classification, MSE for regression, KL divergence for distribution matching. |
O
| Term | Definition |
| Overfitting | When a model memorizes training data patterns including noise, performing well on training data but poorly on unseen data. Prevented by regularization, dropout, early stopping, and data augmentation. |
R
| Term | Definition |
| Recurrent Neural Network (RNN) | A network with feedback connections that processes sequential data by maintaining hidden state across time steps. Variants: LSTM, GRU. Largely replaced by Transformers. |
| Regularization | Techniques preventing overfitting: L1/L2 weight penalty, dropout, data augmentation, early stopping, weight decay. Controls model complexity. |
T
| Term | Definition |
| Tensor | A multi-dimensional array — the fundamental data structure in deep learning. Scalars (0D), vectors (1D), matrices (2D), and higher-order tensors are all processed on GPUs. |
| Transfer Learning | Using a model pre-trained on one task as a starting point for another. Fine-tuning a pre-trained LLM is transfer learning. Dramatically reduces training data and compute requirements. |