A - B
Activation Function
A mathematical function applied to the output of neurons that introduces non-linearity to neural networks, enabling them to learn complex patterns. Common functions include ReLU, GELU, and SwiGLU.
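As a quick illustration, here are two of these functions in pure Python (exact GELU via the error function; the function names are ours, not any library's API):

```python
import math

def relu(x):
    # ReLU: pass positive inputs through, zero out negatives
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Unlike ReLU, GELU is smooth everywhere and lets small negative inputs pass through slightly attenuated rather than being zeroed.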
Adapter
A small trainable module inserted into pretrained models that enables parameter-efficient fine-tuning. Adapters allow domain-specific adaptation without updating all model weights.
AdamW
An optimizer that combines Adam's adaptive learning rates with weight decay decoupled from the gradient update, applied directly to the weights rather than folded into the loss as in L2 regularization. It is the standard optimizer for training modern large language models.
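One AdamW step for a single scalar parameter, showing the decoupled decay term (a toy sketch; hyperparameter defaults follow common practice):

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update for a scalar parameter w given gradient g at step t."""
    m = b1 * m + (1 - b1) * g          # first-moment (mean) EMA of gradients
    v = b2 * v + (1 - b2) * g * g      # second-moment EMA of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: wd * w is added to the update directly,
    # not mixed into the gradient before the moment estimates
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```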
Alignment
The process of fine-tuning LLMs to behave according to human values and intentions through techniques like RLHF, SFT, and Constitutional AI, ensuring models are helpful, harmless, and honest.
ALiBi (Attention with Linear Biases)
A positional encoding method that adds linear biases to attention scores based on distance between tokens, enabling extrapolation to longer sequences without explicit position embeddings.
Attention
A mechanism that computes a weighted combination of values based on similarities between queries and keys. Self-attention allows tokens to attend to all positions, forming the core of transformers.
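A minimal single-head sketch in NumPy (no causal masking, multi-head splitting, or output projection; all names are illustrative):

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over token matrix x."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ v                               # weighted combination of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 tokens, model dim 8
W = [rng.normal(size=(8, 8)) for _ in range(3)]
out = self_attention(x, *W)
```

Each output row is a convex combination of the value vectors, with weights determined by how strongly that token's query matches every key.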
Autoregressive
A generation strategy where the model predicts the next token given all previous tokens. Used by most LLMs, it generates text one token at a time in a sequential manner.
AWQ (Activation-aware Weight Quantization)
A post-training quantization technique that identifies salient weight channels by the magnitude of their activations and rescales them before quantization to reduce their quantization error, improving quantized model quality.
Backpropagation
The algorithm for computing gradients of a loss function with respect to model parameters, enabling training through iterative weight updates. It forms the foundation of neural network optimization.
Batch Size
The number of samples processed together in a single training step. Larger batches improve hardware efficiency but may affect convergence; common values are 32, 64, or higher for distributed training.
BF16 (Bfloat16)
A 16-bit floating-point format that maintains the range of FP32 while reducing precision, enabling faster training with minimal accuracy loss. Commonly used in modern large-scale training.
BLEU (Bilingual Evaluation Understudy)
A metric for evaluating machine translation quality by comparing n-gram overlap with reference translations. While widely used, it has limitations for semantic evaluation.
BPE (Byte-Pair Encoding)
A tokenization algorithm that iteratively merges the most frequent adjacent byte or character pairs, building a compact subword vocabulary that can represent arbitrary text. Used by GPT and most modern LLMs.
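A toy version of the merge loop (word frequencies and function names are illustrative, not any library's API):

```python
from collections import Counter

def merge_pair(sym, pair):
    """Replace each adjacent occurrence of `pair` in `sym` with one merged symbol."""
    out, i = [], 0
    while i < len(sym):
        if i < len(sym) - 1 and (sym[i], sym[i + 1]) == pair:
            out.append(sym[i] + sym[i + 1]); i += 2
        else:
            out.append(sym[i]); i += 1
    return out

def bpe_merges(words, num_merges):
    """Learn `num_merges` merge rules from a word -> frequency dict."""
    vocab = {tuple(w): c for w, c in words.items()}   # word as symbol tuple
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, freq in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq                 # count adjacent pairs
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]             # most frequent pair
        merges.append(best)
        vocab = {tuple(merge_pair(s, best)): f for s, f in vocab.items()}
    return merges
```

On a toy corpus like `{"hug": 10, "pug": 5, "hugs": 5}`, the first learned merge is `("u", "g")`, after which `("h", "ug")` becomes the next most frequent pair.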
Beam Search
A decoding algorithm that keeps the k most likely partial hypotheses at each step, exploring multiple paths through the generation tree to find higher-quality outputs than greedy decoding.
C - D
Calibration
The process of adjusting quantization parameters (scale and zero-point) using a representative dataset to minimize accuracy loss when converting models to lower precision.
Causal Language Model (CLM)
A language model that predicts the next token based only on previous tokens, maintaining unidirectional attention flow. Used for autoregressive generation tasks like GPT models.
Chain-of-Thought (CoT)
A prompting technique that encourages LLMs to explain their reasoning step-by-step before providing final answers, often improving accuracy on complex reasoning tasks.
Chinchilla Scaling Laws
Research finding that model size and training tokens should be scaled in roughly equal proportion for compute-optimal training, rather than training ever-larger models on relatively few tokens.
CLIP (Contrastive Language-Image Pre-training)
A multimodal architecture that learns joint embeddings of images and text through contrastive learning, enabling zero-shot vision and language tasks.
Constitutional AI (CAI)
An alignment technique where models are given an explicit set of principles (a constitution) and trained to critique and revise their own outputs against it, reducing reliance on human harmlessness labels.
Context Window
The maximum number of tokens a model can process in a single forward pass. Modern LLMs range from 4K to 128K+ tokens, affecting the amount of input history available.
Contrastive Learning
A training objective that maximizes similarity between positive pairs while minimizing similarity with negatives, useful for learning meaningful representations without labeled data.
Cross-Attention
An attention mechanism where queries come from one sequence and keys/values from another, enabling interaction between different modalities or sequences (e.g., image and text).
Cross-Encoder
A model architecture that jointly encodes query and document pairs to predict relevance scores. More accurate but slower than bi-encoders, often used as a reranker in RAG systems.
Curriculum Learning
A training strategy where examples are presented in increasing order of difficulty, helping models learn foundational patterns before tackling complex tasks.
Data Parallelism
A distributed training strategy where each GPU processes different data samples independently while computing the same model, aggregating gradients across devices.
DeBERTa
A transformer variant that uses disentangled attention mechanisms to separate content and position information, improving performance on NLU tasks compared to BERT.
Decoder-Only
An architecture design where only the decoder portion of a transformer is used, commonly seen in autoregressive language models like GPT. Uses causal masking to prevent attending to future tokens.
DeepSpeed
A Microsoft optimization library providing techniques like ZeRO, gradient accumulation, and mixed precision training to enable efficient large-scale model training.
Differential Privacy
A formal framework for training models on sensitive data with guarantees that individual records cannot be easily identified, important for privacy-preserving machine learning.
Distillation
A training technique where a smaller student model learns from a larger teacher model's outputs, enabling knowledge transfer and compression for deployment.
DPO (Direct Preference Optimization)
An alignment method that directly optimizes language models to prefer chosen outputs over rejected ones without requiring reward models, simplifying RLHF training.
Dropout
A regularization technique that randomly masks activations during training to prevent co-adaptation, reducing overfitting while maintaining model capacity.
E - F
ECE (Expected Calibration Error)
A metric measuring the difference between predicted confidence and actual accuracy, indicating how well calibrated a model's probability estimates are.
Embedding
A learned dense vector representation that maps discrete tokens or entities into a continuous space where semantic similarity is preserved as distance.
Encoder-Decoder
An architecture with separate encoder and decoder components, where the encoder processes input sequences and the decoder generates outputs using encoded representations.
Encoder-Only
An architecture containing only the encoder portion without an autoregressive decoder, designed for understanding tasks like classification and NER rather than generation.
Epoch
One complete pass through the entire training dataset. Models are typically trained for multiple epochs to allow sufficient learning of patterns from the data.
Expert Parallelism
A scaling strategy for Mixture-of-Experts models where different experts are distributed across different devices, enabling efficient training of very large conditional models.
Faithfulness
The degree to which a model's generated explanations or outputs accurately represent its actual reasoning and decision-making process, crucial for interpretability.
FAISS (Facebook AI Similarity Search)
An efficient library for similarity search and vector indexing at scale, enabling fast retrieval of nearest neighbors in high-dimensional spaces for RAG systems.
Feed-Forward Network (FFN)
A subcomponent of transformer blocks consisting of two linear layers with non-linear activation in between, applied position-wise to each token independently.
Few-Shot Learning
The ability to perform tasks with only a few labeled examples, leveraging in-context learning where examples are provided in the prompt.
Fine-Tuning
The process of adapting a pretrained model to a specific task or domain by continuing training on task-specific data with updated weights.
Flash Attention
An exact attention implementation that tiles the computation into blocks to reduce memory I/O between GPU memory hierarchies, enabling faster training and inference while maintaining numerical stability.
FP16 (Float16)
A 16-bit floating-point format offering smaller memory footprint than FP32, though with reduced numerical precision. Used in mixed precision training.
FP32 (Float32)
Standard 32-bit floating-point precision with wide range and sufficient precision for most training tasks, though requiring more memory than lower precisions.
FSDP (Fully Sharded Data Parallel)
A distributed training technique that shards model parameters, gradients, and optimizer states across devices, enabling efficient training of very large models with reduced per-device memory.
G - H
GeLU (Gaussian Error Linear Unit)
A smooth activation function defined as x·Φ(x), where Φ is the cumulative distribution function of the standard normal distribution; widely used in modern transformers such as BERT and GPT.
Generative Pre-trained Transformer (GPT)
A decoder-only transformer architecture trained on large text corpora using causal language modeling. Foundation for GPT series (GPT-3, GPT-4) and many open-source models.
GGUF (GPT Generated Unified Format)
A model format designed for efficient quantized storage and inference, supporting multiple precision formats and enabling effective deployment on consumer hardware.
Gradient Accumulation
A technique where gradients are accumulated over multiple forward-backward passes before updating weights, enabling larger effective batch sizes with limited memory.
Gradient Checkpointing
A memory optimization technique that recomputes activations during backpropagation rather than storing them, reducing memory usage at the cost of additional computation.
Graph RAG
A RAG approach combining knowledge graphs with text retrieval, enabling structured reasoning over both entity relationships and textual content.
Grounding
The process of anchoring model outputs to factual information from external sources, typically through retrieval-augmented generation to reduce hallucinations.
GPTQ
A post-training quantization method that uses second-order (Hessian) information to quantize weights layer by layer, preserving accuracy at aggressive bit widths (3-4 bits) for efficient model compression.
Grouped-Query Attention (GQA)
An attention variant where query heads are divided into groups that each share a single key/value head, shrinking the KV cache and memory bandwidth while retaining quality close to Multi-Head Attention.
Hallucination
When LLMs generate plausible-sounding but factually incorrect or fabricated information, a major challenge mitigated through retrieval augmentation and alignment training.
Hidden State
The internal representation computed at each layer of a neural network for a given input, capturing extracted features that are progressively refined through deeper layers.
HITL (Human-in-the-Loop)
A system design where humans and AI collaborate iteratively, with humans reviewing and correcting AI outputs to improve model performance and safety.
Hugging Face
A major platform providing model hubs, transformers library, and datasets, becoming the de facto standard for sharing and using pretrained language models.
I - K
In-Context Learning
The ability of LLMs to perform tasks by conditioning on examples or instructions in the prompt without updating model weights, a key capability emerging in large models.
Instruction Tuning
Fine-tuning LLMs on datasets of instructions paired with desired outputs, enabling models to better follow explicit user instructions across diverse tasks.
INT4 / INT8
Integer quantization formats using 4-bit or 8-bit integers to represent weights, drastically reducing memory and computation for efficient inference while maintaining reasonable accuracy.
KL Divergence
A measure of how one probability distribution diverges from a reference distribution, commonly used in alignment training to prevent models from deviating too far from base models.
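For discrete distributions, KL(P || Q) is a one-liner (in nats; terms with zero probability under P contribute nothing):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Note the asymmetry: KL(P || Q) is generally not equal to KL(Q || P), which is why alignment objectives specify the direction of the penalty.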
Knowledge Distillation
A training method where a smaller student model learns to mimic a larger teacher model's outputs, enabling compression while retaining knowledge and performance.
Knowledge Graph
A structured representation of entities and their relationships, used in RAG systems to enable semantic reasoning and context-aware information retrieval.
KV Cache (Key-Value Cache)
A memory optimization during inference that caches previously computed key and value projections, enabling efficient autoregressive decoding without recomputation.
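A toy append-only cache illustrating the idea (real implementations preallocate or page memory rather than growing arrays):

```python
import numpy as np

class KVCache:
    """Append-only cache of per-token key/value projections (toy sketch)."""
    def __init__(self, head_dim):
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))

    def append(self, k, v):
        # Each decoding step adds one token's projections instead of
        # recomputing K and V for the entire prefix from scratch.
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

cache = KVCache(head_dim=4)
for _ in range(3):                          # three decoding steps
    cache.append(np.ones((1, 4)), np.zeros((1, 4)))
```

After t steps the cache holds t key rows and t value rows, so attention for the new token only needs one fresh query projection.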
L - M
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method adding trainable low-rank decompositions to model weights, reducing trainable parameters by orders of magnitude while maintaining quality.
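A NumPy sketch of the forward pass, assuming the common initialization where B starts at zero so training begins from the pretrained behavior (dimensions and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                       # model dim and low rank (r << d)
W = rng.normal(size=(d, d))        # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01 # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, zero at init

def lora_forward(x, alpha=8):
    # Effective weight is W + (alpha / r) * B @ A; only A and B are trained,
    # so trainable parameters drop from d*d to 2*r*d.
    return x @ (W + (alpha / r) * B @ A).T

x = rng.normal(size=(2, d))
out = lora_forward(x)
```

With B zeroed, the adapted model initially matches the frozen base model exactly; gradients then flow only into A and B.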
Latency
The time required to generate outputs, critical for user-facing applications. Lower latency improves user experience but may require model compression or architectural innovations.
LayerNorm
A normalization technique that normalizes activations across feature dimensions within a layer, stabilizing training and improving convergence in deep networks.
Leiden Algorithm
A community-detection (graph clustering) algorithm, used in Graph RAG systems such as Microsoft's GraphRAG to partition knowledge graphs into communities for hierarchical summarization and retrieval.
LLMLingua
A technique for intelligently compressing prompts while preserving information density, reducing context length and inference costs in RAG systems.
Logits
The raw unnormalized output scores produced by a model before softmax, representing the model's preferences for each possible token before converting to probabilities.
Mamba (SSM)
A state-space model architecture with linear-time complexity in sequence length, positioned as an alternative to transformers with better efficiency on long sequences.
Masked Language Model (MLM)
A training objective where random tokens are masked and the model predicts them from context, used in BERT and encoder-only models for learning bidirectional representations.
Megatron-LM
An NVIDIA library providing efficient implementations of tensor parallelism, pipeline parallelism, and other distributed training techniques for large-scale model training.
MMLU (Massive Multitask Language Understanding)
A comprehensive benchmark spanning 57 diverse knowledge domains, widely used to evaluate general knowledge and reasoning capabilities of language models.
Mixture of Experts (MoE)
An architecture where different expert networks specialize on different input types, selected by a gating network, enabling scaling without proportional increase in computation.
Multi-Head Attention (MHA)
Attention mechanism using multiple parallel attention heads, each learning different attention patterns; head outputs are concatenated and projected back to the model dimension.
Multi-Query Attention (MQA)
An attention variant where all query heads share a single key and value head, reducing memory and computation during inference while maintaining near-original quality.
MTEB (Massive Text Embedding Benchmark)
A comprehensive benchmark evaluating text embedding models across diverse tasks (classification, clustering, retrieval), enabling standardized comparison of embedding quality.
N - P
N:M Sparsity
A structured pruning pattern keeping at most N non-zero weights in every group of M consecutive weights (e.g., 2:4 sparsity), compatible with specialized hardware such as NVIDIA's sparse tensor cores for efficient sparse inference.
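Enforcing a 2:4 pattern by keeping the two largest-magnitude weights per group of four (an illustrative NumPy sketch, not a hardware kernel; assumes the weight count divides evenly by M):

```python
import numpy as np

def apply_nm_sparsity(w, n=2, m=4):
    """Keep the n largest-magnitude weights in each group of m consecutive weights."""
    flat = w.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude weights in each group
    idx = np.argsort(np.abs(flat), axis=1)[:, :-n]
    out = flat.copy()
    np.put_along_axis(out, idx, 0.0, axis=1)   # zero out the pruned positions
    return out.reshape(w.shape)

w = np.arange(1.0, 9.0).reshape(2, 4)          # [[1,2,3,4],[5,6,7,8]]
sparse = apply_nm_sparsity(w)
```

Every group of four retains exactly two non-zeros, which is the structural guarantee hardware sparse kernels rely on.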
Natural Language Inference (NLI)
A task involving determining whether given premises entail, contradict, or are neutral to hypotheses, testing logical reasoning in language understanding.
Next-Token Prediction
The core training objective of causal language models predicting the next token given previous tokens, the foundation for autoregressive generation.
Normalization
A class of techniques (layer norm, batch norm, group norm) that standardize activations, stabilizing training and improving convergence in deep networks.
ONNX (Open Neural Network Exchange)
An open standard format for representing machine learning models, enabling model portability and optimization across different frameworks and hardware.
Optimizer
An algorithm for updating model parameters based on gradients. Common optimizers (SGD, Adam, AdamW) balance convergence speed, generalization, and computational efficiency.
PagedAttention
A memory management technique for KV caches treating them as pages, enabling efficient sharing and dynamic memory allocation during batch inference.
Parameter-Efficient Fine-Tuning (PEFT)
A class of techniques (LoRA, adapters, prefix tuning) enabling effective fine-tuning with significantly fewer trainable parameters than full fine-tuning.
Perplexity
A metric measuring how surprised a language model is by test data, computed as the exponentiated cross-entropy loss. Lower perplexity indicates better modeling of text.
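Given the model's probability for each target token, perplexity is the exponentiated average negative log-likelihood:

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probabilities of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A model that assigns uniform probability over 4 choices at every step has perplexity 4, matching the intuition of "effective branching factor."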
Pipeline Parallelism
A distributed training strategy splitting model layers across devices with micro-batches, enabling training of very large models by overlapping computation and communication.
Post-Training Quantization (PTQ)
Applying quantization after model training completes using calibration datasets, more convenient than QAT but potentially with lower quality due to lack of training adaptation.
PPO (Proximal Policy Optimization)
A reinforcement learning algorithm widely used in alignment training (RLHF) for optimizing language models against reward models, balancing exploration and exploitation.
Prefix Tuning
A parameter-efficient method prepending trainable prefix tokens to inputs, enabling task adaptation without modifying model weights, useful for multi-task scenarios.
Prompt Tuning
A method learning task-specific continuous prompts through gradient descent rather than hand-crafting discrete text prompts, enabling efficient task-specific adaptation.
Pruning
Removing less important weights or neurons from trained models to reduce size and computation while maintaining performance, enabling efficient deployment.
Q - R
QAT (Quantization-Aware Training)
Including quantization simulation during training so models learn to work effectively at lower precisions, generally achieving better quality than post-training quantization.
QLoRA
Combines 4-bit quantization of the frozen base model with LoRA adapters for parameter-efficient fine-tuning, enabling fine-tuning of 65B-parameter models on a single 48GB GPU.
Quantization
Converting high-precision weights and activations to lower precision (int8, int4, etc.), reducing memory and computation while maintaining acceptable accuracy.
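A minimal symmetric per-tensor int8 sketch (real schemes add per-channel scales, zero points, and calibration data):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: one scale for the whole tensor."""
    scale = np.abs(w).max() / 127.0            # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale        # approximate reconstruction

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
```

The round trip is lossy: each value is reconstructed only to within half a quantization step, which is the accuracy cost traded for 4x smaller storage than FP32.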
RAG (Retrieval-Augmented Generation)
Augmenting language models with external knowledge by retrieving relevant documents at generation time, reducing hallucinations and enabling knowledge updates without retraining.
RAGAS (Retrieval-Augmented Generation Assessment)
A framework for evaluating RAG system quality across multiple dimensions (faithfulness, relevance, coherence), enabling systematic comparison of RAG implementations.
Recall@K
A retrieval metric measuring what fraction of relevant documents appear in the top K results, essential for evaluating retrieval component quality in RAG systems.
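The metric itself is simple to state in code:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents appearing in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)
```

For example, retrieving `["d1", "d3", "d9"]` against relevant set `["d1", "d2"]` gives Recall@2 of 0.5, since only one of the two relevant documents appears in the top two results.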
RECOMP
A compression method for in-context learning that distills long documents into short summaries while preserving task-relevant information for more efficient processing.
Red-Teaming
Systematic adversarial testing where teams actively try to break or misuse AI systems to identify vulnerabilities and failure modes before deployment.
Reranker
A cross-encoder model that re-ranks initially retrieved candidates by jointly encoding queries and documents, improving precision in multi-stage retrieval pipelines.
Residual Connection
Shortcuts in neural networks enabling gradients to flow directly through layers via skip connections (x + f(x)), crucial for training deep networks.
Reward Model
A learned model estimating preference scores for different outputs, used in RLHF to guide language model training toward human preferences without per-example annotation.
RLHF (Reinforcement Learning from Human Feedback)
An alignment technique using human feedback to train reward models, then optimizing language models via RL to maximize rewards, aligning outputs with human values.
RMSNorm
A simplified layer normalization that scales activations by their root mean square without subtracting the mean, making it computationally cheaper than LayerNorm; preferred in modern models such as LLaMA.
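The operation in a few lines of NumPy (the eps default is a typical choice; gamma is the learned per-feature scale):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: scale by root mean square; no mean subtraction, unlike LayerNorm."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([[3.0, 4.0]])
out = rms_norm(x, gamma=np.ones(2))
```

With gamma set to ones, the output vector always has unit root mean square, regardless of the input's scale.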
RoPE (Rotary Position Embedding)
A position encoding method that rotates query and key vectors by angles proportional to token position, encoding relative positions directly in attention scores; extensions such as position interpolation allow contexts longer than those seen in training.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Metrics comparing n-gram overlap between generated and reference summaries, widely used in evaluating abstractive summarization despite known limitations.
S - T
Sampling (Temperature, Top-K, Top-P)
Decoding strategies where temperature rescales logits to control randomness, top-K samples only from the K highest-probability tokens, and top-P (nucleus sampling) samples from the smallest set of tokens whose cumulative probability exceeds P.
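All three strategies composed in one illustrative NumPy function (the function name and defaults are ours, not any library's API):

```python
import numpy as np

def sample_filtered(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample one token id after temperature, top-k, and top-p filtering."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax over the vocabulary
    order = np.argsort(probs)[::-1]               # tokens by descending probability
    keep = np.ones_like(probs, dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False               # drop all but the k most likely
    if top_p is not None:
        cum = np.cumsum(probs[order])
        # Drop every token ranked after the cumulative mass first reaches top_p
        keep[order[1:][cum[:-1] >= top_p]] = False
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()                          # renormalize the surviving mass
    rng = rng or np.random.default_rng(0)
    return rng.choice(len(probs), p=probs)
```

With `top_k=1` this degenerates to greedy decoding; raising the temperature flattens the distribution before any truncation is applied.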
Scaling Laws
Empirical relationships showing how model performance improves with increases in model size, dataset size, and compute, predicting optimal resource allocation for training.
Self-Attention
Attention mechanism where a sequence attends to itself, with queries, keys, and values all derived from the same input, enabling modeling of dependencies within sequences.
Self-Consistency
A decoding method sampling multiple diverse outputs and selecting the most consistent answer, improving reasoning accuracy by leveraging ensemble effects.
Self-Distillation
Using a model as both teacher and student to improve quality, where the model learns from its own ensemble or better-sampled outputs, enabling self-improvement.
SentencePiece
A language-agnostic tokenization library implementing BPE and unigram language models, widely used for training tokenizers in modern multilingual models.
Sequence Parallelism
Distributing sequence length across devices rather than just batch dimension, enabling training with longer sequences while reducing per-device memory requirements.
SFT (Supervised Fine-Tuning)
Fine-tuning on high-quality demonstration data where inputs are paired with desired outputs, enabling models to follow instructions and improve task performance.
Sliding Window Attention
An attention pattern where each token only attends to a fixed window of surrounding tokens, reducing computation from quadratic to linear while maintaining local context.
Softmax
A function that converts logits into a probability distribution by exponentiating and normalizing; in attention, it turns raw scores into weights that sum to 1.
SparseGPT
A one-shot pruning method removing weights based on Hessian-weighted importance scores, enabling aggressive sparsity without retraining while maintaining performance.
Speculative Decoding
An inference acceleration technique where a smaller draft model generates candidates and a larger model validates them in parallel, reducing latency without quality loss.
SwiGLU
An activation function combining Swish activation with gating mechanisms in feedforward networks, improving training dynamics and convergence in transformers.
Temperature
A hyperparameter controlling output randomness during decoding. Higher temperatures increase diversity, lower values make sampling more deterministic and greedy.
Tensor Parallelism
Splitting large tensors (weight matrices) across multiple devices and computing in parallel, enabling training of models that exceed single-device memory.
TensorRT-LLM
NVIDIA's inference optimization library providing highly optimized kernels and techniques (quantization, paging, batching) for efficient LLM deployment.
Token
The smallest unit a model processes, typically a subword unit produced by tokenization. Models process token sequences rather than raw text.
Tokenizer
An algorithm converting raw text into integer token sequences using vocabularies built through methods like BPE. Quality tokenizers are essential for model performance.
Top-K / Top-P
Sampling methods restricting the decoding to either the K most likely tokens or the smallest set summing to probability P, balancing quality and diversity in generation.
Transformer
An architecture based on self-attention mechanisms rather than recurrence, achieving superior performance on NLP tasks and forming the foundation of modern language models.
Triton Inference Server
An open-source inference platform supporting multiple backends and frameworks, enabling efficient model serving with features like dynamic batching and model versioning.
TRL (Transformer Reinforcement Learning)
Hugging Face's library for training language models with reinforcement learning, providing convenient implementations of PPO, DPO, and other alignment algorithms.
U - Z
Vector Database
Specialized databases optimized for storing and querying high-dimensional embeddings, enabling efficient similarity search essential for RAG retrieval components.
vLLM
A high-performance inference library built around PagedAttention and continuous batching, delivering order-of-magnitude throughput gains for LLM serving over naive implementations.
Vision Transformer (ViT)
Applying transformer architecture to image patches for vision tasks, demonstrating that transformers aren't language-specific and can effectively process visual information.
Wanda (Pruning by Weights and Activations)
A pruning method that scores each weight by the product of its magnitude and the norm of its input activations, removing low-scoring weights; it reaches roughly 50% sparsity with minimal accuracy loss and no retraining.
Weight Decay
An L2 regularization technique adding penalties proportional to weight magnitudes to the loss function, preventing overfitting and improving generalization.
WordPiece
A tokenization algorithm used by BERT that greedily merges frequently co-occurring character sequences, balancing vocabulary coverage with compression.
ZeRO (Zero Redundancy Optimizer)
DeepSpeed's technique partitioning optimizer states, gradients, and parameters across devices, enabling massive model training with memory-efficient distributed strategies.