A - B
Activation Function
A mathematical function applied to the output of neurons that introduces non-linearity to neural networks, enabling them to learn complex patterns. Common functions include ReLU, GELU, and SwiGLU.
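As a quick illustration, here are two of these functions in pure Python (exact GELU via the error function; the function names are ours, not any library's API):

```python
import math

def relu(x):
    # ReLU: pass positive inputs through, zero out negatives
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Unlike ReLU, GELU is smooth everywhere and lets small negative inputs pass through slightly attenuated rather than being zeroed.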
Adapter
A small trainable module inserted into pretrained models that enables parameter-efficient fine-tuning. Adapters allow domain-specific adaptation without updating all model weights.
AdamW
An optimizer that combines Adam's adaptive learning rates with weight decay decoupled from the gradient update, applied directly to the weights rather than folded into the loss as in L2 regularization. It is the standard optimizer for training modern large language models.
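One AdamW step for a single scalar parameter, showing the decoupled decay term (a toy sketch; hyperparameter defaults follow common practice):

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update for a scalar parameter w given gradient g at step t."""
    m = b1 * m + (1 - b1) * g          # first-moment (mean) EMA of gradients
    v = b2 * v + (1 - b2) * g * g      # second-moment EMA of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: wd * w is added to the update directly,
    # not mixed into the gradient before the moment estimates
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```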
Alignment
The process of fine-tuning LLMs to behave according to human values and intentions through techniques like RLHF, SFT, and Constitutional AI, ensuring models are helpful, harmless, and honest.
ALiBi (Attention with Linear Biases)
A positional encoding method that adds linear biases to attention scores based on distance between tokens, enabling extrapolation to longer sequences without explicit position embeddings.
Attention
A mechanism that computes a weighted combination of values based on similarities between queries and keys. Self-attention allows tokens to attend to all positions, forming the core of transformers.
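A minimal single-head sketch in NumPy (no causal masking, multi-head splitting, or output projection; all names are illustrative):

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over token matrix x."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ v                               # weighted combination of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 tokens, model dim 8
W = [rng.normal(size=(8, 8)) for _ in range(3)]
out = self_attention(x, *W)
```

Each output row is a convex combination of the value vectors, with weights determined by how strongly that token's query matches every key.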
Autoregressive
A generation strategy where the model predicts the next token given all previous tokens. Used by most LLMs, it generates text one token at a time in a sequential manner.
AWQ (Activation-aware Weight Quantization)
A post-training quantization technique that identifies salient weight channels by the magnitude of their activations and rescales them before quantization to reduce their quantization error, improving quantized model quality.
Backpropagation
The algorithm for computing gradients of a loss function with respect to model parameters, enabling training through iterative weight updates. It forms the foundation of neural network optimization.
Batch Size
The number of samples processed together in a single training step. Larger batches improve hardware efficiency but may affect convergence; common values are 32, 64, or higher for distributed training.
BF16 (Bfloat16)
A 16-bit floating-point format that maintains the range of FP32 while reducing precision, enabling faster training with minimal accuracy loss. Commonly used in modern large-scale training.
BLEU (Bilingual Evaluation Understudy)
A metric for evaluating machine translation quality by comparing n-gram overlap with reference translations. While widely used, it has limitations for semantic evaluation.
BPE (Byte-Pair Encoding)
A tokenization algorithm that iteratively merges the most frequent adjacent byte or character pairs, building a compact subword vocabulary that can represent arbitrary text. Used by GPT and most modern LLMs.
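A toy version of the merge loop (word frequencies and function names are illustrative, not any library's API):

```python
from collections import Counter

def merge_pair(sym, pair):
    """Replace each adjacent occurrence of `pair` in `sym` with one merged symbol."""
    out, i = [], 0
    while i < len(sym):
        if i < len(sym) - 1 and (sym[i], sym[i + 1]) == pair:
            out.append(sym[i] + sym[i + 1]); i += 2
        else:
            out.append(sym[i]); i += 1
    return out

def bpe_merges(words, num_merges):
    """Learn `num_merges` merge rules from a word -> frequency dict."""
    vocab = {tuple(w): c for w, c in words.items()}   # word as symbol tuple
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, freq in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq                 # count adjacent pairs
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]             # most frequent pair
        merges.append(best)
        vocab = {tuple(merge_pair(s, best)): f for s, f in vocab.items()}
    return merges
```

On a toy corpus like `{"hug": 10, "pug": 5, "hugs": 5}`, the first learned merge is `("u", "g")`, after which `("h", "ug")` becomes the next most frequent pair.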
Beam Search
A decoding algorithm that keeps the k most likely partial hypotheses at each step, exploring multiple paths through the generation tree to find higher-quality outputs than greedy decoding.
C - D
Calibration
The process of adjusting quantization parameters (scale and zero-point) using a representative dataset to minimize accuracy loss when converting models to lower precision.
Causal Language Model (CLM)
A language model that predicts the next token based only on previous tokens, maintaining unidirectional attention flow. Used for autoregressive generation tasks like GPT models.
Chain-of-Thought (CoT)
A prompting technique that encourages LLMs to explain their reasoning step-by-step before providing final answers, often improving accuracy on complex reasoning tasks.
Chinchilla Scaling Laws
Research finding that model size and training tokens should be scaled in roughly equal proportion for compute-optimal training, rather than training ever-larger models on relatively few tokens.
CLIP (Contrastive Language-Image Pre-training)
A multimodal architecture that learns joint embeddings of images and text through contrastive learning, enabling zero-shot vision and language tasks.
Constitutional AI (CAI)
An alignment technique where models are given an explicit set of principles (a constitution) and trained to critique and revise their own outputs against it, reducing reliance on human harmlessness labels.
Context Window
The maximum number of tokens a model can process in a single forward pass. Modern LLMs range from 4K to 128K+ tokens, affecting the amount of input history available.
Contrastive Learning
A training objective that maximizes similarity between positive pairs while minimizing similarity with negatives, useful for learning meaningful representations without labeled data.
Cross-Attention
An attention mechanism where queries come from one sequence and keys/values from another, enabling interaction between different modalities or sequences (e.g., image and text).
Cross-Encoder
A model architecture that jointly encodes query and document pairs to predict relevance scores. More accurate but slower than bi-encoders, often used as a reranker in RAG systems.
Curriculum Learning
A training strategy where examples are presented in increasing order of difficulty, helping models learn foundational patterns before tackling complex tasks.
Data Parallelism
A distributed training strategy where each GPU processes different data samples independently while computing the same model, aggregating gradients across devices.
DeBERTa
A transformer variant that uses disentangled attention mechanisms to separate content and position information, improving performance on NLU tasks compared to BERT.
Decoder-Only
An architecture design where only the decoder portion of a transformer is used, commonly seen in autoregressive language models like GPT. Uses causal masking to prevent attending to future tokens.
DeepSpeed
A Microsoft optimization library providing techniques like ZeRO, gradient accumulation, and mixed precision training to enable efficient large-scale model training.
Differential Privacy
A formal framework for training models on sensitive data with guarantees that individual records cannot be easily identified, important for privacy-preserving machine learning.
Distillation
A training technique where a smaller student model learns from a larger teacher model's outputs, enabling knowledge transfer and compression for deployment.
DPO (Direct Preference Optimization)
An alignment method that directly optimizes language models to prefer chosen outputs over rejected ones without requiring reward models, simplifying RLHF training.
Dropout
A regularization technique that randomly masks activations during training to prevent co-adaptation, reducing overfitting while maintaining model capacity.
E - F
ECE (Expected Calibration Error)
A metric measuring the difference between predicted confidence and actual accuracy, indicating how well calibrated a model's probability estimates are.
Embedding
A learned dense vector representation that maps discrete tokens or entities into a continuous space where semantic similarity is preserved as distance.
Encoder-Decoder
An architecture with separate encoder and decoder components, where the encoder processes input sequences and the decoder generates outputs using encoded representations.
Encoder-Only
An architecture containing only the encoder portion without an autoregressive decoder, designed for understanding tasks like classification and NER rather than generation.
Epoch
One complete pass through the entire training dataset. Models are typically trained for multiple epochs to allow sufficient learning of patterns from the data.
Expert Parallelism
A scaling strategy for Mixture-of-Experts models where different experts are distributed across different devices, enabling efficient training of very large conditional models.
Faithfulness
The degree to which a model's generated explanations or outputs accurately represent its actual reasoning and decision-making process, crucial for interpretability.
FAISS (Facebook AI Similarity Search)
An efficient library for similarity search and vector indexing at scale, enabling fast retrieval of nearest neighbors in high-dimensional spaces for RAG systems.
Feed-Forward Network (FFN)
A subcomponent of transformer blocks consisting of two linear layers with non-linear activation in between, applied position-wise to each token independently.
Few-Shot Learning
The ability to perform tasks with only a few labeled examples, leveraging in-context learning where examples are provided in the prompt.
Fine-Tuning
The process of adapting a pretrained model to a specific task or domain by continuing training on task-specific data with updated weights.
Flash Attention
An exact attention implementation that tiles the computation into blocks to reduce memory I/O between GPU memory hierarchies, enabling faster training and inference while maintaining numerical stability.
FP16 (Float16)
A 16-bit floating-point format offering smaller memory footprint than FP32, though with reduced numerical precision. Used in mixed precision training.
FP32 (Float32)
Standard 32-bit floating-point precision with wide range and sufficient precision for most training tasks, though requiring more memory than lower precisions.
FSDP (Fully Sharded Data Parallel)
A distributed training technique that shards model parameters, gradients, and optimizer states across devices, enabling efficient training of very large models with reduced per-device memory.
G - H
GeLU (Gaussian Error Linear Unit)
A smooth activation function defined as x·Φ(x), where Φ is the cumulative distribution function of the standard normal distribution; widely used in modern transformers such as BERT and GPT.
Generative Pre-trained Transformer (GPT)
A decoder-only transformer architecture trained on large text corpora using causal language modeling. Foundation for GPT series (GPT-3, GPT-4) and many open-source models.
GGUF (GPT Generated Unified Format)
A model format designed for efficient quantized storage and inference, supporting multiple precision formats and enabling effective deployment on consumer hardware.
Gradient Accumulation
A technique where gradients are accumulated over multiple forward-backward passes before updating weights, enabling larger effective batch sizes with limited memory.
Gradient Checkpointing
A memory optimization technique that recomputes activations during backpropagation rather than storing them, reducing memory usage at the cost of additional computation.
Graph RAG
A RAG approach combining knowledge graphs with text retrieval, enabling structured reasoning over both entity relationships and textual content.
Grounding
The process of anchoring model outputs to factual information from external sources, typically through retrieval-augmented generation to reduce hallucinations.
GPTQ
A post-training quantization method that uses second-order (Hessian) information to quantize weights layer by layer, preserving accuracy at aggressive bit widths (3-4 bits) for efficient model compression.
Grouped-Query Attention (GQA)
An attention variant where query heads are divided into groups that each share a single key/value head, shrinking the KV cache and memory bandwidth while retaining quality close to Multi-Head Attention.
Hallucination
When LLMs generate plausible-sounding but factually incorrect or fabricated information, a major challenge mitigated through retrieval augmentation and alignment training.
Hidden State
The internal representation computed at each layer of a neural network for a given input, capturing extracted features that are progressively refined through deeper layers.
HITL (Human-in-the-Loop)
A system design where humans and AI collaborate iteratively, with humans reviewing and correcting AI outputs to improve model performance and safety.
Hugging Face
A major platform providing model hubs, transformers library, and datasets, becoming the de facto standard for sharing and using pretrained language models.
I - K
In-Context Learning
The ability of LLMs to perform tasks by conditioning on examples or instructions in the prompt without updating model weights, a key capability emerging in large models.
Instruction Tuning
Fine-tuning LLMs on datasets of instructions paired with desired outputs, enabling models to better follow explicit user instructions across diverse tasks.
INT4 / INT8
Integer quantization formats using 4-bit or 8-bit integers to represent weights, drastically reducing memory and computation for efficient inference while maintaining reasonable accuracy.
KL Divergence
A measure of how one probability distribution diverges from a reference distribution, commonly used in alignment training to prevent models from deviating too far from base models.
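For discrete distributions, KL(P || Q) is a one-liner (in nats; terms with zero probability under P contribute nothing):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Note the asymmetry: KL(P || Q) is generally not equal to KL(Q || P), which is why alignment objectives specify the direction of the penalty.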
Knowledge Distillation
A training method where a smaller student model learns to mimic a larger teacher model's outputs, enabling compression while retaining knowledge and performance.
Knowledge Graph
A structured representation of entities and their relationships, used in RAG systems to enable semantic reasoning and context-aware information retrieval.
KV Cache (Key-Value Cache)
A memory optimization during inference that caches previously computed key and value projections, enabling efficient autoregressive decoding without recomputation.
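A toy append-only cache illustrating the idea (real implementations preallocate or page memory rather than growing arrays):

```python
import numpy as np

class KVCache:
    """Append-only cache of per-token key/value projections (toy sketch)."""
    def __init__(self, head_dim):
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))

    def append(self, k, v):
        # Each decoding step adds one token's projections instead of
        # recomputing K and V for the entire prefix from scratch.
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

cache = KVCache(head_dim=4)
for _ in range(3):                          # three decoding steps
    cache.append(np.ones((1, 4)), np.zeros((1, 4)))
```

After t steps the cache holds t key rows and t value rows, so attention for the new token only needs one fresh query projection.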
L - M
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method adding trainable low-rank decompositions to model weights, reducing trainable parameters by orders of magnitude while maintaining quality.
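A NumPy sketch of the forward pass, assuming the common initialization where B starts at zero so training begins from the pretrained behavior (dimensions and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                       # model dim and low rank (r << d)
W = rng.normal(size=(d, d))        # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01 # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, zero at init

def lora_forward(x, alpha=8):
    # Effective weight is W + (alpha / r) * B @ A; only A and B are trained,
    # so trainable parameters drop from d*d to 2*r*d.
    return x @ (W + (alpha / r) * B @ A).T

x = rng.normal(size=(2, d))
out = lora_forward(x)
```

With B zeroed, the adapted model initially matches the frozen base model exactly; gradients then flow only into A and B.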
Latency
The time required to generate outputs, critical for user-facing applications. Lower latency improves user experience but may require model compression or architectural innovations.
LayerNorm
A normalization technique that normalizes activations across feature dimensions within a layer, stabilizing training and improving convergence in deep networks.
Leiden Algorithm
A community-detection (graph clustering) algorithm, used in Graph RAG systems such as Microsoft's GraphRAG to partition knowledge graphs into communities for hierarchical summarization and retrieval.
LLMLingua
A technique for intelligently compressing prompts while preserving information density, reducing context length and inference costs in RAG systems.
Logits
The raw unnormalized output scores produced by a model before softmax, representing the model's preferences for each possible token before converting to probabilities.
Mamba (SSM)
A state-space model architecture with linear-time complexity in sequence length, positioned as an alternative to transformers with better efficiency on long sequences.
Masked Language Model (MLM)
A training objective where random tokens are masked and the model predicts them from context, used in BERT and encoder-only models for learning bidirectional representations.
Megatron-LM
An NVIDIA library providing efficient implementations of tensor parallelism, pipeline parallelism, and other distributed training techniques for large-scale model training.
MMLU (Massive Multitask Language Understanding)
A comprehensive benchmark spanning 57 diverse knowledge domains, widely used to evaluate general knowledge and reasoning capabilities of language models.
Mixture of Experts (MoE)
An architecture where different expert networks specialize on different input types, selected by a gating network, enabling scaling without proportional increase in computation.
Multi-Head Attention (MHA)
Attention mechanism using multiple parallel attention heads, each learning different attention patterns; head outputs are concatenated and projected back to the model dimension.
Multi-Query Attention (MQA)
An attention variant where all query heads share a single key and value head, reducing memory and computation during inference while maintaining near-original quality.
MTEB (Massive Text Embedding Benchmark)
A comprehensive benchmark evaluating text embedding models across diverse tasks (classification, clustering, retrieval), enabling standardized comparison of embedding quality.
N - P
N:M Sparsity
A structured pruning pattern keeping at most N non-zero weights in every group of M consecutive weights (e.g., 2:4 sparsity), compatible with specialized hardware such as NVIDIA's sparse tensor cores for efficient sparse inference.
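Enforcing a 2:4 pattern by keeping the two largest-magnitude weights per group of four (an illustrative NumPy sketch, not a hardware kernel; assumes the weight count divides evenly by M):

```python
import numpy as np

def apply_nm_sparsity(w, n=2, m=4):
    """Keep the n largest-magnitude weights in each group of m consecutive weights."""
    flat = w.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude weights in each group
    idx = np.argsort(np.abs(flat), axis=1)[:, :-n]
    out = flat.copy()
    np.put_along_axis(out, idx, 0.0, axis=1)   # zero out the pruned positions
    return out.reshape(w.shape)

w = np.arange(1.0, 9.0).reshape(2, 4)          # [[1,2,3,4],[5,6,7,8]]
sparse = apply_nm_sparsity(w)
```

Every group of four retains exactly two non-zeros, which is the structural guarantee hardware sparse kernels rely on.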
Natural Language Inference (NLI)
A task involving determining whether given premises entail, contradict, or are neutral to hypotheses, testing logical reasoning in language understanding.
Next-Token Prediction
The core training objective of causal language models predicting the next token given previous tokens, the foundation for autoregressive generation.
Normalization
A class of techniques (layer norm, batch norm, group norm) that standardize activations, stabilizing training and improving convergence in deep networks.
ONNX (Open Neural Network Exchange)
An open standard format for representing machine learning models, enabling model portability and optimization across different frameworks and hardware.
Optimizer
An algorithm for updating model parameters based on gradients. Common optimizers (SGD, Adam, AdamW) balance convergence speed, generalization, and computational efficiency.
PagedAttention
A memory management technique for KV caches treating them as pages, enabling efficient sharing and dynamic memory allocation during batch inference.
Parameter-Efficient Fine-Tuning (PEFT)
A class of techniques (LoRA, adapters, prefix tuning) enabling effective fine-tuning with significantly fewer trainable parameters than full fine-tuning.
Perplexity
A metric measuring how surprised a language model is by test data, computed as the exponentiated cross-entropy loss. Lower perplexity indicates better modeling of text.
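Given the model's probability for each target token, perplexity is the exponentiated average negative log-likelihood:

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probabilities of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A model that assigns uniform probability over 4 choices at every step has perplexity 4, matching the intuition of "effective branching factor."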
Pipeline Parallelism
A distributed training strategy splitting model layers across devices with micro-batches, enabling training of very large models by overlapping computation and communication.
Post-Training Quantization (PTQ)
Applying quantization after model training completes using calibration datasets, more convenient than QAT but potentially with lower quality due to lack of training adaptation.
PPO (Proximal Policy Optimization)
A reinforcement learning algorithm widely used in alignment training (RLHF) for optimizing language models against reward models, balancing exploration and exploitation.
Prefix Tuning
A parameter-efficient method prepending trainable prefix tokens to inputs, enabling task adaptation without modifying model weights, useful for multi-task scenarios.
Prompt Tuning
A method learning task-specific continuous prompts through gradient descent rather than hand-crafting discrete text prompts, enabling efficient task-specific adaptation.
Pruning
Removing less important weights or neurons from trained models to reduce size and computation while maintaining performance, enabling efficient deployment.
Q - R
QAT (Quantization-Aware Training)
Including quantization simulation during training so models learn to work effectively at lower precisions, generally achieving better quality than post-training quantization.
QLoRA
Combines 4-bit quantization of the frozen base model with LoRA adapters for parameter-efficient fine-tuning, enabling fine-tuning of 65B-parameter models on a single 48GB GPU.
Quantization
Converting high-precision weights and activations to lower precision (int8, int4, etc.), reducing memory and computation while maintaining acceptable accuracy.
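A minimal symmetric per-tensor int8 sketch (real schemes add per-channel scales, zero points, and calibration data):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: one scale for the whole tensor."""
    scale = np.abs(w).max() / 127.0            # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale        # approximate reconstruction

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
```

The round trip is lossy: each value is reconstructed only to within half a quantization step, which is the accuracy cost traded for 4x smaller storage than FP32.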
RAG (Retrieval-Augmented Generation)
Augmenting language models with external knowledge by retrieving relevant documents at generation time, reducing hallucinations and enabling knowledge updates without retraining.
RAGAS (Retrieval-Augmented Generation Assessment)
A framework for evaluating RAG system quality across multiple dimensions (faithfulness, relevance, coherence), enabling systematic comparison of RAG implementations.
Recall@K
A retrieval metric measuring what fraction of relevant documents appear in the top K results, essential for evaluating retrieval component quality in RAG systems.
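The metric itself is simple to state in code:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents appearing in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)
```

For example, retrieving `["d1", "d3", "d9"]` against relevant set `["d1", "d2"]` gives Recall@2 of 0.5, since only one of the two relevant documents appears in the top two results.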
RECOMP
A compression method for in-context learning that distills long documents into short summaries while preserving task-relevant information for more efficient processing.
Red-Teaming
Systematic adversarial testing where teams actively try to break or misuse AI systems to identify vulnerabilities and failure modes before deployment.
Reranker
A cross-encoder model that re-ranks initially retrieved candidates by jointly encoding queries and documents, improving precision in multi-stage retrieval pipelines.
Residual Connection
Shortcuts in neural networks enabling gradients to flow directly through layers via skip connections (x + f(x)), crucial for training deep networks.
Reward Model
A learned model estimating preference scores for different outputs, used in RLHF to guide language model training toward human preferences without per-example annotation.
RLHF (Reinforcement Learning from Human Feedback)
An alignment technique using human feedback to train reward models, then optimizing language models via RL to maximize rewards, aligning outputs with human values.
RMSNorm
A simplified layer normalization that scales activations by their root mean square without subtracting the mean, making it computationally cheaper than LayerNorm; preferred in modern models such as LLaMA.
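The operation in a few lines of NumPy (the eps default is a typical choice; gamma is the learned per-feature scale):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: scale by root mean square; no mean subtraction, unlike LayerNorm."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([[3.0, 4.0]])
out = rms_norm(x, gamma=np.ones(2))
```

With gamma set to ones, the output vector always has unit root mean square, regardless of the input's scale.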
RoPE (Rotary Position Embedding)
A position encoding method that rotates query and key vectors by angles proportional to token position, encoding relative positions directly in attention scores; extensions such as position interpolation allow contexts longer than those seen in training.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Metrics comparing n-gram overlap between generated and reference summaries, widely used in evaluating abstractive summarization despite known limitations.
S - T
Sampling (Temperature, Top-K, Top-P)
Decoding strategies where temperature rescales logits to control randomness, top-K samples only from the K highest-probability tokens, and top-P (nucleus sampling) samples from the smallest set of tokens whose cumulative probability exceeds P.
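All three strategies composed in one illustrative NumPy function (the function name and defaults are ours, not any library's API):

```python
import numpy as np

def sample_filtered(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample one token id after temperature, top-k, and top-p filtering."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax over the vocabulary
    order = np.argsort(probs)[::-1]               # tokens by descending probability
    keep = np.ones_like(probs, dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False               # drop all but the k most likely
    if top_p is not None:
        cum = np.cumsum(probs[order])
        # Drop every token ranked after the cumulative mass first reaches top_p
        keep[order[1:][cum[:-1] >= top_p]] = False
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()                          # renormalize the surviving mass
    rng = rng or np.random.default_rng(0)
    return rng.choice(len(probs), p=probs)
```

With `top_k=1` this degenerates to greedy decoding; raising the temperature flattens the distribution before any truncation is applied.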
Scaling Laws
Empirical relationships showing how model performance improves with increases in model size, dataset size, and compute, predicting optimal resource allocation for training.
Self-Attention
Attention mechanism where a sequence attends to itself, with queries, keys, and values all derived from the same input, enabling modeling of dependencies within sequences.
Self-Consistency
A decoding method sampling multiple diverse outputs and selecting the most consistent answer, improving reasoning accuracy by leveraging ensemble effects.
Self-Distillation
Using a model as both teacher and student to improve quality, where the model learns from its own ensemble or better-sampled outputs, enabling self-improvement.
SentencePiece
A language-agnostic tokenization library implementing BPE and unigram language models, widely used for training tokenizers in modern multilingual models.
Sequence Parallelism
Distributing sequence length across devices rather than just batch dimension, enabling training with longer sequences while reducing per-device memory requirements.
SFT (Supervised Fine-Tuning)
Fine-tuning on high-quality demonstration data where inputs are paired with desired outputs, enabling models to follow instructions and improve task performance.
Sliding Window Attention
An attention pattern where each token only attends to a fixed window of surrounding tokens, reducing computation from quadratic to linear while maintaining local context.
Softmax
A function that converts logits into a probability distribution by exponentiating and normalizing; in attention, it turns raw scores into weights that sum to 1.
SparseGPT
A one-shot pruning method removing weights based on Hessian-weighted importance scores, enabling aggressive sparsity without retraining while maintaining performance.
Speculative Decoding
An inference acceleration technique where a smaller draft model generates candidates and a larger model validates them in parallel, reducing latency without quality loss.
SwiGLU
An activation function combining Swish activation with gating mechanisms in feedforward networks, improving training dynamics and convergence in transformers.
Temperature
A hyperparameter controlling output randomness during decoding. Higher temperatures increase diversity, lower values make sampling more deterministic and greedy.
Tensor Parallelism
Splitting large tensors (weight matrices) across multiple devices and computing in parallel, enabling training of models that exceed single-device memory.
TensorRT-LLM
NVIDIA's inference optimization library providing highly optimized kernels and techniques (quantization, paging, batching) for efficient LLM deployment.
Token
The smallest unit a model processes, typically a subword unit produced by tokenization. Models process token sequences rather than raw text.
Tokenizer
An algorithm converting raw text into integer token sequences using vocabularies built through methods like BPE. Quality tokenizers are essential for model performance.
Top-K / Top-P
Sampling methods restricting the decoding to either the K most likely tokens or the smallest set summing to probability P, balancing quality and diversity in generation.
Transformer
An architecture based on self-attention mechanisms rather than recurrence, achieving superior performance on NLP tasks and forming the foundation of modern language models.
Triton Inference Server
An open-source inference platform supporting multiple backends and frameworks, enabling efficient model serving with features like dynamic batching and model versioning.
TRL (Transformer Reinforcement Learning)
Hugging Face's library for training language models with reinforcement learning, providing convenient implementations of PPO, DPO, and other alignment algorithms.
U - Z
Vector Database
Specialized databases optimized for storing and querying high-dimensional embeddings, enabling efficient similarity search essential for RAG retrieval components.
vLLM
A high-performance inference library built around PagedAttention and continuous batching, delivering order-of-magnitude throughput gains for LLM serving over naive implementations.
Vision Transformer (ViT)
Applying transformer architecture to image patches for vision tasks, demonstrating that transformers aren't language-specific and can effectively process visual information.
Wanda (Pruning by Weights and Activations)
A pruning method that scores each weight by the product of its magnitude and the norm of its input activations, removing low-scoring weights; it reaches roughly 50% sparsity with minimal accuracy loss and no retraining.
Weight Decay
An L2 regularization technique adding penalties proportional to weight magnitudes to the loss function, preventing overfitting and improving generalization.
WordPiece
A tokenization algorithm used by BERT that greedily merges frequently co-occurring character sequences, balancing vocabulary coverage with compression.
ZeRO (Zero Redundancy Optimizer)
DeepSpeed's technique partitioning optimizer states, gradients, and parameters across devices, enabling massive model training with memory-efficient distributed strategies.