LLM Fine-Tuning

Methods, Alignment & Production Deployment

  • LoRA: up to 5x training speedup
  • Full SFT: 100% parameter control
  • RLHF: state-of-the-art alignment

Overview

Adapting large language models for specific domains and behaviors

What is Fine-Tuning?

Fine-tuning is the process of updating LLM parameters on task-specific or domain-specific data to adapt pre-trained models for specialized use cases. Unlike prompt engineering, fine-tuning modifies model weights to learn new patterns, vocabulary, and behaviors.

Key themes: domain adaptation, instruction following, preference learning, personalization.

When to Fine-Tune

  • Specialized domain (finance, legal, medical)
  • Custom instruction format or style
  • Preference alignment with human feedback
  • Performance on benchmark tasks
  • Cost reduction via smaller models

Key Use Cases

  • Healthcare: clinical documentation
  • Finance: risk analysis, compliance
  • Legal: contract review, due diligence
  • Customer support: brand voice
  • Chatbots: user interaction patterns

Methods Comparison

Trade-offs between different fine-tuning approaches

Method | Trainable Params | Compute Cost | Inference Impact | Use Case
Full SFT | 100% | Very High | None | Large resource budget
LoRA | 0.1–1% | Low | Minimal | Fast iteration, multiple tasks
Adapters | 0.1–1% per layer | Low | Minimal | Modular tuning
Prefix Tuning | 0.01% | Very Low | Slightly slower | Rapid prototyping
Prompt Tuning | <0.01% | Very Low | None | Minimal intervention
RLHF/PPO | 100% (usually) | Very High | None | Alignment with preferences
DPO | 100% | Moderate | None | Preference learning (simplified)

Full SFT

Supervised Fine-Tuning with comprehensive parameter updates

Supervised Fine-Tuning (SFT)

Full SFT updates all model parameters via cross-entropy loss on input-output pairs. It's the foundation for all fine-tuning and enables the largest capability improvements.

Advantages

  • Maximum performance gains
  • No inference overhead
  • Full architectural flexibility
  • Proven at scale (ChatGPT)

Challenges

  • High GPU memory (80GB+)
  • Long training time (days)
  • Catastrophic forgetting risk
  • Expensive at scale

Training Configuration

# SFT configuration example
model_name: "falcon-7b"
learning_rate: 2e-5
batch_size: 16
max_epochs: 3
warmup_steps: 500
max_seq_length: 2048
gradient_accumulation: 4
dtype: "bfloat16"
Catastrophic Forgetting

SFT can degrade base model capabilities. Mitigate with: (1) Replay buffers mixing base data, (2) Lower learning rates, (3) Early stopping via validation set.
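In code, the replay-buffer mitigation amounts to blending a slice of base-distribution data into the fine-tuning set; `mix_with_replay` below is an illustrative helper, not a library function.

```python
import random

def mix_with_replay(task_examples, base_examples, replay_frac=0.15, seed=0):
    """Mix fine-tuning data with a fraction of base-distribution data
    to reduce catastrophic forgetting (10-20% replay is typical)."""
    rng = random.Random(seed)
    # Number of replay examples so they make up replay_frac of the final mix
    n_replay = int(len(task_examples) * replay_frac / (1 - replay_frac))
    replay = rng.sample(base_examples, min(n_replay, len(base_examples)))
    mixed = task_examples + replay
    rng.shuffle(mixed)
    return mixed
```

The same helper works for continual retraining: pass the previous training set as `base_examples` to keep older behavior anchored.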

LoRA & PEFT

Parameter-Efficient Fine-Tuning with low-rank adapters

Low-Rank Adaptation (LoRA)

Instead of updating all W parameters, LoRA adds trainable low-rank decomposition matrices (A, B) to attention and feed-forward layers. Key insight: fine-tuning updates are inherently low-rank.

Key traits: parameter-efficient, fast iteration, multi-task.

LoRA mechanism: W' = W + AB^T. The original weight W ∈ R^(d×k) (e.g., d = k = 4096) stays frozen; only the low-rank factors A and B are trained (r = 8 adds ~64K parameters per matrix pair). After training, AB^T can be merged into W, so inference cost is unchanged.

LoRA Hyperparameters

  • rank (r): 4-16 typical, 8 recommended
  • alpha (α): scaling factor, usually α = 2r
  • dropout: 0.05-0.1 for regularization
  • target_modules: q_proj and v_proj for efficiency
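For intuition on adapter size, the parameters added by one LoRA pair are r·(d + k); `lora_param_count` is a hypothetical helper for this arithmetic.

```python
def lora_param_count(d, k, r):
    """Parameters added by one LoRA decomposition:
    A is d x r and B is k x r, so r * (d + k) in total."""
    return r * (d + k)

# A 4096x4096 projection at rank 8 adds 65,536 params (~64K),
# versus 16.7M parameters for the full matrix.
```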

Benefits over Full SFT

  • 5-10x faster training
  • 10-20x smaller adapters
  • No inference cost when merged
  • Mix multiple task adapters

LoRA Configuration (HuggingFace PEFT)

from peft import get_peft_model, LoraConfig

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)

RLHF & DPO

Aligning models with human preferences

Reinforcement Learning from Human Feedback (RLHF)

RLHF is a three-stage pipeline that aligns models with human preferences using reward signals and PPO optimization.

Stage 1 (SFT): Fine-tune on curated instruction-response pairs (100K-1M examples).
Stage 2 (Reward model): Train a ranking model on "which response is better?" comparisons (50K-100K pairs).
Stage 3 (PPO): Optimize the SFT policy against the reward signal, with a KL penalty to prevent drift.

Typical timeline: Stage 1 (SFT) 1-2 days on 8× A100; Stage 2 (reward) 1 day on 8× A100; Stage 3 (PPO) 3-5 days on 8× A100.

RLHF cost multiplier: ~10x the cost of SFT alone, largely due to sampling during PPO.

PPO Hyperparameters

  • learning_rate: 1e-5 (smaller)
  • batch_size: 4-8 (smaller)
  • kl_penalty: 0.1-0.2
  • clip_ratio: 0.2 (PPO clipping)
  • epochs: 3-4 per batch
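As a sketch of what `clip_ratio` controls, the PPO clipped surrogate objective can be written in a few lines of PyTorch; this is illustrative, not TRL's implementation.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_ratio=0.2):
    """PPO clipped surrogate: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)],
    where r is the probability ratio between new and old policies."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

The clamp caps how far a single update can push the policy; in RLHF this term is combined with the KL penalty against the base model.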

DPO: Direct Preference Optimization

Simpler alternative that skips reward model. Directly optimize on preference pairs (chosen vs. rejected) without RL. ~3x faster than RLHF, comparable results.

  • Single-stage training
  • No sampling complexity
  • Empirically stable
Reward Hacking & KL Divergence

Models can exploit reward function or diverge from base model. Address with: (1) Reference model KL penalty, (2) Validation on held-out prompts, (3) Careful reward function design.

Instruction Tuning

Aligning models to follow user instructions effectively

What is Instruction Tuning?

Instruction tuning is SFT on diverse task instruction-response pairs. Goal: teach model to interpret instructions and produce relevant outputs across many tasks, improving generalization to unseen instructions.

Key Characteristics

  • Diverse task coverage (100-1000+ tasks)
  • Explicit instruction format
  • Quality output examples
  • Enables zero-shot generalization

Popular Datasets

  • Alpaca: 52K examples via text-davinci-003
  • Flan: 1.3M examples, 146 tasks
  • SuperNatural: 1000+ tasks
  • Custom: domain-specific instructions

Example Instruction Data Format

# JSON format for an instruction dataset
{
  "instruction": "Translate the following text to French",
  "input": "Hello, how are you?",
  "output": "Bonjour, comment allez-vous?"
}

# Chat-messages format (for newer models)
{
  "messages": [
    {"role": "system", "content": "You are a helpful translator."},
    {"role": "user", "content": "Translate to French: Hello"},
    {"role": "assistant", "content": "Bonjour"}
  ]
}
Instruction Quality Matters

Models learn to follow instruction patterns. High-quality, diverse instructions yield better generalization. Use 20-50 examples per task if custom-building datasets.

Data Strategies

Curation, augmentation, and quality control

Data Curation & Labeling

  • Human experts: Domain specialists label high-quality examples
  • Synthetic data: Use LLMs (e.g., GPT-4) to generate examples
  • Cost: $5-50 per label depending on complexity

Data Augmentation

  • Paraphrasing: Rephrase inputs/outputs
  • Back-translation: Translate A→B→A
  • Instruction variation: Multiple phrasings
  • Multiplier: 2-5x dataset size
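Instruction variation can be as simple as applying a handful of templates to each example; the templates below are illustrative, not a standard set.

```python
def vary_instruction(instruction, task_input):
    """Generate several phrasings of one instruction-input pair,
    giving the 2-5x dataset multiplier mentioned above."""
    instr_lower = instruction[0].lower() + instruction[1:]
    templates = [
        "{instr}: {inp}",
        "Please {instr_lower} the following. {inp}",
        "Task: {instr}\nInput: {inp}",
    ]
    return [
        t.format(instr=instruction, instr_lower=instr_lower, inp=task_input)
        for t in templates
    ]
```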

Data Quality Pipeline

# Data quality filtering with Presidio (PII detection) and simple heuristics
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

def has_pii(text):
    results = analyzer.analyze(text=text, language="en")
    return len(results) > 0

def filter_quality(text):
    # Length check: drop near-empty examples
    if len(text.split()) < 3:
        return False
    # PII check: drop examples containing detected PII
    if has_pii(text):
        return False
    return True

Human-in-the-Loop (HITL)

Iterative refinement:

  1. Initial model predicts
  2. Human reviews/corrects
  3. Use corrections to retrain
  4. Converge to domain expertise

Continual Data Updates

Keep models fresh:

  • Monthly/quarterly retraining
  • Replay buffer (old + new data)
  • Federated learning for privacy

Loss Functions

Objectives for training, alignment, and evaluation

Loss Type | Formula / Description | Use Case
Cross-Entropy (SFT) | CE = -Σ_t log P(y_t | x, y_<t) | Supervised fine-tuning
PPO Objective | L = E[min(r_t·Â_t, clip(r_t, 1-ε, 1+ε)·Â_t)] - λ·KL | Preference alignment (RLHF)
DPO Loss | L = -log σ(β log(π(y_w)/π_ref(y_w)) - β log(π(y_l)/π_ref(y_l))) | Direct preference learning
Reward Model (Ranking) | L = -log σ(r(y_w) - r(y_l)) | Training reward models for RLHF
Multi-Task Loss | L = Σ_i w_i · L_i | Training on multiple tasks
KL Penalty | L_KL = KL(π_new || π_ref) | Prevent divergence from base
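The ranking and KL rows above translate directly to PyTorch; a minimal sketch with our own helper names:

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen, r_rejected):
    """Pairwise ranking loss for reward models: -log sigmoid(r_w - r_l)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def approx_kl(logp_new, logp_ref):
    """Sample-based approximation of KL(pi_new || pi_ref), computed on
    token log-probs of sequences sampled from pi_new."""
    return (logp_new - logp_ref).mean()
```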

Cross-Entropy Loss

Standard for SFT. Penalizes incorrect token predictions. Works well for next-token prediction.

import torch

loss_fn = torch.nn.CrossEntropyLoss()
logits = model(input_ids).logits              # (batch, seq_len, vocab_size)
vocab_size = logits.size(-1)

# Shift so that position t predicts token t+1
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()
loss = loss_fn(shift_logits.view(-1, vocab_size), shift_labels.view(-1))

DPO Loss Implementation

Directly optimize preference pairs without reward model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_logps_w, policy_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    # Implicit rewards are beta-scaled log-ratios against the frozen reference
    logits = beta * ((policy_logps_w - ref_logps_w) - (policy_logps_l - ref_logps_l))
    return -F.logsigmoid(logits).mean()
Loss Function Selection Matters

Wrong loss can lead to reward hacking (RLHF), poor generalization (multi-task), or divergence (no KL). Match loss to objective and validate on held-out test sets.

Evaluation

Metrics, benchmarks, and validation strategies

Language Metrics

  • Perplexity: Lower is better
  • BLEU: Sequence similarity
  • ROUGE: Coverage overlap
  • F1: Classification tasks

Safety & Alignment

  • SafeBench: Adversarial safety
  • TruthfulQA: Factuality
  • Human eval: Likert scales
  • Calibration (ECE): Confidence
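Expected calibration error (ECE) from the list above can be computed with a simple binning procedure; a self-contained sketch:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then take the weighted
    average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - acc)
    return ece
```

A well-calibrated model scores near 0; RLHF-tuned models often score worse than their base models, which is why ECE belongs in the eval suite.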

Production Metrics

  • Latency: p50/p95/p99
  • Throughput: tokens/sec
  • Cost: $ per request
  • Drift: Performance over time

Evaluation Suite Example

from datasets import load_dataset
from evaluate import load

# Load benchmarks
mmlu = load_dataset("cais/mmlu", "all")
truthful_qa = load_dataset("truthful_qa", "generation")

# Metrics
rouge = load("rouge")
bleu = load("bleu")
accuracy = load("accuracy")

# Validation loop (field names depend on your dataset schema)
def evaluate_model(model, dataset):
    scores = []
    for batch in dataset:
        preds = model.generate(batch["input"])
        scores.append(rouge.compute(predictions=preds, references=batch["output"]))
    return scores
Validation Strategy

Use hold-out test set (10-20% of data). Measure on task-specific metrics AND safety/alignment. Early stopping on validation loss or custom metric to prevent overfitting.
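Early stopping on validation loss, as recommended above, in minimal form (`EarlyStopping` is our own sketch, not a framework class):

```python
class EarlyStopping:
    """Stop training once validation loss has not improved
    for `patience` consecutive evaluations."""
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss):
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

Call `step()` after each validation pass and break the training loop when it returns True.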

Deployment & MLOps

Production infrastructure and CI/CD pipelines

Serving Options

  • REST API: Simple HTTP endpoints
  • gRPC: Low-latency, streaming
  • Embedded: On-device (mobile/edge)
  • Batch: Offline processing

Inference Frameworks

  • Triton Inference: Multi-backend GPU
  • Ray Serve: Distributed scaling
  • BentoML: Model packager
  • KServe: K8s-native serving

MLOps Pipeline (Continuous Training)

# Example: GitHub Actions + MLflow + Ray
name: Fine-Tuning CI/CD
on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly retraining
jobs:
  train:
    steps:
      - name: Fetch new data
        run: python fetch_data.py
      - name: Train with Ray Tune
        run: ray train --config lora_config.yaml
      - name: Evaluate
        run: python evaluate.py
      - name: Register to MLflow
        run: mlflow models register
      - name: A/B test canary
        run: kubectl apply -f canary.yaml

Model Registry

  • MLflow: Version tracking
  • HuggingFace Hub: Community sharing
  • W&B: Experiment logging
  • Metadata: metrics, params, tags

Monitoring & Alerts

  • Latency p95 < 500ms
  • Cost per request tracking
  • Hallucination detection
  • Reward signal drift

Cost Analysis & ROI

Complete training and inference cost breakdown with 2026 GPU pricing

  • A100/H100 GPU: $1.50-3.00 per hour
  • LoRA cost reduction: 5-50x
  • API cost: $0.05-5.00 per 1M tokens
  • Typical break-even: 1-4 weeks

Cloud GPU Pricing (March 2026)

GPU | VRAM | On-Demand $/hr | Spot/Preemptible | Best For
NVIDIA A100 40GB | 40GB | $1.50-2.30 | $0.78-1.20 | LoRA fine-tuning 7-13B models
NVIDIA A100 80GB | 80GB | $2.00-3.00 | $1.00-1.80 | Full SFT 7B, QLoRA 70B
NVIDIA H100 SXM | 80GB | $2.40-4.00 | $1.50-2.50 | Fast training, RLHF, large-batch DPO
4× A100 80GB | 320GB | $8.00-12.00 | $4.00-7.00 | Full SFT 70B, multi-GPU RLHF
8× H100 SXM | 640GB | $20.00-32.00 | $12.00-20.00 | Full SFT 70B+, production RLHF pipelines

Training Cost by Method & Model Size

Method | Model Size | GPU Setup | GPU-Hours | Cloud Cost | Data Prep | Total
LoRA (r=8) | 7-8B | 1× A100 40GB | 4-8 hrs | $6-18 | $500-2K | $500-2K
QLoRA (4-bit) | 70B | 1× A100 80GB | 8-16 hrs | $16-48 | $1-3K | $1-3K
Full SFT | 7-8B | 2× A100 80GB | 20-50 hrs | $80-300 | $1-3K | $1-3.5K
Full SFT | 70B | 8× H100 | 50-150 hrs | $1-5K | $2-5K | $3-10K
DPO | 7-8B | 2× A100 80GB | 20-50 hrs | $80-300 | $2-5K | $2-5.5K
RLHF (PPO) | 7-8B | 4× A100 80GB | 200-500 hrs | $1.6-5K | $5-15K | $7-20K

Fine-Tuning vs API: Break-Even Analysis

Scenario: 100K queries/day

API route (GPT-4o-mini): ~500 tokens/query × 100K = 50M tokens/day
Cost: $0.15/1M input + $0.60/1M output ≈ $37.50/day = $1,125/month

Fine-tuned Llama-3.3-8B (self-hosted):
1× A100 80GB at $2/hr = $1,440/month + one-time fine-tune $2K

Fine-tuned (quantized INT4):
1× A40 48GB at $0.80/hr = $576/month + one-time fine-tune $2K

Break-even: Quantized fine-tuned model pays back in ~4 months, then saves $549/month (49%)

Scenario: 1M queries/day

API route (GPT-4o-mini):
500M tokens/day ≈ $375/day = $11,250/month

Fine-tuned Llama-3.3-8B (vLLM, 4× A100):
$8/hr × 24 × 30 = $5,760/month + one-time $3K

Break-even: ~18 days. First-year savings: $62,880 net of the one-time cost (46%). Plus: lower latency, data privacy, no vendor lock-in.
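The payback arithmetic in both scenarios reduces to one helper (`break_even_days` is ours; the example numbers come from the 1M-queries/day scenario):

```python
def break_even_days(api_monthly, hosted_monthly, one_time_cost):
    """Days until a one-time fine-tuning cost is recovered by the
    monthly savings of self-hosting versus the API."""
    monthly_savings = api_monthly - hosted_monthly
    if monthly_savings <= 0:
        return float("inf")  # self-hosting never pays back
    return one_time_cost / monthly_savings * 30

# 1M queries/day: $11,250 API vs $5,760 self-hosted, $3K one-time fine-tune
```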

Cost Optimization Strategies

Training Savings

  • LoRA/QLoRA: 5-50x cheaper
  • Spot instances: 40-70% off
  • BF16/FP16: 2x memory efficiency
  • Gradient accum: Fewer GPUs
  • Flash Attention 2: 2-3x speedup

Inference Savings

  • INT4 quantization: 4x less VRAM
  • vLLM batching: 10-50x throughput
  • KV-cache: Reduce recomputation
  • Speculative decoding: 2-3x faster
  • Response caching: Skip repeated
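Response caching from the list above is just memoization keyed on the exact prompt; a minimal sketch where `generate_fn` stands in for the model call:

```python
import hashlib

class ResponseCache:
    """Skip regeneration for repeated prompts (exact-match cache)."""
    def __init__(self):
        self._store = {}

    def _key(self, prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get_or_generate(self, prompt, generate_fn):
        # Only call the (expensive) generator on a cache miss
        k = self._key(prompt)
        if k not in self._store:
            self._store[k] = generate_fn(prompt)
        return self._store[k]
```

Production systems usually add a TTL and an eviction policy; semantic (embedding-based) caching can also catch near-duplicate prompts.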

Data Savings

  • Synthetic data: 10x cheaper than human
  • Active learning: Label only hard cases
  • Curriculum: Fewer epochs needed
  • DPO vs RLHF: 40-75% cheaper
  • Distill then tune: Smaller base
Decision Framework

  • <10K queries/day: Use API (GPT-4o-mini or Claude Haiku at $0.25/1M).
  • 10K-100K/day: Fine-tune with LoRA + quantize; self-host on 1 GPU.
  • 100K-1M/day: Full SFT or DPO + vLLM on 2-4 GPUs; saves 40-60% vs API.
  • >1M/day: Fine-tune + quantize + dedicated cluster; saves 60-80% vs API, payback in 2-4 weeks.
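The framework's thresholds can be codified directly; the returned strings are shorthand for the recommendations above:

```python
def choose_strategy(queries_per_day):
    """Map daily query volume to the deployment strategy from the
    decision framework (thresholds taken from the text)."""
    if queries_per_day < 10_000:
        return "API"
    if queries_per_day < 100_000:
        return "LoRA + quantize, self-host on 1 GPU"
    if queries_per_day < 1_000_000:
        return "Full SFT or DPO + vLLM on 2-4 GPUs"
    return "Fine-tune + quantize + dedicated cluster"
```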

Safety & Alignment

Ensuring models behave safely and responsibly

Safety Risks

  • Overconfidence after RLHF
  • Reward hacking (gaming objective)
  • Capability regression
  • Adversarial jailbreaks
  • Bias amplification

Mitigation Strategies

  • Constitutional AI: Encode safety rules
  • Red-teaming: Adversarial testing
  • Calibration: Uncertainty estimates
  • KL penalties: Limit divergence

Constitutional AI: Principle-Based Alignment

# Constitutional AI: define safety principles
principles = [
    "Be helpful and harmless",
    "Don't provide medical advice",
    "Refuse illegal requests",
    "Acknowledge uncertainty",
]

# Critique: ask the model to flag unsafe content in its own response
critique_prompt = """
Identify any harmful or unsafe content in the response.
Rate on a scale of 1-5 where 5 is most harmful.
"""

# Revise: the model improves its response based on the critique
revision_prompt = """
Provide an improved response that adheres to our principles.
"""
Adversarial Testing

Use benchmarks like AdvBench (a suite of adversarial jailbreak prompts) and the OWASP Top 10 for GenAI. Test for: prompt injection, model extraction, bias amplification, hallucinations.

Privacy & Compliance

Data governance, regulation, and secure training

PII Handling

  • Detection: spaCy NER, Presidio
  • Redaction: Mask PII tokens
  • Audit trails: Log data access
  • Retention: Auto-delete policies

Privacy Techniques

  • Differential Privacy: Opacus, TensorFlow Privacy
  • Federated Learning: Train on-device
  • Secure Enclaves: Hardware isolation
  • Data minimization: Collect only needed

Compliance Requirements

GDPR

Right to erasure, consent, data portability

HIPAA

Healthcare: encryption, audit logs, BAAs

Model Licensing

Commercial use, CC, proprietary restrictions

Privacy by Design

Plan privacy early: data minimization, consent management, retention policies. Use privacy-preserving techniques (DP, federated) for sensitive domains.

Failure Modes

Common pitfalls and mitigation strategies

Catastrophic Forgetting

Base model capabilities degrade during fine-tuning on specific data.

Fix: Replay buffer mixing base data (10-20%), lower LR, early stopping

Overfitting to Feedback

Model exploits reward function instead of learning genuine behavior.

Fix: KL penalty, diverse reward signals, hold-out test set

Bias & Toxicity Regression

Fine-tuning amplifies harmful biases or toxicity in specific domains.

Fix: Balanced datasets, debiasing, red-teaming, fairness metrics

PPO Instability

PPO training diverges or oscillates in loss/reward.

Fix: Smaller LR (1e-5), fewer PPO epochs, monitor KL divergence

Safety Regressions

Model becomes less safe or generates harmful content after tuning.

Fix: Safety-balanced data, adversarial eval, constitutional AI

Pipeline Failures

Data preprocessing, training, or serving breaks silently.

Fix: Unit tests, integration tests, monitoring, alerting

Implementation Roadmap

5-phase plan from planning to production

Phase 1: Planning & Data Prep

Define objectives, audit/cleanse data, set up MLOps.

Timeline: 1-2 weeks. Focus: planning, data audit.
  • Document fine-tuning goals and success metrics
  • Audit dataset for PII, quality, diversity
  • Set up MLflow, W&B, or similar tracking
  • Plan compute budget and timeline

Phase 2: Prototype & Baseline

SFT baseline, evaluate, trial LoRA, measure trade-offs.

Timeline: 2-4 weeks. Focus: SFT + LoRA, evaluation.
  • Train small SFT baseline (7B-13B model)
  • Evaluate on task-specific metrics + safety
  • Trial LoRA with same data, compare speed/quality
  • Document results and decide method

Phase 3: Alignment & Safety

Human feedback, reward model, PPO/DPO, red-teaming.

Timeline: 4-8 weeks. Focus: RLHF/DPO, red-teaming.
  • Collect human preference data (5K-20K pairs)
  • Train reward model if using RLHF
  • Run PPO/DPO training with KL penalties
  • Red-team for jailbreaks, adversarial examples

Phase 4: Optimization & Testing

Quantization, end-to-end evaluation, A/B testing.

Timeline: 2-4 weeks. Focus: quantization, A/B testing.
  • Apply quantization (INT8, GPTQ) for faster inference
  • End-to-end latency and throughput testing
  • Canary deployment (5-10% traffic)
  • A/B test vs. baseline model, measure win rate

Phase 5: Production Deployment

Full production, monitoring, continuous retraining.

Timeline: ongoing. Focus: CI/CD, monitoring.
  • Deploy to production with versioning
  • Monitor latency, cost, accuracy, drift
  • Set up continuous retraining on new data
  • Plan quarterly reviews and updates

Tools & References

Essential libraries, frameworks, and resources

Training Libraries

  • HuggingFace Transformers: Base models, Trainer API
  • PEFT: LoRA, Adapters, Prefix tuning
  • Accelerate: Distributed training, quantization
  • DeepSpeed: ZeRO, optimization
  • Ray Tune: Hyperparameter tuning

Serving & Deployment

  • BentoML: Model packaging, REST API
  • MLflow: Model registry, serving
  • Ray Serve: Distributed inference
  • KServe: Kubernetes-native
  • TensorRT-LLM: NVIDIA optimization

Monitoring & Evaluation

  • Weights & Biases: Experiment logging
  • TensorBoard: Metrics visualization
  • Hugging Face Evaluate: Benchmarks
  • OpenCompass: LLM leaderboards
  • HELM: Safety evaluation

Data & Infrastructure

  • HuggingFace Datasets: Data loading
  • Presidio: PII detection
  • spaCy: NLP utilities
  • Kubernetes: Container orchestration
  • Airflow: Workflow management

Key Papers & Resources

Foundational

  • LoRA: Low-Rank Adaptation (Hu et al.)
  • Instruction Tuning with FLAN (Wei et al.)
  • Training language models to follow instructions (InstructGPT)

Advanced

  • Direct Preference Optimization (Rafailov et al.)
  • Constitutional AI (Bai et al.)
  • Scaling Laws & Chinchilla (DeepMind)
Recommended Starting Stack

SFT: Transformers Trainer + PEFT (LoRA) + W&B. Serving: BentoML or Ray Serve. Monitoring: W&B + custom dashboards. This covers 80% of use cases with minimal overhead.

HuggingFace Base & SFT Models

Foundation Models for Fine-Tuning Across Use Cases

Large Models (13B+) — Maximum Quality

Model | Params | Context | MMLU | License | Best For
meta-llama/Llama-3.1-70B | 70B | 128K | ~86 | Llama 3.1 | Best open teacher; general SFT for all domains
Qwen/Qwen3-32B | 32B | 128K | ~83 | Apache 2.0 | Multilingual SFT; strong reasoning; permissive license
Qwen/Qwen3-30B-A3B | 30B (3B active) | 262K | ~82 | Apache 2.0 | MoE: 30B quality at 3B inference cost; RAG generation
mistralai/Mistral-Large-2 | 123B | 128K | ~84 | Mistral | Code + reasoning; enterprise applications
microsoft/Phi-4 | 14B | 16K | ~78 | MIT | High quality per parameter; reasoning-focused

Small Models (1B-8B) — Cost-Efficient Fine-Tuning

Model | Params | MMLU | HumanEval | LoRA VRAM | Best For
meta-llama/Llama-3.3-8B | 8B | 73.0 | 72.6 | ~16GB | Best all-around 8B; recommended starting point
Qwen/Qwen3-8B | 8B | ~72 | ~75 | ~16GB | Best code generation at 8B; strong multilingual
HuggingFaceTB/SmolLM3-3B | 3B | ~67 | ~58 | ~8GB | Best 3B model; full training blueprint published
microsoft/Phi-4-mini-instruct | 3.8B | ~70 | ~66 | ~10GB | Edge deployment; reasoning-heavy tasks
google/gemma-2-9b | 9B | ~71 | ~64 | ~18GB | Google ecosystem; good instruction following
mistralai/Mistral-7B-v0.3 | 7B | ~63 | ~40 | ~14GB | Sliding window attention; fast inference
Qwen/Qwen3-1.7B | 1.7B | ~55 | ~35 | ~4GB | Ultra-lightweight; mobile/IoT fine-tuning
# Quick-start: LoRA fine-tuning with any base model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig

# Choose your base model
model_name = "meta-llama/Llama-3.3-8B"  # or Qwen3-8B, SmolLM3-3B, etc.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Apply LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints the trainable share (~0.2% of params)
Model Selection Decision Tree

Budget GPU (16GB): Llama-3.3-8B or Qwen3-8B with QLoRA. Minimal GPU (8GB): SmolLM3-3B or Phi-4-mini with QLoRA 4-bit. Maximum quality: Llama-3.1-70B with QLoRA on 2×A100. Multilingual: Qwen3 family (all sizes). Edge/mobile: Qwen3-1.7B or Qwen3-0.6B.

HuggingFace RAG & Embedding Models

Models for Retrieval-Augmented Fine-Tuning

Embedding Models (Fine-Tunable)

Model | Params | Dims | MTEB | Fine-Tuning Use Case
Qwen/Qwen3-Embedding-8B | 8B | 32-4096 | 70.58 (#1) | Domain-specific retrieval with custom dims
BAAI/bge-m3 | 568M | 1024 | ~66 | Multilingual RAG; dense + sparse + multi-vector
jinaai/jina-embeddings-v3 | 570M | 1024 | ~65 | Multi-task fine-tuning (retrieval + classification)
BAAI/bge-base-en-v1.5 | 109M | 768 | ~63 | English-only domain adaptation; fast fine-tuning
sentence-transformers/all-MiniLM-L6-v2 | 22M | 384 | ~56 | Ultra-fast; fine-tune for domain similarity tasks
sentence-transformers/all-mpnet-base-v2 | 109M | 768 | ~60 | Best sentence-transformer; STS fine-tuning

Reranker Models (Fine-Tunable)

Model | Params | Context | Fine-Tuning Use Case
mixedbread-ai/mxbai-rerank-large-v2 | 1.5B | 8K | Domain reranking; 100+ languages; RL-trained baseline
BAAI/bge-reranker-v2-m3 | 568M | 8K | Multilingual reranking fine-tuning
BAAI/bge-reranker-base | 278M | 512 | Lightweight domain reranker; fast fine-tuning
cross-encoder/ms-marco-MiniLM-L-6-v2 | 22M | 512 | Ultra-fast reranker; MS-MARCO pre-trained
colbert-ir/colbertv2.0 | 110M | 512 | Late-interaction retrieval; use with RAGatouille
# Fine-tune an embedding model for domain-specific retrieval
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load a pre-trained embedding model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Domain-specific training pairs (similarity labels in [0, 1])
train_examples = [
    InputExample(texts=["patient symptoms", "clinical presentation"], label=0.9),
    InputExample(texts=["patient symptoms", "stock market"], label=0.1),
]

# Cosine-similarity loss for retrieval
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Fine-tune
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./domain-embeddings",
)
RAG Fine-Tuning Strategy

Embedding: Fine-tune bge-base-en-v1.5 on domain pairs with contrastive loss for 3-5 epochs. Reranker: Fine-tune bge-reranker-base with domain query-document relevance scores. Generator: SFT Llama-3.3-8B on domain QA pairs with retrieved context. This 3-stage approach maximizes end-to-end RAG quality.

Alignment & RLHF Models

Models & Tools for DPO, PPO, and Instruction Tuning

Base Models for RLHF / DPO

Model | Params | Pre-Aligned? | DPO-Ready? | Best For
meta-llama/Llama-3.3-8B-Instruct | 8B | Yes (SFT + RLHF) | Yes | Further DPO alignment for specific domains
Qwen/Qwen3-8B-Instruct | 8B | Yes (SFT + RLHF) | Yes | Multilingual alignment; strong code + reasoning
meta-llama/Llama-3.3-8B | 8B | No (base) | After SFT | Full RLHF pipeline: SFT → RM → PPO
HuggingFaceTB/SmolLM3-3B | 3B | Yes (instruct) | Yes | Lightweight DPO; resource-constrained alignment
google/gemma-2-9b-it | 9B | Yes (IT) | Yes | Safety-focused alignment fine-tuning

Reward Models

Model | Base | Use Case
OpenAssistant/reward-model-deberta-v3-large-v2 | DeBERTa-v3-large | General preference scoring; lightweight RM
Nexusflow/Starling-RM-34B | Yi-34B | High-quality reward model; strong correlation with human prefs
allenai/tulu-v2.5-13b-uf-rm | Llama-2-13B | UltraFeedback-trained; open RLHF pipeline

Preference Datasets

For DPO Training

  • HuggingFaceH4/ultrafeedback_binarized — 64K preferences; most popular DPO dataset
  • argilla/dpo-mix-7k — 7K high-quality curated preferences
  • Intel/orca_dpo_pairs — Orca-style DPO training data

For Instruction Tuning

  • tatsu-lab/alpaca — 52K GPT-generated instructions; classic dataset
  • HuggingFaceH4/no_robots — 10K human-written instructions
  • Open-Orca/OpenOrca — 4M+ multi-task instruction pairs
# DPO alignment with TRL
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.3-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token

# Load preference dataset (DPO split with chosen/rejected pairs)
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

# DPO training config
dpo_config = DPOConfig(
    output_dir="./dpo-llama",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=5e-7,
    beta=0.1,  # DPO temperature
    warmup_ratio=0.1,
    logging_steps=10,
    gradient_accumulation_steps=4,
)

# Train with DPO
trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
DPO vs RLHF Trade-off

DPO: 40-75% cheaper, more stable training, simpler implementation. RLHF (PPO): 8% unsafe outputs vs DPO's 10% in adversarial tests; better structured reasoning and OOD generalization. Start with DPO; switch to RLHF if safety requirements are strict.

Alignment Pipeline

Step 1: SFT on instruction dataset (Alpaca/OpenOrca). Step 2: DPO with UltraFeedback preferences. Step 3: Red-team evaluation. Step 4: Optional PPO with custom reward model for safety-critical domains.

Glossary of Fine-Tuning Terms

25 key technical terms used throughout this guide, organized alphabetically.

A

  • AdamW: The standard optimizer for LLM training (Adam with decoupled weight decay). Typical learning rates for fine-tuning: 1e-5 to 5e-5 for large models, slightly higher for PEFT methods.
  • Adapter: A small trainable module inserted into frozen Transformer layers, consisting of a down-projection, nonlinearity, and up-projection. Enables task-specific tuning with minimal parameter overhead.
  • Alpaca: Stanford's instruction-tuning dataset, created by generating 52K instruction-response pairs from text-davinci-003. A landmark example of synthetic data for fine-tuning.

C

  • Constitutional AI: Anthropic's alignment technique where the model self-critiques outputs against a set of written principles (a "constitution"), reducing reliance on human annotators for safety training.
  • Curriculum Learning: A training strategy that presents examples in increasing difficulty order (easy→hard), improving convergence and sample efficiency for fine-tuning.

D

  • DPO (Direct Preference Optimization): An alignment method that directly optimizes from preference data using a binary classification loss. 40-75% cheaper than RLHF while achieving comparable alignment quality.

E

  • Epoch: One complete pass through the entire training dataset. Fine-tuning typically uses 2-4 epochs; more epochs risk overfitting, especially with small datasets.

F

  • Federated Learning: Distributed training where data stays on-device and only model updates are shared. Enables privacy-preserving fine-tuning across multiple data owners without centralizing sensitive data.
  • Full Fine-Tuning (SFT): Updating all model parameters on task-specific data. Achieves maximum quality but requires significant GPU memory and compute; 100% of parameters are trainable.

G

  • Gradient Accumulation: Simulating larger batch sizes by accumulating gradients over multiple forward passes before updating weights. Enables training with large effective batches on limited GPU memory.
  • Gradient Checkpointing: Trading compute for memory by recomputing intermediate activations during backpropagation instead of storing them. Reduces memory usage by ~60% at the cost of ~30% slower training.

H

  • Human-in-the-Loop (HITL): An iterative workflow where model outputs are reviewed and corrected by humans, with corrections fed back as training data. Used in both SFT data creation and RLHF preference labeling.

I

  • Instruction Tuning: Fine-tuning an LLM on (instruction, response) pairs to improve instruction-following ability. The first alignment step after pre-training. Examples: Alpaca, FLAN, OpenOrca datasets.

L

  • LoRA (Low-Rank Adaptation): Adds trainable low-rank matrices (rank r, typically 4-16) to frozen attention weights: W' = W + (α/r)·AB^T. Trains 0.1-1% of parameters. Can be merged into base weights at inference for zero overhead.

M

  • Mixed Precision (BF16/FP16): Training with half-precision floating point to reduce memory by 50% and speed up computation. BF16 is preferred for training stability; FP16 requires loss scaling.

P

  • PEFT (Parameter-Efficient Fine-Tuning): A family of methods (LoRA, adapters, prefix tuning, prompt tuning) that fine-tune only a small fraction (<1%) of model parameters while keeping the rest frozen.
  • PPO (Proximal Policy Optimization): The RL algorithm used in RLHF to update the LLM policy. Maximizes expected reward from a reward model while constraining updates with a KL penalty against the base model.
  • Prefix Tuning: A PEFT method that prepends trainable continuous vectors ("prefixes") to the keys and values in each attention layer. ~0.01% trainable parameters.
  • Prompt Tuning: The simplest PEFT method: prepending learnable soft tokens to the input. <0.01% trainable parameters, no architectural changes needed. Works best with very large models.

Q

  • QLoRA: Quantized LoRA, which loads the base model in 4-bit precision (NF4) and adds LoRA adapters on top. Enables fine-tuning a 70B model on a single 48GB GPU.

R

  • Reward Model: A model trained to predict human preference scores for LLM outputs. Used in RLHF to provide reward signals for PPO training; typically trained on pairwise preference data.
  • RLHF: Reinforcement Learning from Human Feedback, a 3-stage alignment pipeline: (1) SFT on demonstrations, (2) train a reward model on preferences, (3) optimize the LLM with PPO against the reward model.

S

  • Synthetic Data: Training data generated by LLMs rather than human annotators, roughly 10× cheaper than human data. Common approach: use a strong teacher model to generate instruction-response pairs for student fine-tuning.

T

  • TRL: Transformer Reinforcement Learning, HuggingFace's library for RLHF and DPO training. Provides SFTTrainer, DPOTrainer, PPOTrainer, and RewardTrainer.

W

  • Weight Decay: A regularization technique that adds a penalty proportional to parameter magnitude to the loss, preventing overfitting during fine-tuning. Typical values: 0.01-0.1.
Full Reference: For a comprehensive glossary covering ALL LLM topics across all documents, see the unified LLM Glossary with 140+ terms.