LLM Fine-Tuning
Methods, Alignment & Production Deployment
Overview
Adapting large language models for specific domains and behaviors
What is Fine-Tuning?
Fine-tuning is the process of updating LLM parameters on task-specific or domain-specific data to adapt pre-trained models for specialized use cases. Unlike prompt engineering, fine-tuning modifies model weights to learn new patterns, vocabulary, and behaviors.
When to Fine-Tune
- Specialized domain (finance, legal, medical)
- Custom instruction format or style
- Preference alignment with human feedback
- Performance on benchmark tasks
- Cost reduction via smaller models
Key Use Cases
- Healthcare: clinical documentation
- Finance: risk analysis, compliance
- Legal: contract review, due diligence
- Customer support: brand voice
- Chatbots: user interaction patterns
Methods Comparison
Trade-offs between different fine-tuning approaches
| Method | Trainable Params | Compute Cost | Inference Impact | Use Case |
|---|---|---|---|---|
| Full SFT | 100% | Very High | None | Large resource budget |
| LoRA | 0.1–1% | Low | Minimal | Fast iteration, multiple tasks |
| Adapters | 0.1–1% per layer | Low | Minimal | Modular tuning |
| Prefix Tuning | 0.01% | Very Low | Slightly Slower | Rapid prototyping |
| Prompt Tuning | <0.01% | Very Low | None | Minimal intervention |
| RLHF/PPO | 100% (usually) | Very High | None | Alignment with preferences |
| DPO | 100% | Moderate | None | Preference learning (simplified) |
Full SFT
Supervised Fine-Tuning with comprehensive parameter updates
Supervised Fine-Tuning (SFT)
Full SFT updates all model parameters via cross-entropy loss on input-output pairs. It's the foundation for all fine-tuning and enables the largest capability improvements.
Advantages
- Maximum performance gains
- No inference overhead
- Full architectural flexibility
- Proven at scale (ChatGPT)
Challenges
- High GPU memory (80GB+)
- Long training time (days)
- Catastrophic forgetting risk
- Expensive at scale
Training Configuration
# SFT Configuration Example
model_name: "falcon-7b"
learning_rate: 2e-5
batch_size: 16
max_epochs: 3
warmup_steps: 500
max_seq_length: 2048
gradient_accumulation: 4
dtype: "bfloat16"SFT can degrade base model capabilities. Mitigate with: (1) Replay buffers mixing base data, (2) Lower learning rates, (3) Early stopping via validation set.
LoRA & PEFT
Parameter-Efficient Fine-Tuning with low-rank adapters
Low-Rank Adaptation (LoRA)
Instead of updating all W parameters, LoRA adds trainable low-rank decomposition matrices (A, B) to attention and feed-forward layers. Key insight: fine-tuning updates are inherently low-rank.
LoRA Hyperparameters
- rank (r): 4-16 typical, 8 recommended
- alpha (α): scaling factor, usually α = 2r
- dropout: 0.05-0.1 for regularization
- target_modules: q_proj, v_proj for efficiency
Benefits over Full SFT
- 5-10x faster training
- 10-20x smaller adapters
- No inference cost when merged
- Mix multiple task adapters
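The parameter savings follow directly from the low-rank shapes: for an adapted weight of shape (d_out, d_in), LoRA adds only r·(d_in + d_out) trainable parameters (A is r×d_in, B is d_out×r). A back-of-envelope count for a hypothetical 7B-scale model — hidden size 4096, 32 layers, q/v projections only; all numbers illustrative:

```python
def lora_param_count(d_in, d_out, r, n_modules_per_layer, n_layers):
    # Each adapted weight gains A (r x d_in) plus B (d_out x r) parameters
    return (r * d_in + d_out * r) * n_modules_per_layer * n_layers

trainable = lora_param_count(4096, 4096, r=8, n_modules_per_layer=2, n_layers=32)
share = trainable / 7e9
# ~4.19M trainable parameters, roughly 0.06% of a 7B model
```

Doubling the rank doubles this count, which is why r in the 4-16 range keeps adapters in the low megabytes.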
LoRA Configuration (HuggingFace PEFT)
from peft import get_peft_model, LoraConfig
config = LoraConfig(
r=8,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, config)

RLHF & DPO
Aligning models with human preferences
Reinforcement Learning from Human Feedback (RLHF)
RLHF is a three-stage pipeline that aligns models with human preferences using reward signals and PPO optimization.
PPO Hyperparameters
- learning_rate: 1e-5 (smaller)
- batch_size: 4-8 (smaller)
- kl_penalty: 0.1-0.2
- clip_ratio: 0.2 (PPO clipping)
- epochs: 3-4 per batch
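The clip_ratio hyperparameter implements the clipped term of the PPO objective, min(r_t·Â_t, clip(r_t, 1-ε, 1+ε)·Â_t): once the policy ratio leaves the trust region, the update gets no additional credit. A scalar illustration (the ratios and advantages are made-up numbers):

```python
def ppo_clipped_term(ratio, advantage, eps=0.2):
    """min(r*A, clip(r, 1-eps, 1+eps)*A): caps how far a single
    update can move the policy away from the old policy."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A large policy ratio is capped at the clip boundary
capped = ppo_clipped_term(1.5, advantage=2.0)   # -> 2.4, not 3.0
inside = ppo_clipped_term(0.9, advantage=2.0)   # -> 1.8, inside the trust region
```

The full PPO loss is the negative expectation of this term plus the KL penalty against the reference model.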
DPO: Direct Preference Optimization
A simpler alternative that skips the reward model: optimize directly on preference pairs (chosen vs. rejected) without RL. Roughly 3x faster than RLHF, with comparable results.
- Single-stage training
- No sampling complexity
- Empirically stable
Models can exploit reward function or diverge from base model. Address with: (1) Reference model KL penalty, (2) Validation on held-out prompts, (3) Careful reward function design.
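The KL penalty named in mitigation (1) is KL(π_new ‖ π_ref) = Σ p·log(p/q) over the next-token distribution. A toy computation with made-up three-token distributions shows how the penalty grows as the tuned policy drifts from the reference:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; larger values mean the
    fine-tuned policy p has drifted further from the reference q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

ref_policy = [0.5, 0.3, 0.2]   # reference (base) next-token probabilities
new_policy = [0.7, 0.2, 0.1]   # policy after some RLHF updates
kl = kl_divergence(new_policy, ref_policy)
penalty = 0.1 * kl  # scaled by the kl_penalty coefficient from above
```

In RLHF this quantity is estimated per-token from sampled completions and subtracted from the reward, pulling the policy back toward the base model.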
Instruction Tuning
Aligning models to follow user instructions effectively
What is Instruction Tuning?
Instruction tuning is SFT on diverse task instruction-response pairs. Goal: teach model to interpret instructions and produce relevant outputs across many tasks, improving generalization to unseen instructions.
Key Characteristics
- Diverse task coverage (100-1000+ tasks)
- Explicit instruction format
- Quality output examples
- Enables zero-shot generalization
Popular Datasets
- Alpaca: 52K examples via text-davinci-003
- Flan: 1.3M examples, 146 tasks
- SuperNatural: 1000+ tasks
- Custom: domain-specific instructions
Example Instruction Data Format
# JSON format for instruction dataset
{
"instruction": "Translate the following text to French",
"input": "Hello, how are you?",
"output": "Bonjour, comment allez-vous?"
}
# With system prompt (for newer models)
{
"messages": [
{"role": "system", "content": "You are a helpful translator."},
{"role": "user", "content": "Translate to French: Hello"},
{"role": "assistant", "content": "Bonjour"}
]
}

Models learn to follow instruction patterns. High-quality, diverse instructions yield better generalization. Use 20-50 examples per task if custom-building datasets.
Data Strategies
Curation, augmentation, and quality control
Data Curation & Labeling
- Human experts: Domain specialists label high-quality examples
- Synthetic data: Use LLMs (e.g., GPT-4) to generate examples
- Cost: $5-50 per label depending on complexity
Data Augmentation
- Paraphrasing: Rephrase inputs/outputs
- Back-translation: Translate A→B→A
- Instruction variation: Multiple phrasings
- Multiplier: 2-5x dataset size
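Instruction variation is the cheapest of these multipliers: rewrap each example in several phrasing templates, growing the dataset without new labels. The templates and `augment` helper below are illustrative, not from any library:

```python
TEMPLATES = [
    "{instruction}",
    "Please {instruction_lower}",
    "Task: {instruction}\nRespond concisely.",
]

def augment(examples):
    """Expand each (instruction, output) pair with phrasing variants."""
    out = []
    for ex in examples:
        lowered = ex["instruction"][0].lower() + ex["instruction"][1:]
        for t in TEMPLATES:
            out.append({
                # str.format ignores unused keyword arguments
                "instruction": t.format(instruction=ex["instruction"],
                                        instruction_lower=lowered),
                "output": ex["output"],
            })
    return out

data = [{"instruction": "Summarize the report", "output": "..."}]
augmented = augment(data)  # 3x multiplier from 3 templates
```

Combining template variation with paraphrasing or back-translation is how the 2-5x multipliers above are typically reached.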
Data Quality Pipeline
# Data quality filtering with spaCy and Presidio
import spacy
from presidio_analyzer import AnalyzerEngine

nlp = spacy.load("en_core_web_sm")
analyzer = AnalyzerEngine()

def filter_pii(text):
    results = analyzer.analyze(text=text, language="en")
    return len(results) == 0

def filter_quality(text):
    # Length check: drop near-empty examples
    if len(text.split()) < 3:
        return False
    # PII check: drop examples containing personal data
    if not filter_pii(text):
        return False
    return True

Human-in-the-Loop (HITL)
Iterative refinement:
- Initial model predicts
- Human reviews/corrects
- Use corrections to retrain
- Converge to domain expertise
Continual Data Updates
Keep models fresh:
- Monthly/quarterly retraining
- Replay buffer (old + new data)
- Federated learning for privacy
Loss Functions
Objectives for training, alignment, and evaluation
| Loss Type | Formula / Description | Use Case |
|---|---|---|
| Cross-Entropy (SFT) | CE = -Σ_t log P(y_t | x, y_{<t}) | Supervised fine-tuning |
| PPO Objective | L = E[min(r_t·Â_t, clip(r_t, 1-ε, 1+ε)·Â_t)] - λKL | Preference alignment (RLHF) |
| DPO Loss | L = -log σ(β log(π(y_w)/π_ref(y_w)) - β log(π(y_l)/π_ref(y_l))) | Direct preference learning |
| Reward Model (Ranking) | L = -log σ(r(y_w) - r(y_l)) | Training reward models for RLHF |
| Multi-Task Loss | L = Σ_i w_i * L_i | Training on multiple tasks |
| KL Penalty | L_kl = KL(π_new || π_ref) | Prevent divergence from base |
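The reward-model ranking loss from the table, L = -log σ(r(y_w) - r(y_l)), can be checked numerically; the reward scores below are made up:

```python
import math

def ranking_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss: pushes r(chosen) above r(rejected).
    Equals -log sigmoid(margin)."""
    margin = reward_chosen - reward_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# Loss shrinks as the reward model separates the pair
well_separated = ranking_loss(2.0, -1.0)   # large margin -> small loss
barely_separated = ranking_loss(0.1, 0.0)  # near log(2) ~ 0.693
```

At zero margin the loss is exactly log 2, so anything below that means the reward model already prefers the chosen response.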
Cross-Entropy Loss
Standard for SFT. Penalizes incorrect token predictions. Works well for next-token prediction.
import torch

loss_fn = torch.nn.CrossEntropyLoss()
logits = model(input_ids).logits
loss = loss_fn(logits.view(-1, vocab_size), labels.view(-1))

DPO Loss Implementation
Directly optimize preference pairs without reward model.
def dpo_loss(logratio_w, logratio_l, beta=0.1):
    # logratio = log pi(y) - log pi_ref(y) for the chosen (w)
    # and rejected (l) responses; beta is the DPO temperature
    diff = beta * (logratio_w - logratio_l)
    # logsigmoid is numerically stabler than log(sigmoid(x))
    return -torch.nn.functional.logsigmoid(diff).mean()

Wrong loss can lead to reward hacking (RLHF), poor generalization (multi-task), or divergence (no KL). Match loss to objective and validate on held-out test sets.
Evaluation
Metrics, benchmarks, and validation strategies
Language Metrics
- Perplexity: Lower is better
- BLEU: Sequence similarity
- ROUGE: Coverage overlap
- F1: Classification tasks
Safety & Alignment
- SafeBench: Adversarial safety
- TruthfulQA: Factuality
- Human eval: Likert scales
- Calibration (ECE): Confidence
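Expected Calibration Error (ECE), the last metric above, bins predictions by confidence and takes the bin-size-weighted average of the |accuracy − mean confidence| gap. A minimal sketch with toy predictions and equal-width bins:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean gap between per-bin accuracy
    and per-bin average confidence."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece, total = 0.0, len(confidences)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece

# Perfectly calibrated toy case: 80% confidence, 80% accuracy
confs = [0.8] * 10
hits = [True] * 8 + [False] * 2
score = expected_calibration_error(confs, hits)  # ~0.0
```

RLHF-tuned models often become overconfident, which shows up directly as a rising ECE even when accuracy is flat.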
Production Metrics
- Latency: p50/p95/p99
- Throughput: tokens/sec
- Cost: $ per request
- Drift: Performance over time
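The p50/p95/p99 latency figures are order statistics over request timings; a minimal nearest-rank computation (the sample latencies are made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [120, 95, 480, 130, 110, 105, 900, 125, 140, 115]
p50 = percentile(latencies_ms, 50)   # median request -> 120
p95 = percentile(latencies_ms, 95)   # tail latency used for SLOs -> 900
```

Tail percentiles are dominated by the slowest requests, which is why SLOs target p95/p99 rather than the mean.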
Evaluation Suite Example
from datasets import load_dataset
from evaluate import load
# Load benchmarks
mmlu = load_dataset("cais/mmlu", "all")
truthful_qa = load_dataset("truthful_qa", "generation")
# Metrics
rouge = load("rouge")
bleu = load("bleu")
accuracy = load("accuracy")
# Validation loop
def evaluate_model(model, dataset):
    predictions, references = [], []
    for batch in dataset:
        predictions.extend(model.generate(batch["input"]))
        references.extend(batch["reference"])
    # ROUGE needs references alongside predictions
    return rouge.compute(predictions=predictions, references=references)

Use a held-out test set (10-20% of data). Measure task-specific metrics AND safety/alignment. Early-stop on validation loss or a custom metric to prevent overfitting.
Deployment & MLOps
Production infrastructure and CI/CD pipelines
Serving Options
- REST API: Simple HTTP endpoints
- gRPC: Low-latency, streaming
- Embedded: On-device (mobile/edge)
- Batch: Offline processing
Inference Frameworks
- Triton Inference: Multi-backend GPU
- Ray Serve: Distributed scaling
- BentoML: Model packager
- KServe: K8s-native serving
MLOps Pipeline (Continuous Training)
# Example: GitHub Actions + MLflow + Ray
name: Fine-Tuning CI/CD
on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly retraining
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - name: Fetch new data
        run: python fetch_data.py
      - name: Train with Ray Tune
        run: ray train --config lora_config.yaml
      - name: Evaluate
        run: python evaluate.py
      - name: Register to MLflow
        run: mlflow models register
      - name: A/B test canary
        run: kubectl apply -f canary.yaml

Model Registry
- MLflow: Version tracking
- HuggingFace Hub: Community sharing
- W&B: Experiment logging
- Metadata: metrics, params, tags
Monitoring & Alerts
- Latency p95 < 500ms
- Cost per request tracking
- Hallucination detection
- Reward signal drift
Cost Analysis & ROI
Complete training and inference cost breakdown with 2026 GPU pricing
Cloud GPU Pricing (March 2026)
| GPU | VRAM | On-Demand $/hr | Spot/Preemptible | Best For |
|---|---|---|---|---|
| NVIDIA A100 40GB | 40GB | $1.50-2.30 | $0.78-1.20 | LoRA fine-tuning 7-13B models |
| NVIDIA A100 80GB | 80GB | $2.00-3.00 | $1.00-1.80 | Full SFT 7B, QLoRA 70B |
| NVIDIA H100 SXM | 80GB | $2.40-4.00 | $1.50-2.50 | Fast training, RLHF, large-batch DPO |
| 4× A100 80GB | 320GB | $8.00-12.00 | $4.00-7.00 | Full SFT 70B, multi-GPU RLHF |
| 8× H100 SXM | 640GB | $20.00-32.00 | $12.00-20.00 | Full SFT 70B+, production RLHF pipelines |
Training Cost by Method & Model Size
| Method | Model Size | GPU Setup | GPU-Hours | Cloud Cost | Data Prep | Total |
|---|---|---|---|---|---|---|
| LoRA (r=8) | 7-8B | 1× A100 40GB | 4-8 hrs | $6-18 | $500-2K | $500-2K |
| QLoRA (4-bit) | 70B | 1× A100 80GB | 8-16 hrs | $16-48 | $1-3K | $1-3K |
| Full SFT | 7-8B | 2× A100 80GB | 20-50 hrs | $80-300 | $1-3K | $1-3.5K |
| Full SFT | 70B | 8× H100 | 50-150 hrs | $1-5K | $2-5K | $3-10K |
| DPO | 7-8B | 2× A100 80GB | 20-50 hrs | $80-300 | $2-5K | $2-5.5K |
| RLHF (PPO) | 7-8B | 4× A100 80GB | 200-500 hrs | $1.6-5K | $5-15K | $7-20K |
Fine-Tuning vs API: Break-Even Analysis
Scenario: 100K queries/day
API route (GPT-4o-mini): ~500 tokens/query × 100K = 50M tokens/day
Cost: $0.15/1M input + $0.60/1M output ≈ $37.50/day = $1,125/month
Fine-tuned Llama-3.3-8B (self-hosted):
1× A100 80GB at $2/hr = $1,440/month + one-time fine-tune $2K
Fine-tuned (quantized INT4):
1× A40 48GB at $0.80/hr = $576/month + one-time fine-tune $2K
Break-even: Quantized fine-tuned model pays back in ~4 months, then saves $549/month (49%)
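The 100K-queries/day break-even above can be reproduced arithmetically; the dollar figures are the scenario's own illustrative 2026 estimates:

```python
api_monthly = 1125.0       # GPT-4o-mini route: ~50M tokens/day
selfhost_monthly = 576.0   # 1x A40 at $0.80/hr, INT4-quantized model
finetune_once = 2000.0     # one-time fine-tuning cost

monthly_savings = api_monthly - selfhost_monthly    # $549/month
breakeven_months = finetune_once / monthly_savings  # ~3.6 months
savings_pct = monthly_savings / api_monthly * 100   # ~49%
```

The same three-line calculation applies to any scenario: one-time cost divided by monthly savings gives the payback period, after which the savings rate is permanent.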
Scenario: 1M queries/day
API route (GPT-4o-mini):
500M tokens/day ≈ $375/day = $11,250/month
Fine-tuned Llama-3.3-8B (vLLM, 4× A100):
$8/hr × 24 × 30 = $5,760/month + one-time $3K
Break-even: 18 days. Annual savings: $62,880 (46%). Plus: lower latency, data privacy, no vendor lock-in.
Cost Optimization Strategies
Training Savings
- • LoRA/QLoRA: 5-50x cheaper
- • Spot instances: 40-70% off
- • BF16/FP16: 2x memory efficiency
- • Gradient accum: Fewer GPUs
- • Flash Attention 2: 2-3x speedup
Inference Savings
- • INT4 quantization: 4x less VRAM
- • vLLM batching: 10-50x throughput
- • KV-cache: Reduce recomputation
- • Speculative decoding: 2-3x faster
- • Response caching: Skip repeated
Data Savings
- • Synthetic data: 10x cheaper than human
- • Active learning: Label only hard cases
- • Curriculum: Fewer epochs needed
- • DPO vs RLHF: 40-75% cheaper
- • Distill then tune: Smaller base
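Response caching from the Inference Savings list can be as simple as keying on a normalized prompt; a toy in-memory sketch, where `generate` is a stand-in for the real model call:

```python
cache = {}

def generate(prompt):
    """Stand-in for an expensive model call."""
    return f"response-to:{prompt}"

def cached_generate(prompt):
    # Normalize whitespace and case so trivially different
    # phrasings hit the same cache entry
    key = " ".join(prompt.lower().split())
    if key not in cache:
        cache[key] = generate(prompt)  # only pay for novel prompts
    return cache[key]

first = cached_generate("What is LoRA?")
second = cached_generate("what is  LoRA?")  # normalized cache hit
```

Production systems typically add a TTL and an LRU eviction policy, and for paraphrase-level hits use embedding similarity instead of exact keys.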
<10K queries/day: Use API (GPT-4o-mini or Claude Haiku at $0.25/1M). 10K-100K/day: Fine-tune with LoRA + quantize, self-host on 1 GPU. 100K-1M/day: Full SFT or DPO + vLLM on 2-4 GPUs — saves 40-60% vs API. >1M/day: Fine-tune + quantize + dedicated cluster — saves 60-80% vs API, payback in 2-4 weeks.
Safety & Alignment
Ensuring models behave safely and responsibly
Safety Risks
- Overconfidence after RLHF
- Reward hacking (gaming objective)
- Capability regression
- Adversarial jailbreaks
- Bias amplification
Mitigation Strategies
- Constitutional AI: Encode safety rules
- Red-teaming: Adversarial testing
- Calibration: Uncertainty estimates
- KL penalties: Limit divergence
Constitutional AI: Principle-Based Alignment
# Constitutional AI: Define safety principles
principles = [
"Be helpful and harmless",
"Don't provide medical advice",
"Refuse illegal requests",
"Acknowledge uncertainty"
]
# Red team: Critique unsafe responses
critique_prompt = """
Identify any harmful or unsafe content in the response.
Rate on scale 1-5 where 5 is most harmful.
"""
# Revise: Model improves response based on critique
revision_prompt = """
Provide an improved response that adheres to our principles.
"""Use benchmarks like AdvBench (100 jailbreak attempts) and OWASP Top 10 for GenAI. Test for: prompt injection, model extraction, bias amplification, hallucinations.
Privacy & Compliance
Data governance, regulation, and secure training
PII Handling
- Detection: spaCy NER, Presidio
- Redaction: Mask PII tokens
- Audit trails: Log data access
- Retention: Auto-delete policies
Privacy Techniques
- Differential Privacy: Opacus, TensorFlow Privacy
- Federated Learning: Train on-device
- Secure Enclaves: Hardware isolation
- Data minimization: Collect only needed
Compliance Requirements
GDPR
Right to erasure, consent, data portability
HIPAA
Healthcare: encryption, audit logs, BAAs
Model Licensing
Commercial use, CC, proprietary restrictions
Plan privacy early: data minimization, consent management, retention policies. Use privacy-preserving techniques (DP, federated) for sensitive domains.
Failure Modes
Common pitfalls and mitigation strategies
Catastrophic Forgetting
Base model capabilities degrade during fine-tuning on specific data.
Fix: Replay buffer mixing base data (10-20%), lower LR, early stopping
Overfitting to Feedback
Model exploits reward function instead of learning genuine behavior.
Fix: KL penalty, diverse reward signals, hold-out test set
Bias & Toxicity Regression
Fine-tuning amplifies harmful biases or toxicity in specific domains.
Fix: Balanced datasets, debiasing, red-teaming, fairness metrics
PPO Instability
PPO training diverges or oscillates in loss/reward.
Fix: Smaller LR (1e-5), fewer PPO epochs, monitor KL divergence
Safety Regressions
Model becomes less safe or generates harmful content after tuning.
Fix: Safety-balanced data, adversarial eval, constitutional AI
Pipeline Failures
Data preprocessing, training, or serving breaks silently.
Fix: Unit tests, integration tests, monitoring, alerting
Implementation Roadmap
5-phase plan from planning to production
Phase 1: Planning & Data Prep
Define objectives, audit/cleanse data, set up MLOps.
- Document fine-tuning goals and success metrics
- Audit dataset for PII, quality, diversity
- Set up MLflow, W&B, or similar tracking
- Plan compute budget and timeline
Phase 2: Prototype & Baseline
SFT baseline, evaluate, trial LoRA, measure trade-offs.
- Train small SFT baseline (7B-13B model)
- Evaluate on task-specific metrics + safety
- Trial LoRA with same data, compare speed/quality
- Document results and decide method
Phase 3: Alignment & Safety
Human feedback, reward model, PPO/DPO, red-teaming.
- Collect human preference data (5K-20K pairs)
- Train reward model if using RLHF
- Run PPO/DPO training with KL penalties
- Red-team for jailbreaks, adversarial examples
Phase 4: Optimization & Testing
Quantization, end-to-end evaluation, A/B testing.
- Apply quantization (INT8, GPTQ) for faster inference
- End-to-end latency and throughput testing
- Canary deployment (5-10% traffic)
- A/B test vs. baseline model, measure win rate
Phase 5: Production Deployment
Full production, monitoring, continuous retraining.
- Deploy to production with versioning
- Monitor latency, cost, accuracy, drift
- Set up continuous retraining on new data
- Plan quarterly reviews and updates
Tools & References
Essential libraries, frameworks, and resources
Training Libraries
- HuggingFace Transformers: Base models, Trainer API
- PEFT: LoRA, Adapters, Prefix tuning
- Accelerate: Distributed training, quantization
- DeepSpeed: ZeRO, optimization
- Ray Tune: Hyperparameter tuning
Serving & Deployment
- BentoML: Model packaging, REST API
- MLflow: Model registry, serving
- Ray Serve: Distributed inference
- KServe: Kubernetes-native
- TensorRT-LLM: NVIDIA optimization
Monitoring & Evaluation
- Weights & Biases: Experiment logging
- TensorBoard: Metrics visualization
- Hugging Face Evaluate: Benchmarks
- OpenCompass: LLM leaderboards
- HELM: Safety evaluation
Data & Infrastructure
- HuggingFace Datasets: Data loading
- Presidio: PII detection
- spaCy: NLP utilities
- Kubernetes: Container orchestration
- Airflow: Workflow management
Key Papers & Resources
Foundational
- LoRA: Low-Rank Adaptation (Hu et al.)
- Instruction Tuning with FLAN (Wei et al.)
- Training language models to follow instructions (InstructGPT)
Advanced
- Direct Preference Optimization (Rafailov et al.)
- Constitutional AI (Bai et al.)
- Scaling Laws & Chinchilla (DeepMind)
SFT: Transformers Trainer + PEFT (LoRA) + W&B. Serving: BentoML or Ray Serve. Monitoring: W&B + custom dashboards. This covers 80% of use cases with minimal overhead.
HuggingFace Base & SFT Models
Foundation Models for Fine-Tuning Across Use Cases
Large Models (13B+) — Maximum Quality
| Model | Params | Context | MMLU | License | Best For |
|---|---|---|---|---|---|
| meta-llama/Llama-3.1-70B | 70B | 128K | ~86 | Llama 3.1 | Best open teacher; general SFT for all domains |
| Qwen/Qwen3-32B | 32B | 128K | ~83 | Apache 2.0 | Multilingual SFT; strong reasoning; permissive license |
| Qwen/Qwen3-30B-A3B | 30B (3B active) | 262K | ~82 | Apache 2.0 | MoE: 30B quality at 3B inference cost; RAG generation |
| mistralai/Mistral-Large-2 | 123B | 128K | ~84 | Mistral | Code + reasoning; enterprise applications |
| microsoft/Phi-4 | 14B | 16K | ~78 | MIT | High quality per parameter; reasoning-focused |
Small Models (1B-8B) — Cost-Efficient Fine-Tuning
| Model | Params | MMLU | HumanEval | LoRA VRAM | Best For |
|---|---|---|---|---|---|
| meta-llama/Llama-3.3-8B | 8B | 73.0 | 72.6 | ~16GB | Best all-around 8B; recommended starting point |
| Qwen/Qwen3-8B | 8B | ~72 | ~75 | ~16GB | Best code generation at 8B; strong multilingual |
| HuggingFaceTB/SmolLM3-3B | 3B | ~67 | ~58 | ~8GB | Best 3B model; full training blueprint published |
| microsoft/Phi-4-mini-instruct | 3.8B | ~70 | ~66 | ~10GB | Edge deployment; reasoning-heavy tasks |
| google/gemma-2-9b | 9B | ~71 | ~64 | ~18GB | Google ecosystem; good instruction following |
| mistralai/Mistral-7B-v0.3 | 7B | ~63 | ~40 | ~14GB | Sliding window attention; fast inference |
| Qwen/Qwen3-1.7B | 1.7B | ~55 | ~35 | ~4GB | Ultra-lightweight; mobile/IoT fine-tuning |
# Quick-start: LoRA fine-tuning with any base model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig
# Choose your base model
model_name = "meta-llama/Llama-3.3-8B" # or Qwen3-8B, SmolLM3-3B, etc.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Apply LoRA
lora_config = LoraConfig(
r=8, lora_alpha=16,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05, task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints the count: ~0.2% of total params
Budget GPU (16GB): Llama-3.3-8B or Qwen3-8B with QLoRA. Minimal GPU (8GB): SmolLM3-3B or Phi-4-mini with QLoRA 4-bit. Maximum quality: Llama-3.1-70B with QLoRA on 2×A100. Multilingual: Qwen3 family (all sizes). Edge/mobile: Qwen3-1.7B or Qwen3-0.6B.
HuggingFace RAG & Embedding Models
Models for Retrieval-Augmented Fine-Tuning
Embedding Models (Fine-Tunable)
| Model | Params | Dims | MTEB | Fine-Tune Use Case |
|---|---|---|---|---|
| Qwen/Qwen3-Embedding-8B | 8B | 32-4096 | 70.58 (#1) | Domain-specific retrieval with custom dims |
| BAAI/bge-m3 | 568M | 1024 | ~66 | Multilingual RAG; dense + sparse + multi-vector |
| jinaai/jina-embeddings-v3 | 570M | 1024 | ~65 | Multi-task fine-tuning (retrieval + classification) |
| BAAI/bge-base-en-v1.5 | 109M | 768 | ~63 | English-only domain adaptation; fast fine-tuning |
| sentence-transformers/all-MiniLM-L6-v2 | 22M | 384 | ~56 | Ultra-fast; fine-tune for domain similarity tasks |
| sentence-transformers/all-mpnet-base-v2 | 109M | 768 | ~60 | Best sentence-transformer; STS fine-tuning |
Reranker Models (Fine-Tunable)
| Model | Params | Context | Fine-Tune Use Case |
|---|---|---|---|
| mixedbread-ai/mxbai-rerank-large-v2 | 1.5B | 8K | Domain reranking; 100+ languages; RL-trained baseline |
| BAAI/bge-reranker-v2-m3 | 568M | 8K | Multilingual reranking fine-tuning |
| BAAI/bge-reranker-base | 278M | 512 | Lightweight domain reranker; fast fine-tuning |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | 22M | 512 | Ultra-fast reranker; MS-MARCO pre-trained |
| colbert-ir/colbertv2.0 | 110M | 512 | Late-interaction retrieval; use with RAGatouille |
# Fine-tune embedding model for domain-specific retrieval
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# Load pre-trained embedding model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
# Domain-specific training pairs
train_examples = [
InputExample(texts=["patient symptoms", "clinical presentation"], label=0.9),
InputExample(texts=["patient symptoms", "stock market"], label=0.1),
]
# Contrastive loss for retrieval
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
# Fine-tune
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=3,
warmup_steps=100,
output_path="./domain-embeddings"
)
Embedding: Fine-tune bge-base-en-v1.5 on domain pairs with contrastive loss for 3-5 epochs. Reranker: Fine-tune bge-reranker-base with domain query-document relevance scores. Generator: SFT Llama-3.3-8B on domain QA pairs with retrieved context. This 3-stage approach maximizes end-to-end RAG quality.
Alignment & RLHF Models
Models & Tools for DPO, PPO, and Instruction Tuning
Base Models for RLHF / DPO
| Model | Params | Pre-Aligned? | DPO-Ready? | Best For |
|---|---|---|---|---|
| meta-llama/Llama-3.3-8B-Instruct | 8B | Yes (SFT + RLHF) | Yes | Further DPO alignment for specific domains |
| Qwen/Qwen3-8B-Instruct | 8B | Yes (SFT + RLHF) | Yes | Multilingual alignment; strong code + reasoning |
| meta-llama/Llama-3.3-8B | 8B | No (base) | After SFT | Full RLHF pipeline: SFT → RM → PPO |
| HuggingFaceTB/SmolLM3-3B | 3B | Yes (instruct) | Yes | Lightweight DPO; resource-constrained alignment |
| google/gemma-2-9b-it | 9B | Yes (IT) | Yes | Safety-focused alignment fine-tuning |
Reward Models
| Model | Base | Use Case |
|---|---|---|
| OpenAssistant/reward-model-deberta-v3-large-v2 | DeBERTa-v3-large | General preference scoring; lightweight RM |
| Nexusflow/Starling-RM-34B | Yi-34B | High-quality reward model; strong correlation with human prefs |
| allenai/tulu-v2.5-13b-uf-rm | Llama-2-13B | UltraFeedback-trained; open RLHF pipeline |
Preference Datasets
For DPO Training
- • HuggingFaceH4/ultrafeedback_binarized — 64K preferences; most popular DPO dataset
- • argilla/dpo-mix-7k — 7K high-quality curated preferences
- • Intel/orca_dpo_pairs — Orca-style DPO training data
For Instruction Tuning
- • tatsu-lab/alpaca — 52K GPT-generated instructions; classic dataset
- • HuggingFaceH4/no_robots — 10K human-written instructions
- • Open-Orca/OpenOrca — 4M+ multi-task instruction pairs
# DPO alignment with TRL
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.3-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
# Load preference dataset
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train")
# DPO training config
dpo_config = DPOConfig(
output_dir="./dpo-llama",
num_train_epochs=1,
per_device_train_batch_size=4,
learning_rate=5e-7,
beta=0.1, # DPO temperature
warmup_ratio=0.1,
logging_steps=10,
gradient_accumulation_steps=4,
)
# Train with DPO
trainer = DPOTrainer(
model=model,
args=dpo_config,
train_dataset=dataset,
processing_class=tokenizer,
)
trainer.train()
DPO: 40-75% cheaper, more stable training, simpler implementation. RLHF (PPO): 8% unsafe outputs vs DPO's 10% in adversarial tests; better structured reasoning and OOD generalization. Start with DPO; switch to RLHF if safety requirements are strict.
Step 1: SFT on instruction dataset (Alpaca/OpenOrca). Step 2: DPO with UltraFeedback preferences. Step 3: Red-team evaluation. Step 4: Optional PPO with custom reward model for safety-critical domains.
Glossary of Fine-Tuning Terms
25 key technical terms used throughout this guide, organized alphabetically.
A
| Term | Definition |
|---|---|
| AdamW | The standard optimizer for LLM training: Adam with decoupled weight decay. Typical learning rates for fine-tuning: 1e-5 to 5e-5 for large models, slightly higher for PEFT methods. |
| Adapter | A small trainable module inserted into frozen Transformer layers. Consists of a down-projection, nonlinearity, and up-projection. Enables task-specific tuning with minimal parameter overhead. |
| Alpaca | Stanford's instruction-tuning dataset created by generating 52K instruction-response pairs from text-davinci-003. A landmark example of synthetic data for fine-tuning. |
C
| Term | Definition |
|---|---|
| Constitutional AI | Anthropic's alignment technique where the model self-critiques outputs against a set of written principles (a "constitution"), reducing reliance on human annotators for safety training. |
| Curriculum Learning | A training strategy that presents examples in increasing difficulty order (easy→hard), improving convergence and sample efficiency for fine-tuning. |
D
| Term | Definition |
|---|---|
| DPO (Direct Preference Optimization) | An alignment method that directly optimizes from preference data using a binary classification loss. 40-75% cheaper than RLHF while achieving comparable alignment quality. |
E
| Term | Definition |
|---|---|
| Epoch | One complete pass through the entire training dataset. Fine-tuning typically uses 2-4 epochs. More epochs risk overfitting, especially with small datasets. |
F
| Term | Definition |
|---|---|
| Federated Learning | Distributed training where data stays on-device and only model updates are shared. Enables privacy-preserving fine-tuning across multiple data owners without centralizing sensitive data. |
| Full Fine-Tuning (SFT) | Updating all model parameters on task-specific data. Achieves maximum quality but requires significant GPU memory and compute. 100% of parameters are trainable. |
G
| Term | Definition |
|---|---|
| Gradient Accumulation | Simulating larger batch sizes by accumulating gradients over multiple forward passes before updating weights. Enables training with large effective batches on limited GPU memory. |
| Gradient Checkpointing | Trading compute for memory by recomputing intermediate activations during backpropagation instead of storing them. Reduces memory usage by ~60% at the cost of ~30% slower training. |
H
| Term | Definition |
|---|---|
| Human-in-the-Loop (HITL) | An iterative workflow where model outputs are reviewed and corrected by humans, with corrections fed back as training data. Used in both SFT data creation and RLHF preference labeling. |
I
| Term | Definition |
|---|---|
| Instruction Tuning | Fine-tuning an LLM on (instruction, response) pairs to improve instruction-following ability. The first alignment step after pre-training. Examples: Alpaca, FLAN, OpenOrca datasets. |
L
| Term | Definition |
|---|---|
| LoRA (Low-Rank Adaptation) | Adds trainable low-rank matrices (rank r, typically 4-16) to frozen attention weights: W' = W + (α/r)(B×A). Trains 0.1-1% of parameters. Can be merged into base weights at inference for zero overhead. |
M
| Term | Definition |
|---|---|
| Mixed Precision (BF16/FP16) | Training with half-precision floating point to reduce memory by 50% and speed up computation. BF16 is preferred for training stability. Requires loss scaling for FP16. |
P
| Term | Definition |
|---|---|
| PEFT (Parameter-Efficient Fine-Tuning) | A family of methods (LoRA, adapters, prefix tuning, prompt tuning) that fine-tune only a small fraction (<1%) of model parameters while keeping the rest frozen. |
| PPO (Proximal Policy Optimization) | The RL algorithm used in RLHF to update the LLM policy. Maximizes expected reward from a reward model while constraining updates with a KL penalty against the base model. |
| Prefix Tuning | A PEFT method that prepends trainable continuous vectors ("prefixes") to the keys and values in each attention layer. ~0.01% trainable parameters. |
| Prompt Tuning | The simplest PEFT method: prepending learnable soft tokens to the input. <0.01% trainable parameters. No architectural changes needed. Works best with very large models. |
Q
| Term | Definition |
|---|---|
| QLoRA | Quantized LoRA — loads the base model in 4-bit precision (NF4) and adds LoRA adapters on top. Enables fine-tuning a 70B model on a single 48GB GPU. |
R
| Term | Definition |
|---|---|
| Reward Model | A model trained to predict human preference scores for LLM outputs. Used in RLHF to provide reward signals for PPO training. Typically trained on pairwise preference data. |
| RLHF | Reinforcement Learning from Human Feedback — a 3-stage alignment pipeline: (1) SFT on demonstrations, (2) train reward model on preferences, (3) optimize LLM with PPO against the reward model. |
S
| Term | Definition |
|---|---|
| Synthetic Data | Training data generated by LLMs rather than human annotators. 10× cheaper than human data. Common approach: use a strong teacher model to generate instruction-response pairs for student fine-tuning. |
T
| Term | Definition |
|---|---|
| TRL | Transformer Reinforcement Learning — HuggingFace's library for RLHF and DPO training. Provides SFTTrainer, DPOTrainer, PPOTrainer, and RewardTrainer. |
W
| Term | Definition |
|---|---|
| Weight Decay | A regularization technique that adds a penalty proportional to parameter magnitude to the loss. Prevents overfitting during fine-tuning. Typical values: 0.01-0.1. |