LLM Fine-Tuning

Methods, Alignment & Production Deployment

  • LoRA: up to 5x training speedup
  • Full SFT: 100% parameter control
  • RLHF: state-of-the-art alignment

Overview

Adapting large language models for specific domains and behaviors

What is Fine-Tuning?

Fine-tuning is the process of updating LLM parameters on task-specific or domain-specific data to adapt pre-trained models for specialized use cases. Unlike prompt engineering, fine-tuning modifies model weights to learn new patterns, vocabulary, and behaviors.

Key themes: domain adaptation, instruction following, preference learning, personalization.

When to Fine-Tune

  • Specialized domain (finance, legal, medical)
  • Custom instruction format or style
  • Preference alignment with human feedback
  • Performance on benchmark tasks
  • Cost reduction via smaller models

Key Use Cases

  • Healthcare: clinical documentation
  • Finance: risk analysis, compliance
  • Legal: contract review, due diligence
  • Customer support: brand voice
  • Chatbots: user interaction patterns

Methods Comparison

Trade-offs between different fine-tuning approaches

Method | Trainable Params | Compute Cost | Inference Impact | Use Case
Full SFT | 100% | Very High | None | Large resource budget
LoRA | 0.1–1% | Low | Minimal | Fast iteration, multiple tasks
Adapters | 0.1–1% per layer | Low | Minimal | Modular tuning
Prefix Tuning | 0.01% | Very Low | Slightly slower | Rapid prototyping
Prompt Tuning | <0.01% | Very Low | None | Minimal intervention
RLHF/PPO | 100% (usually) | Very High | None | Alignment with preferences
DPO | 100% | Moderate | None | Preference learning (simplified)

Full SFT

Supervised Fine-Tuning with comprehensive parameter updates

Supervised Fine-Tuning (SFT)

Full SFT updates all model parameters via cross-entropy loss on input-output pairs. It's the foundation for all fine-tuning and enables the largest capability improvements.

Advantages

  • Maximum performance gains
  • No inference overhead
  • Full architectural flexibility
  • Proven at scale (ChatGPT)

Challenges

  • High GPU memory (80GB+)
  • Long training time (days)
  • Catastrophic forgetting risk
  • Expensive at scale

Training Configuration

# SFT configuration example
model_name: "falcon-7b"
learning_rate: 2e-5
batch_size: 16
max_epochs: 3
warmup_steps: 500
max_seq_length: 2048
gradient_accumulation: 4
dtype: "bfloat16"
Catastrophic Forgetting

SFT can degrade base model capabilities. Mitigate with: (1) Replay buffers mixing base data, (2) Lower learning rates, (3) Early stopping via validation set.
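In code, the replay-buffer mitigation amounts to blending a slice of base-distribution data into the fine-tuning set; `mix_with_replay` below is an illustrative helper, not a library function.

```python
import random

def mix_with_replay(task_examples, base_examples, replay_frac=0.15, seed=0):
    """Mix fine-tuning data with a fraction of base-distribution data
    to reduce catastrophic forgetting (10-20% replay is typical)."""
    rng = random.Random(seed)
    # Number of replay examples so they make up replay_frac of the final mix
    n_replay = int(len(task_examples) * replay_frac / (1 - replay_frac))
    replay = rng.sample(base_examples, min(n_replay, len(base_examples)))
    mixed = task_examples + replay
    rng.shuffle(mixed)
    return mixed
```

The same helper works for continual retraining: pass the previous training set as `base_examples` to keep older behavior anchored.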

LoRA & PEFT

Parameter-Efficient Fine-Tuning with low-rank adapters

Low-Rank Adaptation (LoRA)

Instead of updating all W parameters, LoRA adds trainable low-rank decomposition matrices (A, B) to attention and feed-forward layers. Key insight: fine-tuning updates are inherently low-rank.

Key traits: parameter-efficient, fast iteration, multi-task.

LoRA mechanism: W' = W + AB^T. The original weight W ∈ R^(d×k) (e.g., d = k = 4096) stays frozen; only the low-rank factors A and B are trained (r = 8 adds ~64K parameters per matrix pair). After training, AB^T can be merged into W, so inference cost is unchanged.

LoRA Hyperparameters

  • rank (r): 4-16 typical, 8 recommended
  • alpha (α): scaling factor, usually α = 2r
  • dropout: 0.05-0.1 for regularization
  • target_modules: q_proj and v_proj for efficiency
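For intuition on adapter size, the parameters added by one LoRA pair are r·(d + k); `lora_param_count` is a hypothetical helper for this arithmetic.

```python
def lora_param_count(d, k, r):
    """Parameters added by one LoRA decomposition:
    A is d x r and B is k x r, so r * (d + k) in total."""
    return r * (d + k)

# A 4096x4096 projection at rank 8 adds 65,536 params (~64K),
# versus 16.7M parameters for the full matrix.
```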

Benefits over Full SFT

  • 5-10x faster training
  • 10-20x smaller adapters
  • No inference cost when merged
  • Mix multiple task adapters

LoRA Configuration (HuggingFace PEFT)

from peft import get_peft_model, LoraConfig

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)

RLHF & DPO

Aligning models with human preferences

Reinforcement Learning from Human Feedback (RLHF)

RLHF is a three-stage pipeline that aligns models with human preferences using reward signals and PPO optimization.

Stage 1 (SFT): Fine-tune on curated instruction-response pairs (100K-1M examples).
Stage 2 (Reward model): Train a ranking model on "which response is better?" comparisons (50K-100K pairs).
Stage 3 (PPO): Optimize the SFT policy against the reward signal, with a KL penalty to prevent drift.

Typical timeline: Stage 1 (SFT) 1-2 days on 8× A100; Stage 2 (reward) 1 day on 8× A100; Stage 3 (PPO) 3-5 days on 8× A100.

RLHF cost multiplier: ~10x the cost of SFT alone, largely due to sampling during PPO.

PPO Hyperparameters

  • learning_rate: 1e-5 (smaller)
  • batch_size: 4-8 (smaller)
  • kl_penalty: 0.1-0.2
  • clip_ratio: 0.2 (PPO clipping)
  • epochs: 3-4 per batch
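As a sketch of what `clip_ratio` controls, the PPO clipped surrogate objective can be written in a few lines of PyTorch; this is illustrative, not TRL's implementation.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_ratio=0.2):
    """PPO clipped surrogate: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)],
    where r is the probability ratio between new and old policies."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

The clamp caps how far a single update can push the policy; in RLHF this term is combined with the KL penalty against the base model.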

DPO: Direct Preference Optimization

Simpler alternative that skips reward model. Directly optimize on preference pairs (chosen vs. rejected) without RL. ~3x faster than RLHF, comparable results.

  • Single-stage training
  • No sampling complexity
  • Empirically stable
Reward Hacking & KL Divergence

Models can exploit reward function or diverge from base model. Address with: (1) Reference model KL penalty, (2) Validation on held-out prompts, (3) Careful reward function design.

Instruction Tuning

Aligning models to follow user instructions effectively

What is Instruction Tuning?

Instruction tuning is SFT on diverse task instruction-response pairs. Goal: teach model to interpret instructions and produce relevant outputs across many tasks, improving generalization to unseen instructions.

Key Characteristics

  • Diverse task coverage (100-1000+ tasks)
  • Explicit instruction format
  • Quality output examples
  • Enables zero-shot generalization

Popular Datasets

  • Alpaca: 52K examples via text-davinci-003
  • Flan: 1.3M examples, 146 tasks
  • SuperNatural: 1000+ tasks
  • Custom: domain-specific instructions

Example Instruction Data Format

# JSON format for an instruction dataset
{
  "instruction": "Translate the following text to French",
  "input": "Hello, how are you?",
  "output": "Bonjour, comment allez-vous?"
}

# Chat-messages format (for newer models)
{
  "messages": [
    {"role": "system", "content": "You are a helpful translator."},
    {"role": "user", "content": "Translate to French: Hello"},
    {"role": "assistant", "content": "Bonjour"}
  ]
}
Instruction Quality Matters

Models learn to follow instruction patterns. High-quality, diverse instructions yield better generalization. Use 20-50 examples per task if custom-building datasets.

Data Strategies

Curation, augmentation, and quality control

Data Curation & Labeling

  • Human experts: Domain specialists label high-quality examples
  • Synthetic data: Use LLMs (e.g., GPT-4) to generate examples
  • Cost: $5-50 per label depending on complexity

Data Augmentation

  • Paraphrasing: Rephrase inputs/outputs
  • Back-translation: Translate A→B→A
  • Instruction variation: Multiple phrasings
  • Multiplier: 2-5x dataset size
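Instruction variation can be as simple as applying a handful of templates to each example; the templates below are illustrative, not a standard set.

```python
def vary_instruction(instruction, task_input):
    """Generate several phrasings of one instruction-input pair,
    giving the 2-5x dataset multiplier mentioned above."""
    instr_lower = instruction[0].lower() + instruction[1:]
    templates = [
        "{instr}: {inp}",
        "Please {instr_lower} the following. {inp}",
        "Task: {instr}\nInput: {inp}",
    ]
    return [
        t.format(instr=instruction, instr_lower=instr_lower, inp=task_input)
        for t in templates
    ]
```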

Data Quality Pipeline

# Data quality filtering with Presidio (PII detection) and simple heuristics
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

def has_pii(text):
    results = analyzer.analyze(text=text, language="en")
    return len(results) > 0

def filter_quality(text):
    # Length check: drop near-empty examples
    if len(text.split()) < 3:
        return False
    # PII check: drop examples containing detected PII
    if has_pii(text):
        return False
    return True

Human-in-the-Loop (HITL)

Iterative refinement:

  1. Initial model predicts
  2. Human reviews/corrects
  3. Use corrections to retrain
  4. Converge to domain expertise

Continual Data Updates

Keep models fresh:

  • Monthly/quarterly retraining
  • Replay buffer (old + new data)
  • Federated learning for privacy

Loss Functions

Objectives for training, alignment, and evaluation

Loss Type | Formula / Description | Use Case
Cross-Entropy (SFT) | CE = -Σ_t log P(y_t | x, y_<t) | Supervised fine-tuning
PPO Objective | L = E[min(r_t·Â_t, clip(r_t, 1-ε, 1+ε)·Â_t)] - λ·KL | Preference alignment (RLHF)
DPO Loss | L = -log σ(β log(π(y_w)/π_ref(y_w)) - β log(π(y_l)/π_ref(y_l))) | Direct preference learning
Reward Model (Ranking) | L = -log σ(r(y_w) - r(y_l)) | Training reward models for RLHF
Multi-Task Loss | L = Σ_i w_i · L_i | Training on multiple tasks
KL Penalty | L_KL = KL(π_new || π_ref) | Prevent divergence from base
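The ranking and KL rows above translate directly to PyTorch; a minimal sketch with our own helper names:

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen, r_rejected):
    """Pairwise ranking loss for reward models: -log sigmoid(r_w - r_l)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def approx_kl(logp_new, logp_ref):
    """Sample-based approximation of KL(pi_new || pi_ref), computed on
    token log-probs of sequences sampled from pi_new."""
    return (logp_new - logp_ref).mean()
```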

Cross-Entropy Loss

Standard for SFT. Penalizes incorrect token predictions. Works well for next-token prediction.

import torch

loss_fn = torch.nn.CrossEntropyLoss()
logits = model(input_ids).logits              # (batch, seq_len, vocab_size)
vocab_size = logits.size(-1)

# Shift so that position t predicts token t+1
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()
loss = loss_fn(shift_logits.view(-1, vocab_size), shift_labels.view(-1))

DPO Loss Implementation

Directly optimize preference pairs without reward model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_logps_w, policy_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    # Implicit rewards are beta-scaled log-ratios against the frozen reference
    logits = beta * ((policy_logps_w - ref_logps_w) - (policy_logps_l - ref_logps_l))
    return -F.logsigmoid(logits).mean()
Loss Function Selection Matters

Wrong loss can lead to reward hacking (RLHF), poor generalization (multi-task), or divergence (no KL). Match loss to objective and validate on held-out test sets.

Evaluation

Metrics, benchmarks, and validation strategies

Language Metrics

  • Perplexity: Lower is better
  • BLEU: Sequence similarity
  • ROUGE: Coverage overlap
  • F1: Classification tasks

Safety & Alignment

  • SafeBench: Adversarial safety
  • TruthfulQA: Factuality
  • Human eval: Likert scales
  • Calibration (ECE): Confidence
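Expected calibration error (ECE) from the list above can be computed with a simple binning procedure; a self-contained sketch:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then take the weighted
    average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - acc)
    return ece
```

A well-calibrated model scores near 0; RLHF-tuned models often score worse than their base models, which is why ECE belongs in the eval suite.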

Production Metrics

  • Latency: p50/p95/p99
  • Throughput: tokens/sec
  • Cost: $ per request
  • Drift: Performance over time

Evaluation Suite Example

from datasets import load_dataset
from evaluate import load

# Load benchmarks
mmlu = load_dataset("cais/mmlu", "all")
truthful_qa = load_dataset("truthful_qa", "generation")

# Metrics
rouge = load("rouge")
bleu = load("bleu")
accuracy = load("accuracy")

# Validation loop (field names depend on your dataset schema)
def evaluate_model(model, dataset):
    scores = []
    for batch in dataset:
        preds = model.generate(batch["input"])
        scores.append(rouge.compute(predictions=preds, references=batch["output"]))
    return scores
Validation Strategy

Use hold-out test set (10-20% of data). Measure on task-specific metrics AND safety/alignment. Early stopping on validation loss or custom metric to prevent overfitting.
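Early stopping on validation loss, as recommended above, in minimal form (`EarlyStopping` is our own sketch, not a framework class):

```python
class EarlyStopping:
    """Stop training once validation loss has not improved
    for `patience` consecutive evaluations."""
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss):
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

Call `step()` after each validation pass and break the training loop when it returns True.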

Deployment & MLOps

Production infrastructure and CI/CD pipelines

Serving Options

  • REST API: Simple HTTP endpoints
  • gRPC: Low-latency, streaming
  • Embedded: On-device (mobile/edge)
  • Batch: Offline processing

Inference Frameworks

  • Triton Inference: Multi-backend GPU
  • Ray Serve: Distributed scaling
  • BentoML: Model packager
  • KServe: K8s-native serving

MLOps Pipeline (Continuous Training)

# Example: GitHub Actions + MLflow + Ray
name: Fine-Tuning CI/CD
on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly retraining
jobs:
  train:
    steps:
      - name: Fetch new data
        run: python fetch_data.py
      - name: Train with Ray Tune
        run: ray train --config lora_config.yaml
      - name: Evaluate
        run: python evaluate.py
      - name: Register to MLflow
        run: mlflow models register
      - name: A/B test canary
        run: kubectl apply -f canary.yaml

Model Registry

  • MLflow: Version tracking
  • HuggingFace Hub: Community sharing
  • W&B: Experiment logging
  • Metadata: metrics, params, tags

Monitoring & Alerts

  • Latency p95 < 500ms
  • Cost per request tracking
  • Hallucination detection
  • Reward signal drift

Cost Analysis & ROI

Complete training and inference cost breakdown with 2026 GPU pricing

  • A100/H100 GPU: $1.50-3.00 per hour
  • LoRA cost reduction: 5-50x
  • API cost: $0.05-5.00 per 1M tokens
  • Typical break-even: 1-4 weeks

Cloud GPU Pricing (March 2026)

GPU | VRAM | On-Demand $/hr | Spot/Preemptible | Best For
NVIDIA A100 40GB | 40GB | $1.50-2.30 | $0.78-1.20 | LoRA fine-tuning 7-13B models
NVIDIA A100 80GB | 80GB | $2.00-3.00 | $1.00-1.80 | Full SFT 7B, QLoRA 70B
NVIDIA H100 SXM | 80GB | $2.40-4.00 | $1.50-2.50 | Fast training, RLHF, large-batch DPO
4× A100 80GB | 320GB | $8.00-12.00 | $4.00-7.00 | Full SFT 70B, multi-GPU RLHF
8× H100 SXM | 640GB | $20.00-32.00 | $12.00-20.00 | Full SFT 70B+, production RLHF pipelines

Training Cost by Method & Model Size

Method | Model Size | GPU Setup | GPU-Hours | Cloud Cost | Data Prep | Total
LoRA (r=8) | 7-8B | 1× A100 40GB | 4-8 hrs | $6-18 | $500-2K | $500-2K
QLoRA (4-bit) | 70B | 1× A100 80GB | 8-16 hrs | $16-48 | $1-3K | $1-3K
Full SFT | 7-8B | 2× A100 80GB | 20-50 hrs | $80-300 | $1-3K | $1-3.5K
Full SFT | 70B | 8× H100 | 50-150 hrs | $1-5K | $2-5K | $3-10K
DPO | 7-8B | 2× A100 80GB | 20-50 hrs | $80-300 | $2-5K | $2-5.5K
RLHF (PPO) | 7-8B | 4× A100 80GB | 200-500 hrs | $1.6-5K | $5-15K | $7-20K

Fine-Tuning vs API: Break-Even Analysis

Scenario: 100K queries/day

API route (GPT-4o-mini): ~500 tokens/query × 100K = 50M tokens/day
Cost: $0.15/1M input + $0.60/1M output ≈ $37.50/day = $1,125/month

Fine-tuned Llama-3.3-8B (self-hosted):
1× A100 80GB at $2/hr = $1,440/month + one-time fine-tune $2K

Fine-tuned (quantized INT4):
1× A40 48GB at $0.80/hr = $576/month + one-time fine-tune $2K

Break-even: Quantized fine-tuned model pays back in ~4 months, then saves $549/month (49%)

Scenario: 1M queries/day

API route (GPT-4o-mini):
500M tokens/day ≈ $375/day = $11,250/month

Fine-tuned Llama-3.3-8B (vLLM, 4× A100):
$8/hr × 24 × 30 = $5,760/month + one-time $3K

Break-even: ~18 days. First-year savings: $62,880 net of the one-time cost (46%). Plus: lower latency, data privacy, no vendor lock-in.
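The payback arithmetic in both scenarios reduces to one helper (`break_even_days` is ours; the example numbers come from the 1M-queries/day scenario):

```python
def break_even_days(api_monthly, hosted_monthly, one_time_cost):
    """Days until a one-time fine-tuning cost is recovered by the
    monthly savings of self-hosting versus the API."""
    monthly_savings = api_monthly - hosted_monthly
    if monthly_savings <= 0:
        return float("inf")  # self-hosting never pays back
    return one_time_cost / monthly_savings * 30

# 1M queries/day: $11,250 API vs $5,760 self-hosted, $3K one-time fine-tune
```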

Cost Optimization Strategies

Training Savings

  • LoRA/QLoRA: 5-50x cheaper
  • Spot instances: 40-70% off
  • BF16/FP16: 2x memory efficiency
  • Gradient accum: Fewer GPUs
  • Flash Attention 2: 2-3x speedup

Inference Savings

  • INT4 quantization: 4x less VRAM
  • vLLM batching: 10-50x throughput
  • KV-cache: Reduce recomputation
  • Speculative decoding: 2-3x faster
  • Response caching: Skip repeated
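Response caching from the list above is just memoization keyed on the exact prompt; a minimal sketch where `generate_fn` stands in for the model call:

```python
import hashlib

class ResponseCache:
    """Skip regeneration for repeated prompts (exact-match cache)."""
    def __init__(self):
        self._store = {}

    def _key(self, prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get_or_generate(self, prompt, generate_fn):
        # Only call the (expensive) generator on a cache miss
        k = self._key(prompt)
        if k not in self._store:
            self._store[k] = generate_fn(prompt)
        return self._store[k]
```

Production systems usually add a TTL and an eviction policy; semantic (embedding-based) caching can also catch near-duplicate prompts.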

Data Savings

  • Synthetic data: 10x cheaper than human
  • Active learning: Label only hard cases
  • Curriculum: Fewer epochs needed
  • DPO vs RLHF: 40-75% cheaper
  • Distill then tune: Smaller base
Decision Framework

  • <10K queries/day: Use API (GPT-4o-mini or Claude Haiku at $0.25/1M).
  • 10K-100K/day: Fine-tune with LoRA + quantize; self-host on 1 GPU.
  • 100K-1M/day: Full SFT or DPO + vLLM on 2-4 GPUs; saves 40-60% vs API.
  • >1M/day: Fine-tune + quantize + dedicated cluster; saves 60-80% vs API, payback in 2-4 weeks.
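The framework's thresholds can be codified directly; the returned strings are shorthand for the recommendations above:

```python
def choose_strategy(queries_per_day):
    """Map daily query volume to the deployment strategy from the
    decision framework (thresholds taken from the text)."""
    if queries_per_day < 10_000:
        return "API"
    if queries_per_day < 100_000:
        return "LoRA + quantize, self-host on 1 GPU"
    if queries_per_day < 1_000_000:
        return "Full SFT or DPO + vLLM on 2-4 GPUs"
    return "Fine-tune + quantize + dedicated cluster"
```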

Safety & Alignment

Ensuring models behave safely and responsibly

Safety Risks

  • Overconfidence after RLHF
  • Reward hacking (gaming objective)
  • Capability regression
  • Adversarial jailbreaks
  • Bias amplification

Mitigation Strategies

  • Constitutional AI: Encode safety rules
  • Red-teaming: Adversarial testing
  • Calibration: Uncertainty estimates
  • KL penalties: Limit divergence

Constitutional AI: Principle-Based Alignment

# Constitutional AI: define safety principles
principles = [
    "Be helpful and harmless",
    "Don't provide medical advice",
    "Refuse illegal requests",
    "Acknowledge uncertainty",
]

# Critique: ask the model to flag unsafe content in its own response
critique_prompt = """
Identify any harmful or unsafe content in the response.
Rate on a scale of 1-5 where 5 is most harmful.
"""

# Revise: the model improves its response based on the critique
revision_prompt = """
Provide an improved response that adheres to our principles.
"""
Adversarial Testing

Use benchmarks like AdvBench (a suite of adversarial jailbreak prompts) and the OWASP Top 10 for GenAI. Test for: prompt injection, model extraction, bias amplification, hallucinations.

Privacy & Compliance

Data governance, regulation, and secure training

PII Handling

  • Detection: spaCy NER, Presidio
  • Redaction: Mask PII tokens
  • Audit trails: Log data access
  • Retention: Auto-delete policies

Privacy Techniques

  • Differential Privacy: Opacus, TensorFlow Privacy
  • Federated Learning: Train on-device
  • Secure Enclaves: Hardware isolation
  • Data minimization: Collect only needed

Compliance Requirements

GDPR

Right to erasure, consent, data portability

HIPAA

Healthcare: encryption, audit logs, BAAs

Model Licensing

Commercial use, CC, proprietary restrictions

Privacy by Design

Plan privacy early: data minimization, consent management, retention policies. Use privacy-preserving techniques (DP, federated) for sensitive domains.

Failure Modes

Common pitfalls and mitigation strategies

Catastrophic Forgetting

Base model capabilities degrade during fine-tuning on specific data.

Fix: Replay buffer mixing base data (10-20%), lower LR, early stopping

Overfitting to Feedback

Model exploits reward function instead of learning genuine behavior.

Fix: KL penalty, diverse reward signals, hold-out test set

Bias & Toxicity Regression

Fine-tuning amplifies harmful biases or toxicity in specific domains.

Fix: Balanced datasets, debiasing, red-teaming, fairness metrics

PPO Instability

PPO training diverges or oscillates in loss/reward.

Fix: Smaller LR (1e-5), fewer PPO epochs, monitor KL divergence

Safety Regressions

Model becomes less safe or generates harmful content after tuning.

Fix: Safety-balanced data, adversarial eval, constitutional AI

Pipeline Failures

Data preprocessing, training, or serving breaks silently.

Fix: Unit tests, integration tests, monitoring, alerting

Implementation Roadmap

5-phase plan from planning to production

Phase 1: Planning & Data Prep

Define objectives, audit/cleanse data, set up MLOps.

Timeline: 1-2 weeks. Focus: planning, data audit.
  • Document fine-tuning goals and success metrics
  • Audit dataset for PII, quality, diversity
  • Set up MLflow, W&B, or similar tracking
  • Plan compute budget and timeline

Phase 2: Prototype & Baseline

SFT baseline, evaluate, trial LoRA, measure trade-offs.

Timeline: 2-4 weeks. Focus: SFT + LoRA, evaluation.
  • Train small SFT baseline (7B-13B model)
  • Evaluate on task-specific metrics + safety
  • Trial LoRA with same data, compare speed/quality
  • Document results and decide method

Phase 3: Alignment & Safety

Human feedback, reward model, PPO/DPO, red-teaming.

Timeline: 4-8 weeks. Focus: RLHF/DPO, red-teaming.
  • Collect human preference data (5K-20K pairs)
  • Train reward model if using RLHF
  • Run PPO/DPO training with KL penalties
  • Red-team for jailbreaks, adversarial examples

Phase 4: Optimization & Testing

Quantization, end-to-end evaluation, A/B testing.

Timeline: 2-4 weeks. Focus: quantization, A/B testing.
  • Apply quantization (INT8, GPTQ) for faster inference
  • End-to-end latency and throughput testing
  • Canary deployment (5-10% traffic)
  • A/B test vs. baseline model, measure win rate

Phase 5: Production Deployment

Full production, monitoring, continuous retraining.

Timeline: ongoing. Focus: CI/CD, monitoring.
  • Deploy to production with versioning
  • Monitor latency, cost, accuracy, drift
  • Set up continuous retraining on new data
  • Plan quarterly reviews and updates

Tools & References

Essential libraries, frameworks, and resources

Training Libraries

  • HuggingFace Transformers: Base models, Trainer API
  • PEFT: LoRA, Adapters, Prefix tuning
  • Accelerate: Distributed training, quantization
  • DeepSpeed: ZeRO, optimization
  • Ray Tune: Hyperparameter tuning

Serving & Deployment

  • BentoML: Model packaging, REST API
  • MLflow: Model registry, serving
  • Ray Serve: Distributed inference
  • KServe: Kubernetes-native
  • TensorRT-LLM: NVIDIA optimization

Monitoring & Evaluation

  • Weights & Biases: Experiment logging
  • TensorBoard: Metrics visualization
  • Hugging Face Evaluate: Benchmarks
  • OpenCompass: LLM leaderboards
  • HELM: Safety evaluation

Data & Infrastructure

  • HuggingFace Datasets: Data loading
  • Presidio: PII detection
  • spaCy: NLP utilities
  • Kubernetes: Container orchestration
  • Airflow: Workflow management

Key Papers & Resources

Foundational

  • LoRA: Low-Rank Adaptation (Hu et al.)
  • Instruction Tuning with FLAN (Wei et al.)
  • Training language models to follow instructions (InstructGPT)

Advanced

  • Direct Preference Optimization (Rafailov et al.)
  • Constitutional AI (Bai et al.)
  • Scaling Laws & Chinchilla (DeepMind)
Recommended Starting Stack

SFT: Transformers Trainer + PEFT (LoRA) + W&B. Serving: BentoML or Ray Serve. Monitoring: W&B + custom dashboards. This covers 80% of use cases with minimal overhead.

HuggingFace Base & SFT Models

Foundation Models for Fine-Tuning Across Use Cases

Large Models (13B+) — Maximum Quality

Model | Params | Context | MMLU | License | Best For
meta-llama/Llama-3.1-70B | 70B | 128K | ~86 | Llama 3.1 | Best open teacher; general SFT for all domains
Qwen/Qwen3-32B | 32B | 128K | ~83 | Apache 2.0 | Multilingual SFT; strong reasoning; permissive license
Qwen/Qwen3-30B-A3B | 30B (3B active) | 262K | ~82 | Apache 2.0 | MoE: 30B quality at 3B inference cost; RAG generation
mistralai/Mistral-Large-2 | 123B | 128K | ~84 | Mistral | Code + reasoning; enterprise applications
microsoft/Phi-4 | 14B | 16K | ~78 | MIT | High quality per parameter; reasoning-focused

Small Models (1B-8B) — Cost-Efficient Fine-Tuning

Model | Params | MMLU | HumanEval | LoRA VRAM | Best For
meta-llama/Llama-3.3-8B | 8B | 73.0 | 72.6 | ~16GB | Best all-around 8B; recommended starting point
Qwen/Qwen3-8B | 8B | ~72 | ~75 | ~16GB | Best code generation at 8B; strong multilingual
HuggingFaceTB/SmolLM3-3B | 3B | ~67 | ~58 | ~8GB | Best 3B model; full training blueprint published
microsoft/Phi-4-mini-instruct | 3.8B | ~70 | ~66 | ~10GB | Edge deployment; reasoning-heavy tasks
google/gemma-2-9b | 9B | ~71 | ~64 | ~18GB | Google ecosystem; good instruction following
mistralai/Mistral-7B-v0.3 | 7B | ~63 | ~40 | ~14GB | Sliding window attention; fast inference
Qwen/Qwen3-1.7B | 1.7B | ~55 | ~35 | ~4GB | Ultra-lightweight; mobile/IoT fine-tuning
# Quick-start: LoRA fine-tuning with any base model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig

# Choose your base model
model_name = "meta-llama/Llama-3.3-8B"  # or Qwen3-8B, SmolLM3-3B, etc.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Apply LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints the trainable share (~0.2% of params)
Model Selection Decision Tree

Budget GPU (16GB): Llama-3.3-8B or Qwen3-8B with QLoRA. Minimal GPU (8GB): SmolLM3-3B or Phi-4-mini with QLoRA 4-bit. Maximum quality: Llama-3.1-70B with QLoRA on 2×A100. Multilingual: Qwen3 family (all sizes). Edge/mobile: Qwen3-1.7B or Qwen3-0.6B.

HuggingFace RAG & Embedding Models

Models for Retrieval-Augmented Fine-Tuning

Embedding Models (Fine-Tunable)

Model | Params | Dims | MTEB | Fine-Tuning Use Case
Qwen/Qwen3-Embedding-8B | 8B | 32-4096 | 70.58 (#1) | Domain-specific retrieval with custom dims
BAAI/bge-m3 | 568M | 1024 | ~66 | Multilingual RAG; dense + sparse + multi-vector
jinaai/jina-embeddings-v3 | 570M | 1024 | ~65 | Multi-task fine-tuning (retrieval + classification)
BAAI/bge-base-en-v1.5 | 109M | 768 | ~63 | English-only domain adaptation; fast fine-tuning
sentence-transformers/all-MiniLM-L6-v2 | 22M | 384 | ~56 | Ultra-fast; fine-tune for domain similarity tasks
sentence-transformers/all-mpnet-base-v2 | 109M | 768 | ~60 | Best sentence-transformer; STS fine-tuning

Reranker Models (Fine-Tunable)

Model | Params | Context | Fine-Tuning Use Case
mixedbread-ai/mxbai-rerank-large-v2 | 1.5B | 8K | Domain reranking; 100+ languages; RL-trained baseline
BAAI/bge-reranker-v2-m3 | 568M | 8K | Multilingual reranking fine-tuning
BAAI/bge-reranker-base | 278M | 512 | Lightweight domain reranker; fast fine-tuning
cross-encoder/ms-marco-MiniLM-L-6-v2 | 22M | 512 | Ultra-fast reranker; MS-MARCO pre-trained
colbert-ir/colbertv2.0 | 110M | 512 | Late-interaction retrieval; use with RAGatouille
# Fine-tune an embedding model for domain-specific retrieval
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load a pre-trained embedding model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Domain-specific training pairs (similarity labels in [0, 1])
train_examples = [
    InputExample(texts=["patient symptoms", "clinical presentation"], label=0.9),
    InputExample(texts=["patient symptoms", "stock market"], label=0.1),
]

# Cosine-similarity loss for retrieval
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Fine-tune
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./domain-embeddings",
)
RAG Fine-Tuning Strategy

Embedding: Fine-tune bge-base-en-v1.5 on domain pairs with contrastive loss for 3-5 epochs. Reranker: Fine-tune bge-reranker-base with domain query-document relevance scores. Generator: SFT Llama-3.3-8B on domain QA pairs with retrieved context. This 3-stage approach maximizes end-to-end RAG quality.

Alignment & RLHF Models

Models & Tools for DPO, PPO, and Instruction Tuning

Base Models for RLHF / DPO

Model | Params | Pre-Aligned? | DPO-Ready? | Best For
meta-llama/Llama-3.3-8B-Instruct | 8B | Yes (SFT + RLHF) | Yes | Further DPO alignment for specific domains
Qwen/Qwen3-8B-Instruct | 8B | Yes (SFT + RLHF) | Yes | Multilingual alignment; strong code + reasoning
meta-llama/Llama-3.3-8B | 8B | No (base) | After SFT | Full RLHF pipeline: SFT → RM → PPO
HuggingFaceTB/SmolLM3-3B | 3B | Yes (instruct) | Yes | Lightweight DPO; resource-constrained alignment
google/gemma-2-9b-it | 9B | Yes (IT) | Yes | Safety-focused alignment fine-tuning

Reward Models

Model | Base | Use Case
OpenAssistant/reward-model-deberta-v3-large-v2 | DeBERTa-v3-large | General preference scoring; lightweight RM
Nexusflow/Starling-RM-34B | Yi-34B | High-quality reward model; strong correlation with human prefs
allenai/tulu-v2.5-13b-uf-rm | Llama-2-13B | UltraFeedback-trained; open RLHF pipeline

Preference Datasets

For DPO Training

  • HuggingFaceH4/ultrafeedback_binarized — 64K preferences; most popular DPO dataset
  • argilla/dpo-mix-7k — 7K high-quality curated preferences
  • Intel/orca_dpo_pairs — Orca-style DPO training data

For Instruction Tuning

  • tatsu-lab/alpaca — 52K GPT-generated instructions; classic dataset
  • HuggingFaceH4/no_robots — 10K human-written instructions
  • Open-Orca/OpenOrca — 4M+ multi-task instruction pairs
# DPO alignment with TRL
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.3-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token

# Load preference dataset (DPO split with chosen/rejected pairs)
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

# DPO training config
dpo_config = DPOConfig(
    output_dir="./dpo-llama",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=5e-7,
    beta=0.1,  # DPO temperature
    warmup_ratio=0.1,
    logging_steps=10,
    gradient_accumulation_steps=4,
)

# Train with DPO
trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
DPO vs RLHF Trade-off

DPO: 40-75% cheaper, more stable training, simpler implementation. RLHF (PPO): 8% unsafe outputs vs DPO's 10% in adversarial tests; better structured reasoning and OOD generalization. Start with DPO; switch to RLHF if safety requirements are strict.

Alignment Pipeline

Step 1: SFT on instruction dataset (Alpaca/OpenOrca). Step 2: DPO with UltraFeedback preferences. Step 3: Red-team evaluation. Step 4: Optional PPO with custom reward model for safety-critical domains.

Glossary of Fine-Tuning Terms

25 key technical terms used throughout this guide, organized alphabetically.

A

  • AdamW: The standard optimizer for LLM training (Adam with decoupled weight decay). Typical learning rates for fine-tuning: 1e-5 to 5e-5 for large models, slightly higher for PEFT methods.
  • Adapter: A small trainable module inserted into frozen Transformer layers, consisting of a down-projection, nonlinearity, and up-projection. Enables task-specific tuning with minimal parameter overhead.
  • Alpaca: Stanford's instruction-tuning dataset, created by generating 52K instruction-response pairs from text-davinci-003. A landmark example of synthetic data for fine-tuning.

C

  • Constitutional AI: Anthropic's alignment technique where the model self-critiques outputs against a set of written principles (a "constitution"), reducing reliance on human annotators for safety training.
  • Curriculum Learning: A training strategy that presents examples in increasing difficulty order (easy→hard), improving convergence and sample efficiency for fine-tuning.

D

  • DPO (Direct Preference Optimization): An alignment method that directly optimizes from preference data using a binary classification loss. 40-75% cheaper than RLHF while achieving comparable alignment quality.

E

  • Epoch: One complete pass through the entire training dataset. Fine-tuning typically uses 2-4 epochs; more epochs risk overfitting, especially with small datasets.

F

  • Federated Learning: Distributed training where data stays on-device and only model updates are shared. Enables privacy-preserving fine-tuning across multiple data owners without centralizing sensitive data.
  • Full Fine-Tuning (SFT): Updating all model parameters on task-specific data. Achieves maximum quality but requires significant GPU memory and compute; 100% of parameters are trainable.

G

  • Gradient Accumulation: Simulating larger batch sizes by accumulating gradients over multiple forward passes before updating weights. Enables training with large effective batches on limited GPU memory.
  • Gradient Checkpointing: Trading compute for memory by recomputing intermediate activations during backpropagation instead of storing them. Reduces memory usage by ~60% at the cost of ~30% slower training.

H

  • Human-in-the-Loop (HITL): An iterative workflow where model outputs are reviewed and corrected by humans, with corrections fed back as training data. Used in both SFT data creation and RLHF preference labeling.

I

  • Instruction Tuning: Fine-tuning an LLM on (instruction, response) pairs to improve instruction-following ability. The first alignment step after pre-training. Examples: Alpaca, FLAN, OpenOrca datasets.

L

  • LoRA (Low-Rank Adaptation): Adds trainable low-rank matrices (rank r, typically 4-16) to frozen attention weights: W' = W + (α/r)·AB^T. Trains 0.1-1% of parameters. Can be merged into base weights at inference for zero overhead.

M

  • Mixed Precision (BF16/FP16): Training with half-precision floating point to reduce memory by 50% and speed up computation. BF16 is preferred for training stability; FP16 requires loss scaling.

P

  • PEFT (Parameter-Efficient Fine-Tuning): A family of methods (LoRA, adapters, prefix tuning, prompt tuning) that fine-tune only a small fraction (<1%) of model parameters while keeping the rest frozen.
  • PPO (Proximal Policy Optimization): The RL algorithm used in RLHF to update the LLM policy. Maximizes expected reward from a reward model while constraining updates with a KL penalty against the base model.
  • Prefix Tuning: A PEFT method that prepends trainable continuous vectors ("prefixes") to the keys and values in each attention layer. ~0.01% trainable parameters.
  • Prompt Tuning: The simplest PEFT method: prepending learnable soft tokens to the input. <0.01% trainable parameters, no architectural changes needed. Works best with very large models.

Q

  • QLoRA: Quantized LoRA, which loads the base model in 4-bit precision (NF4) and adds LoRA adapters on top. Enables fine-tuning a 70B model on a single 48GB GPU.

R

  • Reward Model: A model trained to predict human preference scores for LLM outputs. Used in RLHF to provide reward signals for PPO training; typically trained on pairwise preference data.
  • RLHF: Reinforcement Learning from Human Feedback, a 3-stage alignment pipeline: (1) SFT on demonstrations, (2) train a reward model on preferences, (3) optimize the LLM with PPO against the reward model.

S

  • Synthetic Data: Training data generated by LLMs rather than human annotators, roughly 10× cheaper than human data. Common approach: use a strong teacher model to generate instruction-response pairs for student fine-tuning.

T

  • TRL: Transformer Reinforcement Learning, HuggingFace's library for RLHF and DPO training. Provides SFTTrainer, DPOTrainer, PPOTrainer, and RewardTrainer.

W

  • Weight Decay: A regularization technique that adds a penalty proportional to parameter magnitude to the loss, preventing overfitting during fine-tuning. Typical values: 0.01-0.1.
Full Reference: For a comprehensive glossary covering ALL LLM topics across all documents, see the unified LLM Glossary with 140+ terms.