LLM Pruning
Remove unnecessary weights and layers to build faster, leaner models without sacrificing quality.
What is Pruning?
Removal of weights, neurons, layers, or attention heads from trained models to create sparse networks. Pruning can be unstructured (individual weights) or structured (entire components).
Pruning Types
- Weight Pruning: Remove individual weights with smallest magnitude
- Neuron Pruning: Eliminate entire neurons (whole rows/columns of weight matrices) across layers
- Layer Pruning: Remove middle transformer layers completely
- Attention Head Pruning: Drop less important multi-head attention heads
Why Pruning?
- Smaller Models: 40–70% model size reduction
- Faster Inference: 2–4× speedup on hardware with sparse support
- Lower Latency: Fewer operations per token
- Deployment: Fits on edge devices, mobile, browsers
Pruning works best when combined with quantization for maximum compression. A 50% sparse + 8-bit quantized model can match dense FP16 models while being 6-8× smaller.
Fundamentals
Core concepts and methods that power modern pruning.
Magnitude Pruning (Baseline)
Simplest pruning method: remove weights with smallest absolute value. Fast, but not always optimal.
```python
# Basic magnitude pruning: zero out the 50% of weights with smallest |w|
import torch

weights = model.layer.weight.data
threshold = torch.quantile(torch.abs(weights), 0.5)
mask = torch.abs(weights) > threshold
weights *= mask  # zero out small weights
```
Importance Scoring Methods
Magnitude-Based
Score = |w|. Fast but ignores input correlation.
Gradient-Based
Score = |∂L/∂w| × |w|. Uses loss sensitivity.
Hessian-Based
Score ≈ Hᵢᵢ × wᵢ². Second-order (OBD-style) approximation; expensive to compute.
Activation-Based
Score = |w| × |activation|. WANDA approach (ICLR 2024).
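The difference between the first two scoring rules can be seen on a toy weight matrix. The sketch below uses made-up numbers and plain Python lists (no framework) to show that magnitude-only and activation-aware scores can disagree about which weight to prune first.

```python
# Toy contrast between magnitude-only and activation-aware (WANDA-style)
# scoring; pure-Python sketch with made-up numbers.

weights = [[0.9, -0.1, 0.2],
           [0.3,  0.8, -0.4]]        # 2 output neurons x 3 inputs
act_norm = [0.1, 3.0, 1.0]           # mean |activation| per input feature

def magnitude_scores(w):
    return [[abs(x) for x in row] for row in w]

def wanda_scores(w, a):
    return [[abs(x) * a[j] for j, x in enumerate(row)] for row in w]

mag = magnitude_scores(weights)      # row 0: [0.9, 0.1, 0.2]
wnd = wanda_scores(weights, act_norm)

# Magnitude would prune w[0][1] first; once its large input activation is
# accounted for, w[0][0] (fed by a near-dead input) becomes the weakest.
print(min(range(3), key=lambda j: mag[0][j]))  # 1
print(min(range(3), key=lambda j: wnd[0][j]))  # 0
```

Gradient- and Hessian-based scores follow the same pattern but replace the activation norm with loss-sensitivity terms.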
Pruning Schedule
| Method | Description | Pros | Cons |
|---|---|---|---|
| One-Shot | Prune once, no retraining | Fast, minimal compute | Lower quality at high sparsity |
| Iterative | Prune → retrain repeatedly | Higher quality, smoother | Expensive, slow |
| Gradual | Progressive sparsity increase during training | Good quality-cost tradeoff | Requires full retraining |
Sparsity Patterns
Unstructured
Any weight can be pruned. Most flexible but harder to accelerate on hardware.
Structured
Prune entire neurons, heads, or layers. Hardware-friendly but less flexible.
Semi-Structured (N:M)
N zeros per M consecutive values (e.g., 2:4). Hardware acceleration available.
The Lottery Ticket Hypothesis
Implication: a trained dense network contains sparse subnetworks ("winning tickets") that, when reset to their original initialization and retrained, can match the dense network's accuracy, suggesting that much of the over-parameterization is redundant at inference time.
Pruning vs. Distillation vs. Quantization
| Technique | What It Does | Speedup | Quality Loss | Combined? |
|---|---|---|---|---|
| Pruning | Remove weights/layers | 2–4× | 5–20% | Yes with all |
| Quantization | Reduce precision (FP32→INT8) | 2–3× | 2–10% | Yes with all |
| Distillation | Train small model on large model outputs | Variable | 0–5% | Best baseline |
Unstructured Pruning
Remove individual weights to achieve maximum flexibility in sparsity patterns.
WANDA (Pruning by Weights AND Activations) — ICLR 2024
State-of-the-art for one-shot LLM pruning. Prunes weights with smallest magnitude × activation product.
```python
def wanda_prune(model, sparsity_level, calibration_data):
    # Score each weight by |w| * mean |activation| of its input feature
    for layer in model.layers:
        weight = layer.weight            # shape: [out, in]
        activations = layer.get_activations(calibration_data)

        w_abs = torch.abs(weight)
        a_abs = torch.abs(activations).mean(dim=0)   # [in]
        importance = w_abs * a_abs.unsqueeze(0)

        # Threshold for the target sparsity (0.5 -> drop bottom 50%)
        threshold = torch.quantile(importance, sparsity_level)
        mask = importance > threshold

        layer.weight.data *= mask.float()
    return model
```
SparseGPT
Layer-wise optimal brain surgeon with second-order approximation. Handles 50–60% unstructured sparsity well.
```python
# Simplified SparseGPT reconstruction (pseudocode)
for layer in model.layers:
    # Inverse Hessian from calibration data (the expensive step)
    H_inv = layer.get_hessian_inverse(calibration_data)
    for output_idx in range(layer.out_features):
        w = layer.weight[output_idx]     # [in]
        # OBS-style saliency: weights that are cheapest to remove
        importance = torch.abs(w) / torch.sqrt(H_inv.diagonal())
        prune_idx = importance.argsort()[:num_to_prune]
        w[prune_idx] = 0
        # ... update surviving weights to compensate (reconstruction) ...
```
Method Comparison: WANDA vs SparseGPT vs Magnitude
| Method | Scoring | Retraining | Speed | Quality @ 50% Sparse (vs dense) |
|---|---|---|---|---|
| WANDA | \|w\| × \|a\| | No | ~5 min (LLaMA-7B) | 92–94% |
| SparseGPT | Hessian-based reconstruction | No | 1–2 hours | 93–95% |
| Magnitude | \|w\| only | No | <1 min | 88–90% |
Structured Pruning
Remove entire neurons, layers, or attention heads for hardware-friendly sparsity.
Layer Removal (ShortGPT)
Remove middle transformer layers while keeping embedding and output layers. NVIDIA's Nemotron: 15B → 8B/4B.
```python
# Remove the middle 24 of 48 layers (48 -> 24)
original_layers = 48
target_layers = 24
layers_to_remove = original_layers - target_layers

# Keep first 12, remove the middle 24, keep last 12
new_model = Model(
    layers=model.layers[:12] + model.layers[36:]
)

# Quick fine-tune on calibration data to recover quality
fine_tune(new_model, calibration_data, epochs=3)
```
Attention Head Pruning
Remove least important heads based on activation magnitude or gradient flow. Typical: 5–15% of heads are redundant.
```python
# Measure head importance and drop the weakest heads (pseudocode)
for layer in model.layers:
    multi_head = layer.self_attn

    # Score each head by its mean attention output magnitude
    head_scores = []
    for h in range(multi_head.num_heads):
        head_attn = multi_head.attn_output[h]  # averaged across the batch
        head_scores.append(head_attn.abs().mean())

    # Remove the 5 lowest-scoring heads
    heads_to_remove = torch.argsort(torch.tensor(head_scores))[:5]
    multi_head.num_heads -= 5
```
Width Pruning
Reduce hidden dimensions (FFN inner dimension, embedding dimension). Less common but effective.
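A minimal sketch of width pruning on toy matrices: score each FFN inner channel by the norms of its up- and down-projection weights, then keep only the top half. Names (`w_up`, `w_down`, `channel_score`) are generic illustrations, not any library's API.

```python
# Width-pruning sketch: drop the lowest-importance FFN inner channels.
# Toy matrices with made-up values; pure Python, no framework.
import math

d_ff = 4
w_up   = [[0.5, 0.5], [0.01, 0.02], [1.0, -1.0], [0.3, 0.1]]  # d_ff x d_model
w_down = [[0.2, 0.0, 0.7, 0.1], [0.1, 0.0, -0.6, 0.2]]        # d_model x d_ff

def channel_score(i):
    # Importance of inner channel i: L2 norm of its up-projection row
    # times the L2 norm of its down-projection column
    up = math.sqrt(sum(x * x for x in w_up[i]))
    down = math.sqrt(sum(row[i] * row[i] for row in w_down))
    return up * down

# Keep the top half of channels, preserving their original order
keep = sorted(sorted(range(d_ff), key=channel_score, reverse=True)[: d_ff // 2])

w_up_pruned = [w_up[i] for i in keep]
w_down_pruned = [[row[i] for i in keep] for row in w_down]
print(keep)  # channels 1 (near-zero weights) and 3 (low norms) are dropped
```

Because whole rows and columns disappear, the pruned matrices are smaller dense tensors, so the speedup needs no sparse hardware.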
Minitron Approach (NVIDIA, 2024)
- 1. Identify and remove least important layers/heads
- 2. Fine-tune on diverse data
- 3. Distill from original Nemotron-15B
- 4. Result: Minitron-8B & 4B match or exceed other 7-8B models
Sheared LLaMA & LLM-Pruner
Sheared LLaMA: Structured pruning of LLaMA2-7B down to compact 1.3B/2.7B models, followed by continued pretraining to recover quality.
LLM-Pruner: Task-agnostic structured pruning using importance scores from gradient flow. Works on any LLM.
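The gradient-flow scoring behind LLM-Pruner can be illustrated on a single linear neuron. This sketch computes the gradient by hand for a squared-error loss; a real implementation reads `.grad` from an autograd backward pass.

```python
# Gradient-based importance sketch in the spirit of |w * dL/dw| scoring.
# One linear neuron, one calibration input, hand-computed gradient.

w = [0.5, -0.1, 0.05]      # weights
x = [1.0, 0.2, 6.0]        # one calibration input
target = 0.0

y = sum(wi * xi for wi, xi in zip(w, x))        # forward: y = w . x
grad = [2.0 * (y - target) * xi for xi in x]    # dL/dw_j for L = (y - t)^2

importance = [abs(wi * gi) for wi, gi in zip(w, grad)]
least = min(range(len(w)), key=lambda j: importance[j])

# Pure magnitude would drop index 2 (|0.05| is smallest); gradient flow
# says index 1 matters least because it barely moves the loss.
print(least)  # 1
```

Structured variants aggregate these per-weight scores over a whole neuron, head, or layer before deciding what to remove.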
Structured Pruning Pros & Cons
Advantages
- ✓ Hardware-friendly (drop layers, narrow tensors)
- ✓ Easy to implement
- ✓ Works with standard training pipelines
- ✓ Minimal memory overhead
Disadvantages
- ✗ Less flexible (can't remove single weights)
- ✗ Layer collapse at high sparsity
- ✗ May require more retraining
- ✗ Quality-sparsity tradeoff worse than unstructured
N:M Sparsity
Semi-structured sparsity with hardware acceleration support.
N:M sparsity means exactly N zeros per M consecutive values. Most common: 2:4 (50% sparsity) and 4:8 (50% sparsity).
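The constraint is easy to verify mechanically. The checker below (a pure-Python sketch, not a library function) confirms that every group of 4 consecutive values in a row carries at most 2 nonzeros, which is what sparse tensor cores require.

```python
# Tiny checker for the 2:4 constraint: at most 2 nonzeros (i.e. at least
# 2 zeros) in every group of 4 consecutive values along a row.

def satisfies_2_4(row):
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    return all(
        sum(1 for v in row[i:i + 4] if v != 0) <= 2
        for i in range(0, len(row), 4)
    )

print(satisfies_2_4([0.5, 0.0, -0.3, 0.0,  0.0, 0.0, 1.2, 0.7]))  # True
print(satisfies_2_4([0.5, 0.1, -0.3, 0.0,  0.0, 0.0, 1.2, 0.7]))  # False
```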
Hardware Support
| Hardware | Pattern Support | Acceleration | Speedup @ 2:4 |
|---|---|---|---|
| NVIDIA A100 (Ampere) | 2:4, fine-grained | Sparse tensor cores | ~2× |
| NVIDIA H100 (Hopper) | 2:4, fine-grained | Sparse tensor cores | ~2.5× |
| Intel Gaudi | Structured only | Limited | ~1.2× |
| CPU | Any | None (dense compute) | ~1× |
WANDA Extended to N:M
Apply WANDA importance scores but enforce N:M constraint: for each group of M values, keep only N highest-importance ones.
```python
def wanda_nm_prune(weight, activations, n, m):
    # Importance: |w| * mean |activation| per input feature
    importance = torch.abs(weight) * activations.abs().mean(dim=0)

    # Group each row into blocks of m consecutive values
    out_features, in_features = weight.shape
    grouped = importance.view(out_features, in_features // m, m)

    # Keep the n highest-importance values in every group
    mask = torch.zeros_like(grouped)
    top_idx = torch.topk(grouped, n, dim=-1).indices
    mask.scatter_(-1, top_idx, 1.0)

    mask = mask.view_as(weight)
    return weight * mask
```
Trade-offs
2:4 Sparsity (50%)
- ✓ 2× speedup on A100/H100
- ✓ Minimal quality loss (1–3%)
- ✓ Good practical balance
4:8 Sparsity (50%)
- ✓ More aggressive sparsity pattern
- ✓ Quality loss similar to 2:4
- ✗ Less hardware support
Progressive Pruning
Iterative pruning with retraining for higher quality at extreme sparsity.
Iterative Magnitude Pruning (IMP)
Repeatedly prune, retrain, and reset surviving weights to their original values (the lottery-ticket recipe).
```python
def iterative_magnitude_pruning(model, target_sparsity, num_iterations):
    original_weights = {k: v.clone() for k, v in model.state_dict().items()}

    for iteration in range(num_iterations):
        # Prune a fraction of the weights each round
        pruning_amount = target_sparsity / num_iterations
        apply_magnitude_pruning(model, pruning_amount)

        # Retrain to recover quality
        optimizer = AdamW(model.parameters(), lr=1e-4)
        for epoch in range(5):
            for batch in train_loader:
                optimizer.zero_grad()
                loss = compute_loss(model, batch)
                loss.backward()
                optimizer.step()

        # Reset surviving weights to their original values (lottery ticket)
        for name, param in model.named_parameters():
            mask = param.data != 0
            param.data[mask] = original_weights[name][mask]

    return model
```
Gradual Pruning During Training
Increase sparsity gradually during the full training run using polynomial schedules.
```python
# Cubic pruning schedule (Zhu & Gupta, 2017): sparsity ramps up quickly
# early in training, then levels off at the target
def cubic_schedule(step, total_steps, initial_sparsity, target_sparsity):
    progress = min(step / total_steps, 1.0)
    return target_sparsity + (initial_sparsity - target_sparsity) * (1 - progress) ** 3

# During the training loop
for step, batch in enumerate(train_loader):
    current_sparsity = cubic_schedule(step, total_steps, 0.0, 0.5)
    apply_pruning(model, current_sparsity)

    optimizer.zero_grad()
    logits = model(batch)
    loss = compute_loss(logits, batch)
    loss.backward()
    optimizer.step()
```
Lottery Ticket Rewinding
After pruning, reset weights to their values from an early training checkpoint, then continue retraining.
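The mechanics of rewinding can be shown with plain lists standing in for model state. The step-1000 checkpoint and all values below are made-up examples: prune based on the *trained* weights, then reset survivors to the *checkpoint* values.

```python
# Weight-rewinding sketch: prune by final trained magnitude, then reset
# surviving weights to an early-training checkpoint. Toy values only.

ckpt_step_1000  = [0.12, -0.48, 0.35, -0.02]   # early-training checkpoint
trained_weights = [0.40, -0.90, 0.80,  0.01]   # end of training

# Keep the two largest-magnitude *trained* weights (indices 1 and 2)
keep = sorted(sorted(range(4), key=lambda j: abs(trained_weights[j]),
                     reverse=True)[:2])
mask = [1 if j in keep else 0 for j in range(4)]

# Survivors rewind to the checkpoint; pruned weights stay zero
rewound = [c if m else 0.0 for c, m in zip(ckpt_step_1000, mask)]
print(rewound)  # [0.0, -0.48, 0.35, 0.0]
```

Rewinding to an early checkpoint rather than to initialization tends to stabilize retraining at high sparsity.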
One-Shot vs Progressive Trade-offs
| Method | Quality @ 50% | Quality @ 80% | Training Time | Practical Use |
|---|---|---|---|---|
| One-Shot (WANDA) | 92% | 75% | 5 min | Quick iteration |
| Iterative (5 rounds) | 94% | 82% | 3–5 hours | Higher quality target |
| Gradual (full training) | 95% | 85% | 20 hours | From-scratch training |
Training Recipes
Practical pipelines for different pruning scenarios.
Post-Training Pruning (No Retraining)
Best for quick results. Use WANDA or SparseGPT for quality one-shot pruning.
```python
# Load a pre-trained model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")

# Calibrate on a small sample (8 samples x 128 tokens)
calibration_data = load_calibration_data(n_samples=8, seq_len=128)

# Apply WANDA (~5 minutes for a 7B model)
model = wanda_prune(model, sparsity_level=0.5, calibration_data=calibration_data)

# Evaluate
perplexity = evaluate_perplexity(model, validation_data)
print(f"50% sparse model: {perplexity} perplexity")
```
Prune-Then-Retrain
Prune once, then fine-tune on task-specific data for 1–3 epochs.
```python
# 1. Prune
model = wanda_prune(model, sparsity_level=0.6, calibration_data=calibration_data)

# 2. Fine-tune on the downstream task
optimizer = AdamW(model.parameters(), lr=1e-4)
for epoch in range(3):
    for batch in task_train_loader:
        optimizer.zero_grad()
        logits = model(batch["input_ids"])
        loss = F.cross_entropy(logits, batch["labels"])
        loss.backward()
        optimizer.step()

# 3. Evaluate on the downstream task
accuracy = evaluate_downstream(model, task_val_loader)
print(f"Task accuracy: {accuracy}")
```
Pruning + Distillation (Minitron Style)
Combine structured pruning with knowledge distillation for best quality.
```python
# 1. Identify layers to keep (gradient-based importance)
important_layers = identify_important_layers(model, calibration_data)

# 2. Create the pruned student architecture
pruned_model = create_pruned_model(model, important_layers)

# 3. Distill from the larger teacher checkpoint
teacher_model = AutoModelForCausalLM.from_pretrained(
    "path/to/15B-teacher"  # e.g. the original 15B model
)
teacher_model.eval()

kl_loss_fn = nn.KLDivLoss(reduction="batchmean")
optimizer = AdamW(pruned_model.parameters(), lr=5e-5)

T = 3  # distillation temperature
for batch in distill_train_loader:
    optimizer.zero_grad()
    student_logits = pruned_model(batch)
    with torch.no_grad():
        teacher_logits = teacher_model(batch)

    # KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable
    loss = kl_loss_fn(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
    ) * T * T
    loss.backward()
    optimizer.step()
```
Pruning + Quantization (Maximum Compression)
Combine 50% sparsity + INT8 quantization for 8–12× total compression.
```python
# 1. Prune to 50% sparsity
model = wanda_prune(model, sparsity_level=0.5, calibration_data=calibration_data)

# 2. Quantize to INT8 (e.g. via bitsandbytes Linear8bitLt layers)
from bitsandbytes.nn import Linear8bitLt
quantized_model = convert_to_8bit(model)  # helper that swaps in 8-bit linears

# 3. Quick recalibration forward passes
for batch in calibration_data:
    _ = quantized_model(batch)

# Result: 50% sparse + 8-bit = 6.4x smaller, 4-6x faster
print(f"Final size: {quantized_model.get_memory_footprint() / 1e9} GB")
```
Benchmarks
Real-world performance metrics across models and methods.
Perplexity vs Sparsity (LLaMA-7B, WikiText-2)
| Sparsity | Dense Baseline | Magnitude | WANDA | SparseGPT |
|---|---|---|---|---|
| 0% (Dense) | 5.68 | — | — | — |
| 30% | — | 5.95 | 5.81 | 5.78 |
| 50% | — | 6.52 | 6.09 | 6.03 |
| 60% | — | 7.41 | 6.65 | 6.51 |
| 70% | — | 9.12 | 7.45 | 7.22 |
Quality Retention at Different Sparsity Levels
| Sparsity Level | Zero-Shot Accuracy (relative to dense) | MMLU (absolute) | HellaSwag (absolute) |
|---|---|---|---|
| Dense | 100% | 32% | 78% |
| 50% (WANDA) | 99% | 31% | 77% |
| 60% (WANDA) | 97% | 29% | 74% |
| 70% (WANDA) | 91% | 26% | 68% |
| 80% (WANDA) | 75% | 18% | 52% |
Speed Improvements with N:M Sparsity
| Hardware | 2:4 Sparsity | Token/s (Dense) | Token/s (Sparse) | Speedup |
|---|---|---|---|---|
| A100 (40GB) | Yes | 450 | 920 | 2.04× |
| H100 (80GB) | Yes | 580 | 1450 | 2.50× |
| A10 GPU | No | 120 | 125 | 1.04× |
| CPU (AMD EPYC) | No | 5 | 6 | 1.2× |
Model Size & Memory Savings
| Model | Original Size | 50% Sparse | 50% Sparse + INT8 | Memory Saved |
|---|---|---|---|---|
| LLaMA-7B (FP16) | 13 GB | 6.5 GB | 3.3 GB | 75% |
| LLaMA-13B (FP16) | 26 GB | 13 GB | 6.5 GB | 75% |
| Llama 2-70B (FP16) | 140 GB | 70 GB | 35 GB | 75% |
Model Directory
Pre-pruned models and base models suitable for pruning.
Pre-Pruned Models on Hugging Face
| Model | Base Model | Pruning Method | Sparsity | Quality | Link |
|---|---|---|---|---|---|
| Minitron-8B | Nemotron-15B | Structured + Distill | 47% | Matches 7B baseline | nvidia/Minitron-8B |
| Minitron-4B | Nemotron-15B | Structured + Distill | 73% | Matches 3B baseline | nvidia/Minitron-4B |
| Sparse-LLaMA-7B | LLaMA-7B | WANDA (unstructured) | 50% | 93% baseline quality | vwxyzjn/sparse-llama-7b |
| Sheared-LLaMA | LLaMA2-7B | Structured + continued pretraining | Pruned to 1.3B/2.7B | Strong for model size | princeton-nlp/Sheared-LLaMA-* |
Recommended Base Models for Pruning
Llama 2 Series
- meta-llama/Llama-2-7b
- meta-llama/Llama-2-13b
- meta-llama/Llama-2-70b
- Easy to prune, well-documented
Mistral Series
- mistralai/Mistral-7B-v0.1
- mistralai/Mixtral-8x7B
- Good efficiency before pruning
Phi Series
- microsoft/phi-1.5
- microsoft/phi-2
- Small, prunes well
OLMo Series
- allenai/OLMo-1B
- allenai/OLMo-7B
- Research models, open
Deployment
Running pruned models in production with optimized inference.
DeepSparse Engine (Neural Magic)
Optimized CPU inference for sparse models. Perfect for edge deployment without GPU.
```python
from deepsparse import Pipeline

# Load a sparse model from the SparseZoo
pipeline = Pipeline.create(
    task="text_generation",
    model_path="zoo:nlp/text_generation/openai-gpt2/pruned_quantized",
    engine_type="deepsparse",
)

# Inference (kernels optimized for sparsity)
result = pipeline(prompt="The future of AI is", max_length=100)
print(result.generations[0].text)
```
NVIDIA Sparse Tensor Cores (A100/H100)
Use structured (2:4) sparsity with NVIDIA's CUTLASS or TensorRT-LLM.
```python
# Use TensorRT-LLM with sparse plugins (illustrative sketch; see the
# TensorRT-LLM docs for the actual builder API)
import tensorrt_llm as trt_llm

# Build a sparse model engine (2:4 sparsity auto-detected in the weights)
builder = trt_llm.Builder()
engine = builder.build_engine(
    model_path="./llama-7b-50-sparse.safetensors",
    precision="float16",
    max_batch_size=32,
)

# Serialize and deploy
engine.save("./llama-sparse.engine")
```
vLLM Integration
Use vLLM's PagedAttention with sparse models for efficient batching.
```python
from vllm import LLM, SamplingParams

# Load a sparse model checkpoint
llm = LLM(
    model="sparse-llama-7b-50-percent",
    dtype="float16",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.7,
)

prompts = ["What is AI?", "How does ML work?"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
Custom Sparse Kernel Implementation
For ultimate control, implement custom sparse matrix-vector multiplication (e.g., in CUDA/Triton).
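The kernel's inner loop is the standard CSR (compressed sparse row) matrix-vector product. The pure-Python reference below shows the data layout and the loop a CUDA/Triton kernel would parallelize across rows; it is a sketch, not production code.

```python
# Minimal CSR sparse matrix-vector product: the reference computation a
# custom sparse kernel implements in parallel.

def dense_to_csr(mat):
    values, col_idx, row_ptr = [], [], [0]
    for row in mat:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):  # only stored nonzeros
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

mat = [[0.0, 2.0, 0.0],
       [1.0, 0.0, 3.0]]
vals, cols, ptr = dense_to_csr(mat)
print(csr_matvec(vals, cols, ptr, [1.0, 1.0, 1.0]))  # [2.0, 4.0]
```

Work (and storage) scales with the number of nonzeros, which is where unstructured sparsity pays off when the kernel is good enough.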
Cost Analysis
Economic impact of pruning in training and inference.
Training Cost Comparison
| Method | Time (7B Model) | Compute Cost | GPU Hours | Cost @ $1/hr |
|---|---|---|---|---|
| WANDA (one-shot) | 5 min | 1× (calibration) | 0.08 | $0.08 |
| SparseGPT (one-shot) | 1 hour | 1× (reconstruction) | 1 | $1.00 |
| Iterative (5 rounds) | 3–5 hours | 5× (retraining) | 20 | $20.00 |
| From-scratch training | 20 hours | 1× (full training) | 20 | $20.00 |
Inference Savings (Annual Estimate)
Scenario: 1M requests/day, 10 days retention
| Model Setup | A100 GPUs Needed | Annual Cost (@ $50k/yr) | Savings vs Dense |
|---|---|---|---|
| Dense LLaMA-7B | 8 | $400k | — |
| 50% Sparse (WANDA) | 4 | $200k | 50% savings |
| 50% Sparse + Quantized | 2 | $100k | 75% savings |
Pruning vs Quantization Trade-offs
| Technique | Size Reduction | Speedup | Quality Loss | Ease of Use |
|---|---|---|---|---|
| Quantization Alone (INT8) | 4× | 2–3× | 2–10% | Easy |
| Pruning Alone (50%) | 2× | 2–4× (with H/W support) | 5–10% | Medium |
| Both (50% Sparse + INT8) | 8× | 4–6× | 8–15% | Hard |
Cloud GPU Pricing Reference (March 2026)
| GPU | VRAM | On-Demand $/hr | Spot $/hr | Pruning Use |
|---|---|---|---|---|
| A100 80GB | 80GB | $2.00-3.00 | $1.00-1.80 | WANDA/SparseGPT on 7-13B; 2:4 sparsity inference |
| H100 SXM | 80GB | $2.40-4.00 | $1.50-2.50 | N:M sparse tensor core acceleration; best throughput |
| 4× A100 80GB | 320GB | $8.00-12.00 | $4.00-7.00 | Structured pruning + retraining for 70B models |
Pruned Self-Host vs API — Total Cost (1M req/day, 500 tok/req)
| Approach | Monthly Cost | Annual Cost | Latency (P50) | vs GPT-4o API |
|---|---|---|---|---|
| GPT-4o API | $7,500 | $90,000 | 500-2000ms | Baseline |
| GPT-4o-mini API | $1,125 | $13,500 | 200-800ms | 85% cheaper |
| Llama-8B Dense (A100) | $2,160 | $25,920 | 50-150ms | 71% cheaper |
| Llama-8B 50% Sparse (A100, 2:4) | $1,440 | $17,280 | 25-80ms | 81% cheaper |
| Llama-8B Sparse + INT4 (A40) | $576 | $6,912 | 30-100ms | 92% cheaper |
When to Prune vs Just Quantize
Prune If:
- ✓ You have H/W with sparse support (A100/H100 tensor cores)
- ✓ Model latency is critical (<50ms P99)
- ✓ You need 4-6x combined speedup (sparse + quant)
- ✓ You can spare 5 min (WANDA) to 5 hours per model
- ✓ Serving cost is >$2K/month and you need deeper savings
Quantize Instead If:
- ✓ You have no sparse H/W (consumer GPUs, CPU)
- ✓ You need quick deployment (<1 hour)
- ✓ Model size (VRAM) is the bottleneck, not throughput
- ✓ 2-3x speedup is acceptable for your SLA
- ✓ Quality sensitivity is very high (<1% loss tolerance)
Failure Modes & Mitigation
Common pitfalls and how to avoid them.
Over-Pruning Causing Catastrophic Loss
Mitigation Strategies
- 1. Use progressive/iterative pruning instead of one-shot at high sparsity
- 2. Validate perplexity at each pruning step
- 3. Stay <70% sparsity unless using distillation
- 4. Use importance-aware scoring (WANDA, not magnitude alone)
Layer Collapse in Structured Pruning
```python
# Detect layer collapse during pruning
def detect_collapse(model, calibration_data):
    activations = model.get_layer_activations(calibration_data)
    for layer_name, act in activations.items():
        # A degenerate output distribution has very low relative variance
        mean = act.abs().mean()
        std = act.std()
        if std / mean < 0.1:
            print(f"WARNING: {layer_name} may be collapsing")
            return True
    return False
```
Mitigation Strategies
- 1. Monitor layer output statistics during pruning
- 2. Use activation-aware importance (WANDA uses this)
- 3. Retrain for 2–3 epochs after structured pruning
- 4. Preserve skip connections; don't prune residual paths
Sparsity Pattern Hardware Incompatibility
Mitigation Strategies
- 1. Use 2:4 structured sparsity for NVIDIA GPUs
- 2. Use DeepSparse Engine for CPU deployment
- 3. Verify speedup in your target hardware before production
- 4. Consider structured pruning instead if sparsity speedup fails
Retraining Instability After Aggressive Pruning
```python
# Stable retraining recipe
optimizer = AdamW(model.parameters(), lr=1e-5)   # 10x smaller LR than typical
scheduler = CosineAnnealingLR(optimizer, T_max=10)

for epoch in range(10):
    for batch in train_loader:
        optimizer.zero_grad()
        logits = model(batch)
        loss = F.cross_entropy(logits, batch["labels"])
        loss.backward()
        # Gradient clipping for stability
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    scheduler.step()  # step once per epoch (T_max counts epochs)
```
Mitigation Strategies
- 1. Use 10× smaller learning rate for post-pruning retraining
- 2. Apply gradient clipping (norm ≤ 1.0)
- 3. Warm up learning rate gradually
- 4. Use fewer epochs (2–5) but smaller batches
- 5. Consider layer-wise adaptive retraining rates
Common Debugging Checklist
- ☐ Is pruned model's speedup actually measured on target hardware?
- ☐ Are layer outputs collapsing to near-zero?
- ☐ Did you validate perplexity on calibration data?
- ☐ Are weights being zeroed correctly (check sparsity %)?
- ☐ Is retraining learning rate too high?
- ☐ Did you preserve batch norm statistics post-pruning?
- ☐ Are skip connections preserved in structured pruning?
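The sparsity-percentage check from the list above is a one-liner worth automating, since a pruning run that silently failed will report near-0% sparsity. A framework-agnostic sketch, with nested lists standing in for weight tensors:

```python
# Checklist helper: measure the fraction of exactly-zero weights.
# Nested Python lists stand in for a model's weight tensors.

def measure_sparsity(tensors):
    zeros = total = 0
    for t in tensors:
        for row in t:
            for v in row:
                zeros += (v == 0.0)
                total += 1
    return zeros / total

layers = [
    [[0.0, 0.5], [0.0, -0.2]],   # 2 of 4 weights zero
    [[1.0, 0.0], [0.0, 0.0]],    # 3 of 4 weights zero
]
print(measure_sparsity(layers))  # 0.625
```

With real tensors the same check is a sum over `(param == 0)` counts; compare the result against the sparsity you asked for.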
Tools & Frameworks
Software ecosystem for LLM pruning.
LLM-Pruning Collection (Jan 2026)
JAX-based pruning framework from Princeton Z-Lab. Supports WANDA, SparseGPT, structured pruning with unified API. Latest version (Jan 2026) includes N:M support.
Neural Magic SparseML & DeepSparse
SparseML
PyTorch-based toolkit for creating, training, and fine-tuning sparse models. Easy integration with Hugging Face transformers.
```python
from sparseml.transformers import SparseAutoModelForCausalLM

model = SparseAutoModelForCausalLM.from_pretrained(
    "zoo:nlp/question_answering/bert/pruned"
)

# Fine-tune the sparse model with a standard HF Trainer
trainer = Trainer(model=model, args=args, train_dataset=train_data)
trainer.train()
```
DeepSparse
CPU inference engine optimized for sparse models. Drop-in replacement for PyTorch that accelerates sparse weights.
NVIDIA NeMo & Minitron
NVIDIA's NeMo framework includes structured pruning utilities and distillation. Minitron models are reference implementations.
```python
from nemo.collections.nlp.models import MegatronGPTModel

# Load a Minitron model
model = MegatronGPTModel.from_pretrained("nvidia/Minitron-8B")

# Run in half precision
model = model.half()
outputs = model.forward(input_ids)
```
PyTorch Native torch.nn.utils.prune
Built-in PyTorch module for magnitude-based unstructured pruning. Good for simple use cases.
```python
import torch.nn.utils.prune as prune

# Prune 50% of the weights in a layer (pass the module, not the tensor)
prune.l1_unstructured(model.layer, name="weight", amount=0.5)

# Make the sparsity permanent (removes the reparameterization)
prune.remove(model.layer, "weight")
```
Open-Source Implementations
| Project | Method | Framework | GitHub |
|---|---|---|---|
| WANDA | Weights & Activations | PyTorch | locuslab/wanda |
| SparseGPT | Layer-wise reconstruction | PyTorch | IST-DASLab/sparsegpt |
| Sheared LLaMA | Structured + continued pretraining | PyTorch | princeton-nlp/LLM-Shearing |
| LLM-Pruner | Task-agnostic structured | PyTorch | horseee/LLM-Pruner |
Research Papers
Key publications in LLM pruning and sparsity.
Foundational Pruning
WANDA: Pruning by Weights and Activations (2024)
Sun et al., ICLR 2024
One-shot unstructured pruning using |w| × |a| importance scoring. No retraining, matches SparseGPT quality.
SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot (2023)
Frantar & Alistarh, ICML 2023
Layer-wise Hessian-based reconstruction. Handles 50–60% unstructured sparsity with high quality.
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (2019)
Frankle & Carbin, ICLR 2019
Random sparse networks can match dense accuracy. Foundation for understanding why pruning works.
Structured Pruning for LLMs
Minitron: Compact Language Models via Pruning and Knowledge Distillation (NVIDIA, 2024)
Structured pruning + distillation. Nemotron-15B → 8B/4B with superior quality.
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning (2024)
Xia et al.
Structured pruning of LLaMA2-7B into compact 1.3B/2.7B models, followed by continued pretraining; the resulting models match or exceed equal-size models trained from scratch.
LLM-Pruner: On the Structural Pruning of Large Language Models (2024)
Ma et al.
Task-agnostic structured pruning via gradient-based importance scoring. Works on any LLM architecture.
Related Topics
Quantization
QLoRA, GPTQ, AWQ – Often combined with pruning for maximum compression.
Knowledge Distillation
Minitron, DistilBERT – Best baseline for high-quality sparse models.
Reading List
- Lottery Ticket Hypothesis (foundational intuition)
- SparseGPT (understand layer-wise approach)
- WANDA (modern one-shot method)
- Minitron (structured + distillation)
- LLM-Pruner (task-agnostic structured)
Glossary of Pruning Terms
16 key technical terms used throughout this guide, organized alphabetically.
2
| Term | Definition |
|---|---|
| 2:4 Sparsity (N:M) | A semi-structured sparsity pattern where exactly 2 out of every 4 consecutive weights are zero. Supported natively by NVIDIA Ampere/Hopper tensor cores for 2× inference speedup. |
D
| Term | Definition |
|---|---|
| DeepSparse | Neural Magic's inference engine optimized for running sparse models on CPUs, achieving GPU-like performance for pruned and quantized models without requiring GPU hardware. |
G
| Term | Definition |
|---|---|
| Global Pruning | Ranking and removing weights across the entire model against a single global importance threshold. Often outperforms layer-wise pruning but requires scoring all weights simultaneously. |
I
| Term | Definition |
|---|---|
| Iterative Pruning | Gradually increasing sparsity over multiple training rounds (prune → retrain → prune → retrain). Achieves higher sparsity with less quality loss than one-shot methods but costs more compute. |
L
| Term | Definition |
|---|---|
| Layer-wise Pruning | Pruning each layer independently to a target sparsity ratio. Simpler than global pruning but may over-prune sensitive layers. Some methods (ShortGPT) identify and remove entire layers. |
| Lottery Ticket Hypothesis | The theory (Frankle & Carbin, 2019) that dense networks contain sparse subnetworks ("winning tickets") that can match the full network's performance when trained in isolation from their initial weights. |
M
| Term | Definition |
|---|---|
| Magnitude Pruning | The simplest pruning method: removing weights with the smallest absolute values. Fast but suboptimal — small weights may still be important for certain inputs. |
| Minitron | NVIDIA's approach combining structured pruning with knowledge distillation. Compressed Nemotron 15B to 8B and 4B models that match or exceed other models in their size class. |
O
| Term | Definition |
|---|---|
| One-Shot Pruning | Pruning to the target sparsity in a single step without retraining. Fast (minutes) but may have higher quality loss. WANDA and SparseGPT are the leading one-shot methods. |
P
| Term | Definition |
|---|---|
| Pruning | Removing redundant weights, neurons, attention heads, or entire layers from a neural network to reduce size and increase speed. Can be unstructured (individual weights) or structured (entire components). |
S
| Term | Definition |
|---|---|
| ShortGPT | A structured pruning method that identifies and removes entire Transformer layers based on their contribution to the output. Found that middle layers often contribute least. |
| SparseGPT | A one-shot pruning method using approximate second-order reconstruction to optimally adjust remaining weights after pruning. Achieves 50-60% sparsity on LLMs with minimal quality loss in ~1 GPU-hour. |
| Sparsity | The fraction of zero-valued weights in a model. 50% sparsity means half the weights are zero. Higher sparsity = more compression but more quality risk. |
| Structured Pruning | Removing entire structural components (attention heads, neurons, layers, channels) rather than individual weights. Produces actual speedups on standard hardware without sparse matrix support. |
U
| Term | Definition |
|---|---|
| Unstructured Pruning | Setting individual weights to zero based on importance scores. Achieves higher compression ratios than structured pruning but requires specialized sparse hardware/software for actual speedups. |
W
| Term | Definition |
|---|---|
| WANDA | Weights AND Activations — a one-shot pruning method that scores weights by (magnitude × input activation norm). Requires only 5 minutes and a small calibration set. No retraining needed. Published at ICLR 2024. |