LLM Pruning

Remove unnecessary weights and layers to build faster, leaner models without sacrificing quality.

  • 50–90% sparsity achieved
  • 90–99% quality retained
  • 2–4× inference speedup
  • 40–70% size reduction

What is Pruning?

Removal of individual weights, neurons, entire layers, or attention heads from trained models to create sparse networks.

Dense Model → Prune → Sparse Model

Pruning Types

  • Weight Pruning: Remove individual weights with smallest magnitude
  • Neuron Pruning: Eliminate entire neurons or channels across layers
  • Layer Pruning: Remove middle transformer layers completely
  • Attention Head Pruning: Drop less important multi-head attention heads

Why Pruning?

  • Smaller Models: 40–70% model size reduction
  • Faster Inference: 2–4× speedup on hardware with sparse support
  • Lower Latency: Fewer operations per token
  • Deployment: Fits on edge devices, mobile, browsers

Pruning works best when combined with quantization for maximum compression. A 50% sparse, 8-bit quantized model can approach dense FP16 quality while being roughly 4× smaller.
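As a rough sanity check on combined compression, here is the back-of-envelope arithmetic for a 7B-parameter model, assuming naive dense storage and ignoring the index overhead that real sparse formats add:

```python
# Back-of-envelope: dense FP16 vs 50% sparse + INT8 (naive storage).
params = 7e9
fp16_gb = params * 2 / 1e9                # 2 bytes per weight
sparse_int8_gb = params * 0.5 * 1 / 1e9   # half the weights, 1 byte each

print(f"{fp16_gb:.1f} GB dense -> {sparse_int8_gb:.1f} GB, "
      f"{fp16_gb / sparse_int8_gb:.0f}x smaller")  # 14.0 GB dense -> 3.5 GB, 4x smaller
```

Real sparse formats store indices alongside values, so the achieved ratio is somewhat below this ideal.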

Fundamentals

Core concepts and methods that power modern pruning.

Magnitude Pruning (Baseline)

Simplest pruning method: remove weights with smallest absolute value. Fast, but not always optimal.

```python
# Basic magnitude pruning
import torch

weights = model.layer.weight.data
threshold = torch.quantile(torch.abs(weights), 0.5)
mask = torch.abs(weights) > threshold
weights *= mask  # Zero out small weights
```

Importance Scoring Methods

Magnitude-Based

Score = |w|. Fast but ignores input correlation.

Gradient-Based

Score = |∇L/∂w| × |w|. Uses loss sensitivity.

Hessian-Based

Score = H × |w|². Second-order approximation (expensive).

Activation-Based

Score = |w| × |activation|. WANDA approach (ICLR 2024).
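The four scores above can be put side by side on a single toy weight. All numbers here are made up, and `h` stands in for a diagonal Hessian estimate, which is itself a simplifying assumption about how the second-order score is computed:

```python
# Toy importance scores for one weight. Real implementations work on
# whole tensors; this only shows the formulas next to each other.
def magnitude_score(w):
    return abs(w)

def gradient_score(w, g):          # g: dL/dw for this weight
    return abs(g) * abs(w)

def hessian_score(w, h):           # h: diagonal Hessian estimate
    return h * w ** 2

def wanda_score(w, a):             # a: input-activation magnitude
    return abs(w) * abs(a)

w, g, h, a = -0.8, 0.05, 0.1, 2.0
print(magnitude_score(w), gradient_score(w, g),
      hessian_score(w, h), wanda_score(w, a))
```

Note how a weight with small magnitude but a large input activation can outrank a larger weight under the WANDA score; that is exactly what magnitude-only scoring misses.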

Pruning Schedule

| Method | Description | Pros | Cons |
|---|---|---|---|
| One-Shot | Prune once, no retraining | Fast, minimal compute | Lower quality at high sparsity |
| Iterative | Prune → retrain repeatedly | Higher quality, smoother | Expensive, slow |
| Gradual | Progressive sparsity increase during training | Good quality-cost tradeoff | Requires full retraining |

Sparsity Patterns

Unstructured

Any weight can be pruned. Most flexible but harder to accelerate on hardware.

Structured

Prune entire neurons, heads, or layers. Hardware-friendly but less flexible.

Semi-Structured (N:M)

N zeros per M consecutive values (e.g., 2:4). Hardware acceleration available.

The Lottery Ticket Hypothesis

Key Finding
Dense neural networks contain sparse subnetworks that, when trained in isolation from their original initialization, can match or exceed the performance of the full network. These sparse "lottery tickets" are trainable on their own.

Implication: a 50% sparse subnetwork, reset to its original initialization, can reach ~99% of dense accuracy with proper training, suggesting that much of a dense network's capacity is redundant.

Pruning vs. Distillation vs. Quantization

| Technique | What It Does | Speedup | Quality Loss | Combined? |
|---|---|---|---|---|
| Pruning | Remove weights/layers | 2–4× | 5–20% | Yes, with all |
| Quantization | Reduce precision (FP32→INT8) | 2–3× | 2–10% | Yes, with all |
| Distillation | Train small model on large model outputs | Variable | 0–5% | Best baseline |

Unstructured Pruning

Remove individual weights to achieve maximum flexibility in sparsity patterns.

WANDA (Pruning by Weights AND Activations) — ICLR 2024

State-of-the-art for one-shot LLM pruning. Prunes weights with smallest magnitude × activation product.

Key Innovation: Per-output basis pruning without retraining. WANDA matches SparseGPT quality with 10× less compute.
```python
def wanda_prune(model, sparsity_level, calibration_data):
    for layer in model.layers:
        weight = layer.weight            # shape: [out, in]
        activations = layer.get_activations(calibration_data)

        # Importance: |w| * mean |a| per input feature
        w_abs = torch.abs(weight)
        a_abs = torch.abs(activations).mean(dim=0)   # [in]
        importance = w_abs * a_abs.unsqueeze(0)

        # Threshold per output row (WANDA prunes on a per-output basis)
        threshold = torch.quantile(importance, sparsity_level,
                                   dim=1, keepdim=True)
        mask = importance > threshold

        # Apply pruning
        layer.weight.data *= mask.float()
    return model
```
✓ No retraining · ✓ Fast (minutes) · ✓ Matches SparseGPT quality

SparseGPT

Layer-wise optimal brain surgeon with second-order approximation. Handles 50–60% unstructured sparsity well.

Method: For each layer, solve a per-output reconstruction problem using an inverse-Hessian approximation, updating the remaining weights to minimize layer output error.
```python
# Simplified SparseGPT reconstruction
for layer in model.layers:
    # Get Hessian inverse (expensive)
    H = layer.get_hessian_inverse(calibration_data)
    for output_idx in range(layer.out_features):
        w = layer.weight[output_idx]  # [in]
        # Find least important weights
        importance = torch.abs(w) / torch.sqrt(H.diagonal())
        prune_idx = importance.argsort()[:num_to_prune]
        # Zero them out, then update surviving weights to compensate
        w[prune_idx] = 0
        # ... reconstruction math ...
```

Method Comparison: Wanda vs SparseGPT vs Magnitude

| Method | Scoring | Retraining | Speed | Quality @ 50% Sparse |
|---|---|---|---|---|
| WANDA | abs(w) × abs(a) | No | 5 min (LLaMA-7B) | 92–94% retained |
| SparseGPT | Hessian-based reconstruction | No | 1–2 hours | 93–95% retained |
| Magnitude | abs(w) only | No | <1 min | 88–90% retained |
Hardware Limitation
Unstructured sparsity requires sparse tensor operations, and most hardware (GPUs, TPUs) doesn't efficiently support arbitrary sparse patterns. NVIDIA GPUs offer limited sparse support; CPUs are even slower. Use structured pruning or N:M sparsity for practical deployment.

Structured Pruning

Remove entire neurons, layers, or attention heads for hardware-friendly sparsity.

Layer Removal (ShortGPT)

Remove middle transformer layers while keeping embedding and output layers. NVIDIA's Nemotron: 15B → 8B/4B.

```python
# Remove middle layers
original_layers = 48
target_layers = 24
layers_to_remove = original_layers // 2

# Keep first 12, remove middle 24, keep last 12
new_model = Model(
    layers=model.layers[:12] + model.layers[36:]
)

# Quick fine-tune on calibration data
fine_tune(new_model, calibration_data, epochs=3)
```

Attention Head Pruning

Remove least important heads based on activation magnitude or gradient flow. Typical: 5–15% of heads are redundant.

```python
# Measure head importance
for layer in model.layers:
    multi_head = layer.self_attn
    num_heads = multi_head.num_heads

    # Score each head by its average attention output magnitude
    head_scores = []
    for h in range(num_heads):
        head_attn = multi_head.attn_output[h]
        head_scores.append(head_attn.abs().mean())

    # Remove the 5 lowest-scoring heads (schematic; real code must also
    # slice the q/k/v/output projection weights accordingly)
    heads_to_remove = argsort(head_scores)[:5]
    remove_heads(multi_head, heads_to_remove)
```

Width Pruning

Reduce hidden dimensions (FFN inner dimension, embedding dimension). Less common but effective.
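A minimal width-pruning sketch: shrink an FFN's inner dimension by keeping the columns of the input projection (and the matching rows of the output projection) with the largest norm. Pure Python with nested lists for illustration; real code operates on framework tensors, and the function name is invented here:

```python
# Width pruning: drop low-norm inner units of a 2-layer FFN block.
def prune_ffn_width(w_in, w_out, keep):
    # w_in: [hidden][inner], w_out: [inner][hidden]
    inner = len(w_in[0])
    norms = [sum(row[j] ** 2 for row in w_in) for j in range(inner)]
    kept = sorted(range(inner), key=lambda j: -norms[j])[:keep]
    kept.sort()  # preserve the original ordering of surviving units
    new_in = [[row[j] for j in kept] for row in w_in]
    new_out = [w_out[j] for j in kept]
    return new_in, new_out

w_in = [[1.0, 0.1, 2.0], [1.0, 0.1, 2.0]]        # inner dim = 3
w_out = [[0.5, 0.5], [0.6, 0.6], [0.7, 0.7]]
new_in, new_out = prune_ffn_width(w_in, w_out, keep=2)
# Columns 0 and 2 survive (largest norms); inner dim shrinks 3 -> 2
```

Unlike weight masking, this produces genuinely smaller matrices, so the speedup needs no sparse hardware support.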

Minitron Approach (NVIDIA, 2024)

Structured Pruning + Knowledge Distillation
  • 1. Identify and remove least important layers/heads
  • 2. Fine-tune on diverse data
  • 3. Distill from original Nemotron-15B
  • 4. Result: Minitron-8B & 4B match or exceed other 7-8B models

Sheared LLaMA & LLM-Pruner

Sheared LLaMA: Structured pruning of LLaMA-7B tailored to specific downstream tasks, achieving 15–30% fewer parameters.

LLM-Pruner: Task-agnostic structured pruning using importance scores from gradient flow. Works on any LLM.

Structured Pruning Pros & Cons

Advantages

  • ✓ Hardware-friendly (drop layers, narrow tensors)
  • ✓ Easy to implement
  • ✓ Works with standard training pipelines
  • ✓ Minimal memory overhead

Disadvantages

  • ✗ Less flexible (can't remove single weights)
  • ✗ Layer collapse at high sparsity
  • ✗ May require more retraining
  • ✗ Quality-sparsity tradeoff worse than unstructured

N:M Sparsity

Semi-structured sparsity with hardware acceleration support.

N:M sparsity means at most N non-zero values in every group of M consecutive weights. Most common: 2:4 and 4:8 (both 50% sparsity).

Dense: 1.2  -0.5   0.8   2.1  |  -0.3   1.5   0.2  -0.9
2:4:   1.2   0     0     2.1  |   0     1.5   0    -0.9

8 values total; each group of 4 consecutive weights keeps its 2 largest-magnitude values and zeros the other 2 (50% sparsity).
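Enforcing the pattern can be sketched in plain Python (illustrative only; real implementations operate on tensors and hand the resulting layout to sparse kernels):

```python
# Minimal N:M enforcement: in every group of m consecutive values,
# keep the n largest-magnitude entries and zero the rest.
def enforce_nm(values, n, m):
    out = []
    for start in range(0, len(values), m):
        group = values[start:start + m]
        keep = sorted(range(len(group)), key=lambda i: -abs(group[i]))[:n]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out

dense = [1.2, -0.5, 0.8, 2.1, -0.3, 1.5, 0.2, -0.9]
sparse = enforce_nm(dense, n=2, m=4)
# sparse == [1.2, 0.0, 0.0, 2.1, 0.0, 1.5, 0.0, -0.9]
```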

Hardware Support

| Hardware | Pattern Support | Acceleration | Speedup @ 2:4 |
|---|---|---|---|
| NVIDIA A100 (Ampere) | 2:4, fine-grained | Sparse tensor cores | ~2× |
| NVIDIA H100 (Hopper) | 2:4, fine-grained | Sparse tensor cores | ~2.5× |
| Intel Gaudi | Structured only | Limited | ~1.2× |
| CPU | Any | None (dense compute) | ~1× |

WANDA Extended to N:M

Apply WANDA importance scores but enforce N:M constraint: for each group of M values, keep only N highest-importance ones.

def wanda_nm_prune(weight, activations, n, m): # Compute importance importance = torch.abs(weight) * activations.abs().mean(dim=0) # Reshape for N:M constraint out_features, in_features = weight.shape importance_reshaped = importance.view(out_features, in_features // m, m) # For each group, keep top N values mask = torch.zeros_like(importance_reshaped) for i in range(out_features): for j in range(in_features // m): group = importance_reshaped[i, j] top_indices = torch.topk(group, n)[1] mask[i, j, top_indices] = 1 mask = mask.view_as(weight) return weight * mask

Trade-offs

2:4 Sparsity (50%)

  • ✓ 2× speedup on A100/H100
  • ✓ Minimal quality loss (1–3%)
  • ✓ Good practical balance

4:8 Sparsity (50%)

  • ✓ More flexible zero placement within each group of 8
  • ✓ Quality loss similar to 2:4
  • ✗ Less hardware support

Progressive Pruning

Iterative pruning with retraining for higher quality at extreme sparsity.

Iterative Magnitude Pruning (IMP)

Repeatedly prune, retrain, and reset surviving weights to their original values.

```python
def iterative_magnitude_pruning(model, target_sparsity, num_iterations):
    original_weights = {k: v.clone() for k, v in model.state_dict().items()}

    for iteration in range(num_iterations):
        # Prune another slice of the weights
        pruning_amount = target_sparsity / num_iterations
        apply_magnitude_pruning(model, pruning_amount)

        # Retrain
        optimizer = AdamW(model.parameters(), lr=1e-4)
        for epoch in range(5):
            for batch in train_loader:
                optimizer.zero_grad()
                loss = compute_loss(model, batch)
                loss.backward()
                optimizer.step()

        # Reset surviving weights to their original values (lottery ticket)
        for name, param in model.named_parameters():
            mask = param.data != 0
            param.data[mask] = original_weights[name][mask]

    return model
```

Gradual Pruning During Training

Increase sparsity gradually during the full training run using polynomial schedules.

```python
# Cubic pruning schedule
def cubic_schedule(step, total_steps, initial_sparsity, target_sparsity):
    progress = step / total_steps
    return initial_sparsity + (target_sparsity - initial_sparsity) * progress ** 3

# During the training loop
for step, batch in enumerate(train_loader):
    current_sparsity = cubic_schedule(step, total_steps, 0.0, 0.5)
    apply_pruning(model, current_sparsity)

    optimizer.zero_grad()
    logits = model(batch)
    loss = compute_loss(logits, batch)
    loss.backward()
    optimizer.step()
```

Lottery Ticket Rewinding

After pruning, reset weights to their values from an early training checkpoint, then continue retraining.

Why It Works
Early training initializations often find good lottery tickets. Rewinding to these checkpoints + pruning can outperform training from scratch.
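The rewinding step can be sketched with dicts of plain floats standing in for parameter tensors (names and values here are invented for illustration):

```python
# Rewinding: after pruning, surviving weights are reset to their values
# from an early checkpoint (step k), not kept at their final values.
def rewind(final_weights, early_checkpoint, mask):
    rewound = {}
    for name, w in final_weights.items():
        # Pruned positions stay zero; survivors take the early values
        rewound[name] = [
            early_checkpoint[name][i] if mask[name][i] else 0.0
            for i in range(len(w))
        ]
    return rewound

final = {"fc": [0.9, -0.1, 0.7]}
early = {"fc": [0.4, -0.3, 0.6]}      # weights at early step k
mask = {"fc": [True, False, True]}    # weight 1 was pruned
print(rewind(final, early, mask))     # {'fc': [0.4, 0.0, 0.6]}
```

Training then resumes from the rewound weights with the pruning mask held fixed.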

One-Shot vs Progressive Trade-offs

| Method | Quality @ 50% | Quality @ 80% | Training Time | Practical Use |
|---|---|---|---|---|
| One-Shot (WANDA) | 92% | 75% | 5 min | Quick iteration |
| Iterative (5 rounds) | 94% | 82% | 3–5 hours | Higher quality target |
| Gradual (full training) | 95% | 85% | 20 hours | From-scratch training |

Training Recipes

Practical pipelines for different pruning scenarios.

Post-Training Pruning (No Retraining)

Best for quick results. Use WANDA or SparseGPT for quality one-shot pruning.

```python
# Load a pre-trained model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")

# Calibrate on a small sample (8 samples × 128 tokens)
calibration_data = load_calibration_data(n_samples=8, seq_len=128)

# Apply WANDA (~5 minutes for a 7B model)
model = wanda_prune(model, sparsity_level=0.5, calibration_data=calibration_data)

# Evaluate
perplexity = evaluate_perplexity(model, validation_data)
print(f"50% sparse model: {perplexity} perplexity")
```

Prune-Then-Retrain

Prune once, then fine-tune on task-specific data for 1–3 epochs.

```python
# 1. Prune
model = wanda_prune(model, sparsity_level=0.6, calibration_data=calibration_data)

# 2. Fine-tune on the downstream task
optimizer = AdamW(model.parameters(), lr=1e-4)
for epoch in range(3):
    for batch in task_train_loader:
        logits = model(batch["input_ids"])
        loss = F.cross_entropy(logits, batch["labels"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# 3. Evaluate on the downstream task
accuracy = evaluate_downstream(model, task_val_loader)
print(f"Task accuracy: {accuracy}")
```

Pruning + Distillation (Minitron Style)

Combine structured pruning with knowledge distillation for best quality.

```python
# 1. Identify layers to keep (gradient-based importance)
important_layers = identify_important_layers(model, calibration_data)

# 2. Create the pruned architecture
pruned_model = create_pruned_model(model, important_layers)

# 3. Distill from a larger teacher
teacher_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b")
teacher_model.eval()

kl_loss_fn = nn.KLDivLoss(reduction="batchmean")
optimizer = AdamW(pruned_model.parameters(), lr=5e-5)

for batch in distill_train_loader:
    optimizer.zero_grad()
    student_logits = pruned_model(batch)
    with torch.no_grad():
        teacher_logits = teacher_model(batch)

    # KL divergence loss at temperature T = 3
    T = 3
    loss = kl_loss_fn(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
    )
    loss.backward()
    optimizer.step()
```

Pruning + Quantization (Maximum Compression)

Combine 50% sparsity with INT8 quantization for roughly 4× compression versus dense FP16 (more with INT4).

```python
# 1. Prune to 50% sparsity
model = wanda_prune(model, sparsity_level=0.5, calibration_data=calibration_data)

# 2. Quantize to INT8 (e.g., via bitsandbytes)
quantized_model = convert_to_8bit(model)  # pseudocode wrapper

# 3. Quick recalibration pass
for batch in calibration_data:
    _ = quantized_model(batch)

# Result: 50% sparse + INT8 is ~4× smaller than dense FP16
print(f"Final size: {quantized_model.get_memory_footprint() / 1e9} GB")
```

Benchmarks

Real-world performance metrics across models and methods.

Perplexity vs Sparsity (LLaMA-7B, WikiText-2)

| Sparsity | Dense Baseline | Magnitude | WANDA | SparseGPT |
|---|---|---|---|---|
| 0% (Dense) | 5.68 | — | — | — |
| 30% | — | 5.95 | 5.81 | 5.78 |
| 50% | — | 6.52 | 6.09 | 6.03 |
| 60% | — | 7.41 | 6.65 | 6.51 |
| 70% | — | 9.12 | 7.45 | 7.22 |

(Perplexity; lower is better.)
Key Insight
At 50% sparsity, WANDA and SparseGPT increase perplexity by only ~6–7% relative to dense, while magnitude pruning loses ~15%. This gap grows with sparsity.

Quality Retention at Different Sparsity Levels

| Sparsity Level | Zero-Shot Accuracy (Avg, % of dense) | MMLU | HellaSwag |
|---|---|---|---|
| Dense | 100% | 32% | 78% |
| 50% (WANDA) | 99% | 31% | 77% |
| 60% (WANDA) | 97% | 29% | 74% |
| 70% (WANDA) | 91% | 26% | 68% |
| 80% (WANDA) | 75% | 18% | 52% |

Speed Improvements with N:M Sparsity

| Hardware | 2:4 Sparsity | Tokens/s (Dense) | Tokens/s (Sparse) | Speedup |
|---|---|---|---|---|
| A100 (40GB) | Yes | 450 | 920 | 2.04× |
| H100 (80GB) | Yes | 580 | 1450 | 2.50× |
| A10 GPU | No | 120 | 125 | 1.04× |
| CPU (AMD EPYC) | No | 5 | 6 | 1.2× |

Model Size & Memory Savings

| Model | Original Size | 50% Sparse | 50% Sparse + INT8 | Memory Saved |
|---|---|---|---|---|
| LLaMA-7B (FP16) | 13 GB | 6.5 GB | 3.3 GB | 75% |
| LLaMA-13B (FP16) | 26 GB | 13 GB | 6.5 GB | 75% |
| Llama 2-70B (FP16) | 140 GB | 70 GB | 35 GB | 75% |

Model Directory

Pre-pruned models and base models suitable for pruning.

Pre-Pruned Models on Hugging Face

| Model | Base Model | Pruning Method | Sparsity | Quality | Link |
|---|---|---|---|---|---|
| Minitron-8B | Nemotron-15B | Structured + distill | 47% | Matches 7B baseline | nvidia/Minitron-8B |
| Minitron-4B | Nemotron-15B | Structured + distill | 73% | Matches 3B baseline | nvidia/Minitron-4B |
| Sparse-LLaMA-7B | LLaMA-7B | WANDA (unstructured) | 50% | 93% baseline quality | vwxyzjn/sparse-llama-7b |
| Sheared-LLaMA | LLaMA-7B | Structured, task-aware | 20–30% | Task-dependent | princeton-nlp/sheared-llama-* |

Recommended Base Models for Pruning

Llama 2 Series

  • meta-llama/Llama-2-7b
  • meta-llama/Llama-2-13b
  • meta-llama/Llama-2-70b
  • Easy to prune, well-documented

Mistral Series

  • mistralai/Mistral-7B-v0.1
  • mistralai/Mixtral-8x7B
  • Good efficiency before pruning

Phi Series

  • microsoft/phi-1.5
  • microsoft/phi-2
  • Small, prunes well

OLMo Series

  • allenai/OLMo-1B
  • allenai/OLMo-7B
  • Research models, open

Deployment

Running pruned models in production with optimized inference.

DeepSparse Engine (Neural Magic)

Optimized CPU inference for sparse models. Perfect for edge deployment without GPU.

```python
from deepsparse import Pipeline

# Load a sparse model from the SparseZoo
pipeline = Pipeline.create(
    task="text_generation",
    model_path="zoo:nlp/text_generation/openai-gpt2/pruned_quantized",
    engine_type="deepsparse",
)

# Inference (optimized for sparsity)
result = pipeline(prompt="The future of AI is", max_length=100)
print(result.generations[0].text)
```

NVIDIA Sparse Tensor Cores (A100/H100)

Use structured (2:4) sparsity with NVIDIA's CUTLASS or TensorRT-LLM.

```python
# Use TensorRT-LLM with sparse plugins
import tensorrt_llm as trt_llm

# Build a sparse model engine
builder = trt_llm.Builder()
# (auto-detects 2:4 sparsity in weights)
engine = builder.build_engine(
    model_path="./llama-7b-50-sparse.safetensors",
    precision="float16",
    max_batch_size=32,
)

# Serialize and deploy
engine.save("./llama-sparse.engine")
```

vLLM Integration

Use vLLM's PagedAttention with sparse models for efficient batching.

```python
from vllm import LLM, SamplingParams

# Load a sparse model
llm = LLM(
    model="sparse-llama-7b:50-percent",
    dtype="float16",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.7,
)

prompts = ["What is AI?", "How does ML work?"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

Custom Sparse Kernel Implementation

For ultimate control, implement custom sparse matrix-vector multiplication (e.g., in CUDA/Triton).
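For intuition, here is the loop a sparse matrix-vector kernel performs over the CSR (compressed sparse row) format, in plain Python. A real kernel parallelizes this in CUDA or Triton; the compressed format is what lets it skip the zeros entirely:

```python
# CSR sparse matrix-vector product (reference implementation).
def csr_matvec(data, indices, indptr, x):
    # data: nonzero values; indices: their column ids;
    # indptr[i]:indptr[i+1] delimits row i's nonzeros
    y = []
    for i in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y

# 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form
data, indices, indptr = [1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3]
print(csr_matvec(data, indices, indptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0]
```

The indirection through `indices` is why irregular sparsity is memory-bound: without a predictable pattern (like 2:4), the hardware cannot coalesce these loads.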

Performance Tip
Unstructured sparsity requires custom kernels for speedup. Most frameworks default to dense compute on unstructured sparse matrices, negating speedup. Prefer structured or 2:4 sparsity for guaranteed acceleration.

Cost Analysis

Economic impact of pruning in training and inference.

Training Cost Comparison

| Method | Time (7B Model) | Compute Cost | GPU Hours | Cost @ $1/hr |
|---|---|---|---|---|
| WANDA (one-shot) | 5 min | 1× (calibration) | 0.08 | $0.08 |
| SparseGPT (one-shot) | 1 hour | 1× (reconstruction) | 1 | $1.00 |
| Iterative (5 rounds) | 3–5 hours | 5× (retraining) | 20 | $20.00 |
| From-scratch training | 20 hours | 1× (full training) | 20 | $20.00 |

Inference Savings (Annual Estimate)

Scenario: 1M requests/day, 10 days retention

| Model Setup | A100 GPUs Needed | Annual Cost (@ $50k/GPU/yr) | Savings vs Dense |
|---|---|---|---|
| Dense LLaMA-7B | 8 | $400k | — |
| 50% Sparse (WANDA) | 4 | $200k | 50% savings |
| 50% Sparse + Quantized | 2 | $100k | 75% savings |

Pruning vs Quantization Trade-offs

| Technique | Size Reduction | Speedup | Quality Loss | Ease of Use |
|---|---|---|---|---|
| Quantization Alone (INT8) | 2× | 2–3× | 2–10% | Easy |
| Pruning Alone (50%) | ~2× | 2–4× (with H/W support) | 5–10% | Medium |
| Both (50% Sparse + INT8) | ~4× | 4–6× | 8–15% | Hard |

Cloud GPU Pricing Reference (March 2026)

| GPU | VRAM | On-Demand $/hr | Spot $/hr | Pruning Use |
|---|---|---|---|---|
| A100 80GB | 80 GB | $2.00-3.00 | $1.00-1.80 | WANDA/SparseGPT on 7-13B; 2:4 sparsity inference |
| H100 SXM | 80 GB | $2.40-4.00 | $1.50-2.50 | N:M sparse tensor core acceleration; best throughput |
| 4× A100 80GB | 320 GB | $8.00-12.00 | $4.00-7.00 | Structured pruning + retraining for 70B models |

Pruned Self-Host vs API — Total Cost (1M req/day, 500 tok/req)

| Approach | Monthly Cost | Annual Cost | Latency (P50) | vs GPT-4o API |
|---|---|---|---|---|
| GPT-4o API | $7,500 | $90,000 | 500-2000ms | Baseline |
| GPT-4o-mini API | $1,125 | $13,500 | 200-800ms | 85% cheaper |
| Llama-8B Dense (A100) | $2,160 | $25,920 | 50-150ms | 71% cheaper |
| Llama-8B 50% Sparse (A100, 2:4) | $1,440 | $17,280 | 25-80ms | 81% cheaper |
| Llama-8B Sparse + INT4 (A40) | $576 | $6,912 | 30-100ms | 92% cheaper |

When to Prune vs Just Quantize

Prune If:

  • ✓ You have H/W with sparse support (A100/H100 tensor cores)
  • ✓ Model latency is critical (<50ms P99)
  • ✓ You need 4-6x combined speedup (sparse + quant)
  • ✓ You can spare 5 min (WANDA) to 5 hours per model
  • ✓ Serving cost is >$2K/month and you need deeper savings

Quantize Instead If:

  • ✓ You have no sparse H/W (consumer GPUs, CPU)
  • ✓ You need quick deployment (<1 hour)
  • ✓ Model size (VRAM) is the bottleneck, not throughput
  • ✓ 2-3x speedup is acceptable for your SLA
  • ✓ Quality sensitivity is very high (<1% loss tolerance)
Maximum Savings Recipe: WANDA 2:4 sparsity ($0.08, 5 min) + AWQ INT4 quantization ($0, 30 min) + vLLM serving on A40 ($576/mo) = 92% cheaper than GPT-4o API with 5-10x lower latency. One-time effort: under 1 hour. Payback: immediate.

Failure Modes & Mitigation

Common pitfalls and how to avoid them.

Over-Pruning Causing Catastrophic Loss

Problem
Pruning too aggressively (>80%) causes irreversible quality degradation. Model loses critical information capacity.

Mitigation Strategies

  • 1. Use progressive/iterative pruning instead of one-shot at high sparsity
  • 2. Validate perplexity at each pruning step
  • 3. Stay <70% sparsity unless using distillation
  • 4. Use importance-aware scoring (WANDA, not magnitude alone)

Layer Collapse in Structured Pruning

Problem
When pruning layers/neurons, some outputs can collapse to near-zero activation, breaking downstream layers.
```python
# Detect layer collapse during pruning
def detect_collapse(model, calibration_data):
    activations = model.get_layer_activations(calibration_data)
    for layer_name, act in activations.items():
        # Check whether the output distribution is degenerate
        mean = act.abs().mean()
        std = act.std()
        if std / mean < 0.1:  # suspiciously low variance
            print(f"WARNING: {layer_name} may be collapsing")
            return True
    return False
```

Mitigation Strategies

  • 1. Monitor layer output statistics during pruning
  • 2. Use activation-aware importance (WANDA uses this)
  • 3. Retrain for 2–3 epochs after structured pruning
  • 4. Preserve skip connections; don't prune residual paths

Sparsity Pattern Hardware Incompatibility

Problem
Unstructured sparsity is difficult to accelerate. Most hardware either doesn't support it or falls back to dense compute.

Mitigation Strategies

  • 1. Use 2:4 structured sparsity for NVIDIA GPUs
  • 2. Use DeepSparse Engine for CPU deployment
  • 3. Verify speedup in your target hardware before production
  • 4. Consider structured pruning instead if sparsity speedup fails

Retraining Instability After Aggressive Pruning

Problem
Post-pruning retraining can be unstable with high learning rates. Gradients explode or vanish in pruned layers.
```python
# Stable retraining recipe
optimizer = AdamW(model.parameters(), lr=1e-5)  # 10× smaller than typical
scheduler = CosineAnnealingLR(optimizer, T_max=10)

for epoch in range(10):
    for batch in train_loader:
        logits = model(batch)
        loss = F.cross_entropy(logits, batch["labels"])
        loss.backward()
        # Gradient clipping for stability
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
    scheduler.step()
```

Mitigation Strategies

  • 1. Use 10× smaller learning rate for post-pruning retraining
  • 2. Apply gradient clipping (norm ≤ 1.0)
  • 3. Warm up learning rate gradually
  • 4. Use fewer epochs (2–5) but smaller batches
  • 5. Consider layer-wise adaptive retraining rates

Common Debugging Checklist

  • ☐ Is pruned model's speedup actually measured on target hardware?
  • ☐ Are layer outputs collapsing to near-zero?
  • ☐ Did you validate perplexity on calibration data?
  • ☐ Are weights being zeroed correctly (check sparsity %)?
  • ☐ Is retraining learning rate too high?
  • ☐ Did you preserve batch norm statistics post-pruning?
  • ☐ Are skip connections preserved in structured pruning?
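For the sparsity-percentage item on the checklist above, a minimal counter (plain-Python stand-in; with framework tensors the equivalent would be something like `(w == 0).float().mean()`):

```python
# Verify the achieved sparsity by counting exact zeros per parameter.
def sparsity(weights):
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)

layer = [[0.0, 1.3, 0.0, -0.2],
         [0.0, 0.0, 0.9, 0.4]]
print(f"sparsity: {sparsity(layer):.0%}")  # sparsity: 50%
```

If the measured figure is far below the target, the mask was likely applied to a copy of the weights rather than the live parameters.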

Tools & Frameworks

Software ecosystem for LLM pruning.

LLM-Pruning Collection (Jan 2026)

LLM-Pruning via Language Models (LLM-P)

JAX-based pruning framework from Princeton Z-Lab. Supports WANDA, SparseGPT, structured pruning with unified API. Latest version (Jan 2026) includes N:M support.

GitHub: princeton-nlp/llm-pruning JAX Active

Neural Magic SparseML & DeepSparse

SparseML

PyTorch-based toolkit for creating, training, and fine-tuning sparse models. Easy integration with Hugging Face transformers.

```python
from sparseml.transformers import SparseAutoModelForCausalLM

model = SparseAutoModelForCausalLM.from_pretrained(
    "zoo:nlp/question_answering/bert/pruned"
)

# Fine-tune the sparse model
trainer = Trainer(model=model, args=args, train_dataset=train_data)
trainer.train()
```
GitHub: neuralmagic/sparseml PyTorch

DeepSparse

CPU inference engine optimized for sparse models. Runs exported models with sparsity-aware kernels, accelerating pruned weights without a GPU.

GitHub: neuralmagic/deepsparse Production-Ready

NVIDIA NeMo & Minitron

NVIDIA's NeMo framework includes structured pruning utilities and distillation. Minitron models are reference implementations.

```python
from nemo.collections.nlp.models import MegatronGPTModel

# Load a Minitron model
model = MegatronGPTModel.from_pretrained("nvidia/Minitron-8B")

# Mixed precision inference
model = model.half()
outputs = model.forward(input_ids)
```
GitHub: NVIDIA/NeMo Enterprise

PyTorch Native torch.nn.utils.prune

Built-in PyTorch module for magnitude-based unstructured pruning. Good for simple use cases.

```python
import torch.nn.utils.prune as prune

# Prune 50% of the weights in a layer (takes the module, not the tensor)
prune.l1_unstructured(model.layer, name="weight", amount=0.5)

# Make the sparsity permanent
prune.remove(model.layer, "weight")
```
Built-in PyTorch Simple

Open-Source Implementations

| Project | Method | Framework | GitHub |
|---|---|---|---|
| WANDA | Weights & activations | PyTorch | locuslab/wanda |
| SparseGPT | Layer-wise reconstruction | PyTorch | IST-DASLab/sparsegpt |
| Sheared LLaMA | Structured, task-aware | PyTorch | princeton-nlp/LLM-Shearing |
| LLM-Pruner | Task-agnostic structured | PyTorch | horseee/LLM-Pruner |

Research Papers

Key publications in LLM pruning and sparsity.

Foundational Pruning

WANDA: A Simple and Effective Pruning Approach for Large Language Models (2024)

Sun et al., ICLR 2024

One-shot unstructured pruning using |w| × |a| importance scoring. No retraining, matches SparseGPT quality.

State-of-the-art one-shot ICLR 2024

SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot (2023)

Frantar & Alistarh, ICML 2023

Layer-wise Hessian-based reconstruction. Handles 50–60% unstructured sparsity with high quality.

Influential ICML 2023

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (2019)

Frankle & Carbin, ICLR 2019

Sparse subnetworks, retrained from their original initialization, can match dense accuracy. Foundation for understanding why pruning works.

Foundational ICLR 2019

Structured Pruning for LLMs

Minitron: Compact Language Models via Pruning and Knowledge Distillation (NVIDIA, 2024)

Muralidharan et al.

Structured pruning + distillation. Nemotron-15B → 8B/4B with superior quality.

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning (2024)

Xia et al.

Task-aware structured pruning of LLaMA-7B. 15–30% parameter reduction with task-specific gains.

LLM-Pruner: On the Structural Pruning of Large Language Models (2023)

Ma et al.

Task-agnostic structured pruning via gradient-based importance scoring. Works on any LLM architecture.

Related Topics

Quantization

QLoRA, GPTQ, AWQ – Often combined with pruning for maximum compression.

Knowledge Distillation

Minitron, DistilBERT – Best baseline for high-quality sparse models.

Reading List

Recommended Order
  1. Lottery Ticket Hypothesis (foundational intuition)
  2. SparseGPT (understand layer-wise approach)
  3. WANDA (modern one-shot method)
  4. Minitron (structured + distillation)
  5. LLM-Pruner (task-agnostic structured)

Glossary of Pruning Terms

16 key technical terms used throughout this guide, organized alphabetically.

2

2:4 Sparsity (N:M): A semi-structured sparsity pattern where exactly 2 out of every 4 consecutive weights are zero. Supported natively by NVIDIA Ampere/Hopper tensor cores for 2× inference speedup.

D

DeepSparse: Neural Magic's inference engine optimized for running sparse models on CPUs, achieving GPU-like performance for pruned and quantized models without requiring GPU hardware.

G

Global Pruning: Ranking and removing weights across the entire model based on a single global importance threshold. Produces better results than layer-wise pruning but requires analyzing all weights simultaneously.

I

Iterative Pruning: Gradually increasing sparsity over multiple training rounds (prune → retrain → prune → retrain). Achieves higher sparsity with less quality loss than one-shot methods but costs more compute.

L

Layer-wise Pruning: Pruning each layer independently to a target sparsity ratio. Simpler than global pruning but may over-prune sensitive layers. Some methods (ShortGPT) identify and remove entire layers.

Lottery Ticket Hypothesis: The theory (Frankle & Carbin, 2019) that dense networks contain sparse subnetworks ("winning tickets") that can match the full network's performance when trained in isolation from their initial weights.

M

Magnitude Pruning: The simplest pruning method: removing weights with the smallest absolute values. Fast but suboptimal, since small weights may still be important for certain inputs.

Minitron: NVIDIA's approach combining structured pruning with knowledge distillation. Compressed Nemotron 15B to 8B and 4B models that match or exceed other models in their size class.

O

One-Shot Pruning: Pruning to the target sparsity in a single step without retraining. Fast (minutes) but may have higher quality loss. WANDA and SparseGPT are the leading one-shot methods.

P

Pruning: Removing redundant weights, neurons, attention heads, or entire layers from a neural network to reduce size and increase speed. Can be unstructured (individual weights) or structured (entire components).

S

ShortGPT: A structured pruning method that identifies and removes entire Transformer layers based on their contribution to the output. Found that middle layers often contribute least.

SparseGPT: A one-shot pruning method using approximate second-order reconstruction to optimally adjust remaining weights after pruning. Achieves 50-60% sparsity on LLMs with minimal quality loss in ~1 GPU-hour.

Sparsity: The fraction of zero-valued weights in a model. 50% sparsity means half the weights are zero. Higher sparsity = more compression but more quality risk.

Structured Pruning: Removing entire structural components (attention heads, neurons, layers, channels) rather than individual weights. Produces actual speedups on standard hardware without sparse matrix support.

U

Unstructured Pruning: Setting individual weights to zero based on importance scores. Achieves higher compression ratios than structured pruning but requires specialized sparse hardware/software for actual speedups.

W

WANDA: Weights AND Activations, a one-shot pruning method that scores weights by (magnitude × input activation norm). Requires only 5 minutes and a small calibration set. No retraining needed. Published at ICLR 2024.
Full Reference: For a comprehensive glossary covering ALL LLM topics across all documents, see the unified LLM Glossary with 140+ terms.