LLM Pruning
Remove unnecessary weights and layers to build faster, leaner models without sacrificing quality.
What is Pruning?
Removal of weights, neurons, layers, or attention heads from trained models to create sparse networks. Pruning can be unstructured (individual weights) or structured (entire components).
Pruning Types
- Weight Pruning: Remove individual weights with smallest magnitude
- Neuron Pruning: Eliminate entire neurons (whole rows/columns of weight matrices) across layers
- Layer Pruning: Remove middle transformer layers completely
- Attention Head Pruning: Drop less important multi-head attention heads
Why Pruning?
- Smaller Models: 40–70% model size reduction
- Faster Inference: 2–4× speedup on hardware with sparse support
- Lower Latency: Fewer operations per token
- Deployment: Fits on edge devices, mobile, browsers
Pruning works best when combined with quantization for maximum compression. A 50% sparse + 8-bit quantized model can match dense FP16 models while being 6-8× smaller.
Fundamentals
Core concepts and methods that power modern pruning.
Magnitude Pruning (Baseline)
Simplest pruning method: remove weights with smallest absolute value. Fast, but not always optimal.
```python
# Basic magnitude pruning: zero out the 50% of weights with smallest |w|
import torch

weights = model.layer.weight.data
threshold = torch.quantile(torch.abs(weights), 0.5)
mask = torch.abs(weights) > threshold
weights *= mask  # zero out small weights
```
Importance Scoring Methods
Magnitude-Based
Score = |w|. Fast but ignores input correlation.
Gradient-Based
Score = |∂L/∂w| × |w|. Uses loss sensitivity.
Hessian-Based
Score ≈ Hᵢᵢ × wᵢ². Second-order (OBD-style) approximation; expensive to compute.
Activation-Based
Score = |w| × |activation|. WANDA approach (ICLR 2024).
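The difference between the first two scoring rules can be seen on a toy weight matrix. The sketch below uses made-up numbers and plain Python lists (no framework) to show that magnitude-only and activation-aware scores can disagree about which weight to prune first.

```python
# Toy contrast between magnitude-only and activation-aware (WANDA-style)
# scoring; pure-Python sketch with made-up numbers.

weights = [[0.9, -0.1, 0.2],
           [0.3,  0.8, -0.4]]        # 2 output neurons x 3 inputs
act_norm = [0.1, 3.0, 1.0]           # mean |activation| per input feature

def magnitude_scores(w):
    return [[abs(x) for x in row] for row in w]

def wanda_scores(w, a):
    return [[abs(x) * a[j] for j, x in enumerate(row)] for row in w]

mag = magnitude_scores(weights)      # row 0: [0.9, 0.1, 0.2]
wnd = wanda_scores(weights, act_norm)

# Magnitude would prune w[0][1] first; once its large input activation is
# accounted for, w[0][0] (fed by a near-dead input) becomes the weakest.
print(min(range(3), key=lambda j: mag[0][j]))  # 1
print(min(range(3), key=lambda j: wnd[0][j]))  # 0
```

Gradient- and Hessian-based scores follow the same pattern but replace the activation norm with loss-sensitivity terms.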
Pruning Schedule
| Method | Description | Pros | Cons |
|---|---|---|---|
| One-Shot | Prune once, no retraining | Fast, minimal compute | Lower quality at high sparsity |
| Iterative | Prune → retrain repeatedly | Higher quality, smoother | Expensive, slow |
| Gradual | Progressive sparsity increase during training | Good quality-cost tradeoff | Requires full retraining |
Sparsity Patterns
Unstructured
Any weight can be pruned. Most flexible but harder to accelerate on hardware.
Structured
Prune entire neurons, heads, or layers. Hardware-friendly but less flexible.
Semi-Structured (N:M)
N zeros per M consecutive values (e.g., 2:4). Hardware acceleration available.
The Lottery Ticket Hypothesis
Implication: a trained dense network contains sparse subnetworks ("winning tickets") that, when reset to their original initialization and retrained, can match the dense network's accuracy, suggesting that much of the over-parameterization is redundant at inference time.
Pruning vs. Distillation vs. Quantization
| Technique | What It Does | Speedup | Quality Loss | Combined? |
|---|---|---|---|---|
| Pruning | Remove weights/layers | 2–4× | 5–20% | Yes with all |
| Quantization | Reduce precision (FP32→INT8) | 2–3× | 2–10% | Yes with all |
| Distillation | Train small model on large model outputs | Variable | 0–5% | Best baseline |
Unstructured Pruning
Remove individual weights to achieve maximum flexibility in sparsity patterns.
WANDA (Pruning by Weights AND Activations) — ICLR 2024
State-of-the-art for one-shot LLM pruning. Prunes weights with smallest magnitude × activation product.
```python
def wanda_prune(model, sparsity_level, calibration_data):
    # Score each weight by |w| * mean |activation| of its input feature
    for layer in model.layers:
        weight = layer.weight            # shape: [out, in]
        activations = layer.get_activations(calibration_data)

        w_abs = torch.abs(weight)
        a_abs = torch.abs(activations).mean(dim=0)   # [in]
        importance = w_abs * a_abs.unsqueeze(0)

        # Threshold for the target sparsity (0.5 -> drop bottom 50%)
        threshold = torch.quantile(importance, sparsity_level)
        mask = importance > threshold

        layer.weight.data *= mask.float()
    return model
```
SparseGPT
Layer-wise optimal brain surgeon with second-order approximation. Handles 50–60% unstructured sparsity well.
```python
# Simplified SparseGPT reconstruction (pseudocode)
for layer in model.layers:
    # Inverse Hessian from calibration data (the expensive step)
    H_inv = layer.get_hessian_inverse(calibration_data)
    for output_idx in range(layer.out_features):
        w = layer.weight[output_idx]     # [in]
        # OBS-style saliency: weights that are cheapest to remove
        importance = torch.abs(w) / torch.sqrt(H_inv.diagonal())
        prune_idx = importance.argsort()[:num_to_prune]
        w[prune_idx] = 0
        # ... update surviving weights to compensate (reconstruction) ...
```
Method Comparison: WANDA vs SparseGPT vs Magnitude
| Method | Scoring | Retraining | Speed | Quality @ 50% Sparse (vs dense) |
|---|---|---|---|---|
| WANDA | \|w\| × \|a\| | No | ~5 min (LLaMA-7B) | 92–94% |
| SparseGPT | Hessian-based reconstruction | No | 1–2 hours | 93–95% |
| Magnitude | \|w\| only | No | <1 min | 88–90% |
Structured Pruning
Remove entire neurons, layers, or attention heads for hardware-friendly sparsity.
Layer Removal (ShortGPT)
Remove middle transformer layers while keeping embedding and output layers. NVIDIA's Nemotron: 15B → 8B/4B.
```python
# Remove the middle 24 of 48 layers (48 -> 24)
original_layers = 48
target_layers = 24
layers_to_remove = original_layers - target_layers

# Keep first 12, remove the middle 24, keep last 12
new_model = Model(
    layers=model.layers[:12] + model.layers[36:]
)

# Quick fine-tune on calibration data to recover quality
fine_tune(new_model, calibration_data, epochs=3)
```
Attention Head Pruning
Remove least important heads based on activation magnitude or gradient flow. Typical: 5–15% of heads are redundant.
```python
# Measure head importance and drop the weakest heads (pseudocode)
for layer in model.layers:
    multi_head = layer.self_attn

    # Score each head by its mean attention output magnitude
    head_scores = []
    for h in range(multi_head.num_heads):
        head_attn = multi_head.attn_output[h]  # averaged across the batch
        head_scores.append(head_attn.abs().mean())

    # Remove the 5 lowest-scoring heads
    heads_to_remove = torch.argsort(torch.tensor(head_scores))[:5]
    multi_head.num_heads -= 5
```
Width Pruning
Reduce hidden dimensions (FFN inner dimension, embedding dimension). Less common but effective.
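A minimal sketch of width pruning on toy matrices: score each FFN inner channel by the norms of its up- and down-projection weights, then keep only the top half. Names (`w_up`, `w_down`, `channel_score`) are generic illustrations, not any library's API.

```python
# Width-pruning sketch: drop the lowest-importance FFN inner channels.
# Toy matrices with made-up values; pure Python, no framework.
import math

d_ff = 4
w_up   = [[0.5, 0.5], [0.01, 0.02], [1.0, -1.0], [0.3, 0.1]]  # d_ff x d_model
w_down = [[0.2, 0.0, 0.7, 0.1], [0.1, 0.0, -0.6, 0.2]]        # d_model x d_ff

def channel_score(i):
    # Importance of inner channel i: L2 norm of its up-projection row
    # times the L2 norm of its down-projection column
    up = math.sqrt(sum(x * x for x in w_up[i]))
    down = math.sqrt(sum(row[i] * row[i] for row in w_down))
    return up * down

# Keep the top half of channels, preserving their original order
keep = sorted(sorted(range(d_ff), key=channel_score, reverse=True)[: d_ff // 2])

w_up_pruned = [w_up[i] for i in keep]
w_down_pruned = [[row[i] for i in keep] for row in w_down]
print(keep)  # channels 1 (near-zero weights) and 3 (low norms) are dropped
```

Because whole rows and columns disappear, the pruned matrices are smaller dense tensors, so the speedup needs no sparse hardware.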
Minitron Approach (NVIDIA, 2024)
- 1. Identify and remove least important layers/heads
- 2. Fine-tune on diverse data
- 3. Distill from original Nemotron-15B
- 4. Result: Minitron-8B & 4B match or exceed other 7-8B models
Sheared LLaMA & LLM-Pruner
Sheared LLaMA: Structured pruning of LLaMA2-7B down to compact 1.3B/2.7B models, followed by continued pretraining to recover quality.
LLM-Pruner: Task-agnostic structured pruning using importance scores from gradient flow. Works on any LLM.
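The gradient-flow scoring behind LLM-Pruner can be illustrated on a single linear neuron. This sketch computes the gradient by hand for a squared-error loss; a real implementation reads `.grad` from an autograd backward pass.

```python
# Gradient-based importance sketch in the spirit of |w * dL/dw| scoring.
# One linear neuron, one calibration input, hand-computed gradient.

w = [0.5, -0.1, 0.05]      # weights
x = [1.0, 0.2, 6.0]        # one calibration input
target = 0.0

y = sum(wi * xi for wi, xi in zip(w, x))        # forward: y = w . x
grad = [2.0 * (y - target) * xi for xi in x]    # dL/dw_j for L = (y - t)^2

importance = [abs(wi * gi) for wi, gi in zip(w, grad)]
least = min(range(len(w)), key=lambda j: importance[j])

# Pure magnitude would drop index 2 (|0.05| is smallest); gradient flow
# says index 1 matters least because it barely moves the loss.
print(least)  # 1
```

Structured variants aggregate these per-weight scores over a whole neuron, head, or layer before deciding what to remove.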
Structured Pruning Pros & Cons
Advantages
- ✓ Hardware-friendly (drop layers, narrow tensors)
- ✓ Easy to implement
- ✓ Works with standard training pipelines
- ✓ Minimal memory overhead
Disadvantages
- ✗ Less flexible (can't remove single weights)
- ✗ Layer collapse at high sparsity
- ✗ May require more retraining
- ✗ Quality-sparsity tradeoff worse than unstructured
N:M Sparsity
Semi-structured sparsity with hardware acceleration support.
N:M sparsity means exactly N zeros per M consecutive values. Most common: 2:4 (50% sparsity) and 4:8 (50% sparsity).
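The constraint is easy to verify mechanically. The checker below (a pure-Python sketch, not a library function) confirms that every group of 4 consecutive values in a row carries at most 2 nonzeros, which is what sparse tensor cores require.

```python
# Tiny checker for the 2:4 constraint: at most 2 nonzeros (i.e. at least
# 2 zeros) in every group of 4 consecutive values along a row.

def satisfies_2_4(row):
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    return all(
        sum(1 for v in row[i:i + 4] if v != 0) <= 2
        for i in range(0, len(row), 4)
    )

print(satisfies_2_4([0.5, 0.0, -0.3, 0.0,  0.0, 0.0, 1.2, 0.7]))  # True
print(satisfies_2_4([0.5, 0.1, -0.3, 0.0,  0.0, 0.0, 1.2, 0.7]))  # False
```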
Hardware Support
| Hardware | Pattern Support | Acceleration | Speedup @ 2:4 |
|---|---|---|---|
| NVIDIA A100 (Ampere) | 2:4, fine-grained | Sparse tensor cores | ~2× |
| NVIDIA H100 (Hopper) | 2:4, fine-grained | Sparse tensor cores | ~2.5× |
| Intel Gaudi | Structured only | Limited | ~1.2× |
| CPU | Any | None (dense compute) | ~1× |
WANDA Extended to N:M
Apply WANDA importance scores but enforce N:M constraint: for each group of M values, keep only N highest-importance ones.
```python
def wanda_nm_prune(weight, activations, n, m):
    # Importance: |w| * mean |activation| per input feature
    importance = torch.abs(weight) * activations.abs().mean(dim=0)

    # Group each row into blocks of m consecutive values
    out_features, in_features = weight.shape
    grouped = importance.view(out_features, in_features // m, m)

    # Keep the n highest-importance values in every group
    mask = torch.zeros_like(grouped)
    top_idx = torch.topk(grouped, n, dim=-1).indices
    mask.scatter_(-1, top_idx, 1.0)

    mask = mask.view_as(weight)
    return weight * mask
```
Trade-offs
2:4 Sparsity (50%)
- ✓ 2× speedup on A100/H100
- ✓ Minimal quality loss (1–3%)
- ✓ Good practical balance
4:8 Sparsity (50%)
- ✓ More aggressive sparsity pattern
- ✓ Quality loss similar to 2:4
- ✗ Less hardware support
Progressive Pruning
Iterative pruning with retraining for higher quality at extreme sparsity.
Iterative Magnitude Pruning (IMP)
Repeatedly prune, retrain, and reset surviving weights to their original values (the lottery-ticket recipe).
```python
def iterative_magnitude_pruning(model, target_sparsity, num_iterations):
    original_weights = {k: v.clone() for k, v in model.state_dict().items()}

    for iteration in range(num_iterations):
        # Prune a fraction of the weights each round
        pruning_amount = target_sparsity / num_iterations
        apply_magnitude_pruning(model, pruning_amount)

        # Retrain to recover quality
        optimizer = AdamW(model.parameters(), lr=1e-4)
        for epoch in range(5):
            for batch in train_loader:
                optimizer.zero_grad()
                loss = compute_loss(model, batch)
                loss.backward()
                optimizer.step()

        # Reset surviving weights to their original values (lottery ticket)
        for name, param in model.named_parameters():
            mask = param.data != 0
            param.data[mask] = original_weights[name][mask]

    return model
```
Gradual Pruning During Training
Increase sparsity gradually during the full training run using polynomial schedules.
```python
# Cubic pruning schedule (Zhu & Gupta, 2017): sparsity ramps up quickly
# early in training, then levels off at the target
def cubic_schedule(step, total_steps, initial_sparsity, target_sparsity):
    progress = min(step / total_steps, 1.0)
    return target_sparsity + (initial_sparsity - target_sparsity) * (1 - progress) ** 3

# During the training loop
for step, batch in enumerate(train_loader):
    current_sparsity = cubic_schedule(step, total_steps, 0.0, 0.5)
    apply_pruning(model, current_sparsity)

    optimizer.zero_grad()
    logits = model(batch)
    loss = compute_loss(logits, batch)
    loss.backward()
    optimizer.step()
```
Lottery Ticket Rewinding
After pruning, reset weights to their values from an early training checkpoint, then continue retraining.
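The mechanics of rewinding can be shown with plain lists standing in for model state. The step-1000 checkpoint and all values below are made-up examples: prune based on the *trained* weights, then reset survivors to the *checkpoint* values.

```python
# Weight-rewinding sketch: prune by final trained magnitude, then reset
# surviving weights to an early-training checkpoint. Toy values only.

ckpt_step_1000  = [0.12, -0.48, 0.35, -0.02]   # early-training checkpoint
trained_weights = [0.40, -0.90, 0.80,  0.01]   # end of training

# Keep the two largest-magnitude *trained* weights (indices 1 and 2)
keep = sorted(sorted(range(4), key=lambda j: abs(trained_weights[j]),
                     reverse=True)[:2])
mask = [1 if j in keep else 0 for j in range(4)]

# Survivors rewind to the checkpoint; pruned weights stay zero
rewound = [c if m else 0.0 for c, m in zip(ckpt_step_1000, mask)]
print(rewound)  # [0.0, -0.48, 0.35, 0.0]
```

Rewinding to an early checkpoint rather than to initialization tends to stabilize retraining at high sparsity.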
One-Shot vs Progressive Trade-offs
| Method | Quality @ 50% | Quality @ 80% | Training Time | Practical Use |
|---|---|---|---|---|
| One-Shot (WANDA) | 92% | 75% | 5 min | Quick iteration |
| Iterative (5 rounds) | 94% | 82% | 3–5 hours | Higher quality target |
| Gradual (full training) | 95% | 85% | 20 hours | From-scratch training |
Training Recipes
Practical pipelines for different pruning scenarios.
Post-Training Pruning (No Retraining)
Best for quick results. Use WANDA or SparseGPT for quality one-shot pruning.
```python
# Load a pre-trained model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")

# Calibrate on a small sample (8 samples x 128 tokens)
calibration_data = load_calibration_data(n_samples=8, seq_len=128)

# Apply WANDA (~5 minutes for a 7B model)
model = wanda_prune(model, sparsity_level=0.5, calibration_data=calibration_data)

# Evaluate
perplexity = evaluate_perplexity(model, validation_data)
print(f"50% sparse model: {perplexity} perplexity")
```
Prune-Then-Retrain
Prune once, then fine-tune on task-specific data for 1–3 epochs.
```python
# 1. Prune
model = wanda_prune(model, sparsity_level=0.6, calibration_data=calibration_data)

# 2. Fine-tune on the downstream task
optimizer = AdamW(model.parameters(), lr=1e-4)
for epoch in range(3):
    for batch in task_train_loader:
        optimizer.zero_grad()
        logits = model(batch["input_ids"])
        loss = F.cross_entropy(logits, batch["labels"])
        loss.backward()
        optimizer.step()

# 3. Evaluate on the downstream task
accuracy = evaluate_downstream(model, task_val_loader)
print(f"Task accuracy: {accuracy}")
```
Pruning + Distillation (Minitron Style)
Combine structured pruning with knowledge distillation for best quality.
```python
# 1. Identify layers to keep (gradient-based importance)
important_layers = identify_important_layers(model, calibration_data)

# 2. Create the pruned student architecture
pruned_model = create_pruned_model(model, important_layers)

# 3. Distill from the larger teacher checkpoint
teacher_model = AutoModelForCausalLM.from_pretrained(
    "path/to/15B-teacher"  # e.g. the original 15B model
)
teacher_model.eval()

kl_loss_fn = nn.KLDivLoss(reduction="batchmean")
optimizer = AdamW(pruned_model.parameters(), lr=5e-5)

T = 3  # distillation temperature
for batch in distill_train_loader:
    optimizer.zero_grad()
    student_logits = pruned_model(batch)
    with torch.no_grad():
        teacher_logits = teacher_model(batch)

    # KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable
    loss = kl_loss_fn(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
    ) * T * T
    loss.backward()
    optimizer.step()
```
Pruning + Quantization (Maximum Compression)
Combine 50% sparsity + INT8 quantization for 8–12× total compression.
```python
# 1. Prune to 50% sparsity
model = wanda_prune(model, sparsity_level=0.5, calibration_data=calibration_data)

# 2. Quantize to INT8 (e.g. via bitsandbytes Linear8bitLt layers)
from bitsandbytes.nn import Linear8bitLt
quantized_model = convert_to_8bit(model)  # helper that swaps in 8-bit linears

# 3. Quick recalibration forward passes
for batch in calibration_data:
    _ = quantized_model(batch)

# Result: 50% sparse + 8-bit = 6.4x smaller, 4-6x faster
print(f"Final size: {quantized_model.get_memory_footprint() / 1e9} GB")
```
Benchmarks
Real-world performance metrics across models and methods.
Perplexity vs Sparsity (LLaMA-7B, WikiText-2)
| Sparsity | Dense Baseline | Magnitude | WANDA | SparseGPT |
|---|---|---|---|---|
| 0% (Dense) | 5.68 | — | — | — |
| 30% | — | 5.95 | 5.81 | 5.78 |
| 50% | — | 6.52 | 6.09 | 6.03 |
| 60% | — | 7.41 | 6.65 | 6.51 |
| 70% | — | 9.12 | 7.45 | 7.22 |
Quality Retention at Different Sparsity Levels
| Sparsity Level | Zero-Shot Accuracy (relative to dense) | MMLU (absolute) | HellaSwag (absolute) |
|---|---|---|---|
| Dense | 100% | 32% | 78% |
| 50% (WANDA) | 99% | 31% | 77% |
| 60% (WANDA) | 97% | 29% | 74% |
| 70% (WANDA) | 91% | 26% | 68% |
| 80% (WANDA) | 75% | 18% | 52% |
Speed Improvements with N:M Sparsity
| Hardware | 2:4 Sparsity | Token/s (Dense) | Token/s (Sparse) | Speedup |
|---|---|---|---|---|
| A100 (40GB) | Yes | 450 | 920 | 2.04× |
| H100 (80GB) | Yes | 580 | 1450 | 2.50× |
| A10 GPU | No | 120 | 125 | 1.04× |
| CPU (AMD EPYC) | No | 5 | 6 | 1.2× |
Model Size & Memory Savings
| Model | Original Size | 50% Sparse | 50% Sparse + INT8 | Memory Saved |
|---|---|---|---|---|
| LLaMA-7B (FP16) | 13 GB | 6.5 GB | 3.3 GB | 75% |
| LLaMA-13B (FP16) | 26 GB | 13 GB | 6.5 GB | 75% |
| Llama 2-70B (FP16) | 140 GB | 70 GB | 35 GB | 75% |
Model Directory
Pre-pruned models and base models suitable for pruning.
Pre-Pruned Models on Hugging Face
| Model | Base Model | Pruning Method | Sparsity | Quality | Link |
|---|---|---|---|---|---|
| Minitron-8B | Nemotron-15B | Structured + Distill | 47% | Matches 7B baseline | nvidia/Minitron-8B |
| Minitron-4B | Nemotron-15B | Structured + Distill | 73% | Matches 3B baseline | nvidia/Minitron-4B |
| Sparse-LLaMA-7B | LLaMA-7B | WANDA (unstructured) | 50% | 93% baseline quality | vwxyzjn/sparse-llama-7b |
| Sheared-LLaMA | LLaMA2-7B | Structured + continued pretraining | Pruned to 1.3B/2.7B | Strong for model size | princeton-nlp/Sheared-LLaMA-* |
Recommended Base Models for Pruning
Llama 2 Series
- meta-llama/Llama-2-7b
- meta-llama/Llama-2-13b
- meta-llama/Llama-2-70b
- Easy to prune, well-documented
Mistral Series
- mistralai/Mistral-7B-v0.1
- mistralai/Mixtral-8x7B
- Good efficiency before pruning
Phi Series
- microsoft/phi-1.5
- microsoft/phi-2
- Small, prunes well
OLMo Series
- allenai/OLMo-1B
- allenai/OLMo-7B
- Research models, open
Deployment
Running pruned models in production with optimized inference.
DeepSparse Engine (Neural Magic)
Optimized CPU inference for sparse models. Perfect for edge deployment without GPU.
```python
from deepsparse import Pipeline

# Load a sparse model from the SparseZoo
pipeline = Pipeline.create(
    task="text_generation",
    model_path="zoo:nlp/text_generation/openai-gpt2/pruned_quantized",
    engine_type="deepsparse",
)

# Inference (kernels optimized for sparsity)
result = pipeline(prompt="The future of AI is", max_length=100)
print(result.generations[0].text)
```
NVIDIA Sparse Tensor Cores (A100/H100)
Use structured (2:4) sparsity with NVIDIA's CUTLASS or TensorRT-LLM.
```python
# Use TensorRT-LLM with sparse plugins (illustrative sketch; see the
# TensorRT-LLM docs for the actual builder API)
import tensorrt_llm as trt_llm

# Build a sparse model engine (2:4 sparsity auto-detected in the weights)
builder = trt_llm.Builder()
engine = builder.build_engine(
    model_path="./llama-7b-50-sparse.safetensors",
    precision="float16",
    max_batch_size=32,
)

# Serialize and deploy
engine.save("./llama-sparse.engine")
```
vLLM Integration
Use vLLM's PagedAttention with sparse models for efficient batching.
```python
from vllm import LLM, SamplingParams

# Load a sparse model checkpoint
llm = LLM(
    model="sparse-llama-7b-50-percent",
    dtype="float16",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.7,
)

prompts = ["What is AI?", "How does ML work?"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
Custom Sparse Kernel Implementation
For ultimate control, implement custom sparse matrix-vector multiplication (e.g., in CUDA/Triton).
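The kernel's inner loop is the standard CSR (compressed sparse row) matrix-vector product. The pure-Python reference below shows the data layout and the loop a CUDA/Triton kernel would parallelize across rows; it is a sketch, not production code.

```python
# Minimal CSR sparse matrix-vector product: the reference computation a
# custom sparse kernel implements in parallel.

def dense_to_csr(mat):
    values, col_idx, row_ptr = [], [], [0]
    for row in mat:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):  # only stored nonzeros
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

mat = [[0.0, 2.0, 0.0],
       [1.0, 0.0, 3.0]]
vals, cols, ptr = dense_to_csr(mat)
print(csr_matvec(vals, cols, ptr, [1.0, 1.0, 1.0]))  # [2.0, 4.0]
```

Work (and storage) scales with the number of nonzeros, which is where unstructured sparsity pays off when the kernel is good enough.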
Cost Analysis
Economic impact of pruning in training and inference.
Training Cost Comparison
| Method | Time (7B Model) | Compute Cost | GPU Hours | Cost @ $1/hr |
|---|---|---|---|---|
| WANDA (one-shot) | 5 min | 1× (calibration) | 0.08 | $0.08 |
| SparseGPT (one-shot) | 1 hour | 1× (reconstruction) | 1 | $1.00 |
| Iterative (5 rounds) | 3–5 hours | 5× (retraining) | 20 | $20.00 |
| From-scratch training | 20 hours | 1× (full training) | 20 | $20.00 |
Inference Savings (Annual Estimate)
Scenario: 1M requests/day, 10 days retention
| Model Setup | A100 GPUs Needed | Annual Cost (@ $50k/yr) | Savings vs Dense |
|---|---|---|---|
| Dense LLaMA-7B | 8 | $400k | — |
| 50% Sparse (WANDA) | 4 | $200k | 50% savings |
| 50% Sparse + Quantized | 2 | $100k | 75% savings |
Pruning vs Quantization Trade-offs
| Technique | Size Reduction | Speedup | Quality Loss | Ease of Use |
|---|---|---|---|---|
| Quantization Alone (INT8) | 4× | 2–3× | 2–10% | Easy |
| Pruning Alone (50%) | 2× | 2–4× (with H/W support) | 5–10% | Medium |
| Both (50% Sparse + INT8) | 8× | 4–6× | 8–15% | Hard |
Cloud GPU Pricing Reference (March 2026)
| GPU | VRAM | On-Demand $/hr | Spot $/hr | Pruning Use |
|---|---|---|---|---|
| A100 80GB | 80GB | $2.00-3.00 | $1.00-1.80 | WANDA/SparseGPT on 7-13B; 2:4 sparsity inference |
| H100 SXM | 80GB | $2.40-4.00 | $1.50-2.50 | N:M sparse tensor core acceleration; best throughput |
| 4× A100 80GB | 320GB | $8.00-12.00 | $4.00-7.00 | Structured pruning + retraining for 70B models |
Pruned Self-Host vs API — Total Cost (1M req/day, 500 tok/req)
| Approach | Monthly Cost | Annual Cost | Latency (P50) | vs GPT-4o API |
|---|---|---|---|---|
| GPT-4o API | $7,500 | $90,000 | 500-2000ms | Baseline |
| GPT-4o-mini API | $1,125 | $13,500 | 200-800ms | 85% cheaper |
| Llama-8B Dense (A100) | $2,160 | $25,920 | 50-150ms | 71% cheaper |
| Llama-8B 50% Sparse (A100, 2:4) | $1,440 | $17,280 | 25-80ms | 81% cheaper |
| Llama-8B Sparse + INT4 (A40) | $576 | $6,912 | 30-100ms | 92% cheaper |
When to Prune vs Just Quantize
Prune If:
- ✓ You have H/W with sparse support (A100/H100 tensor cores)
- ✓ Model latency is critical (<50ms P99)
- ✓ You need 4-6x combined speedup (sparse + quant)
- ✓ You can spare 5 min (WANDA) to 5 hours per model
- ✓ Serving cost is >$2K/month and you need deeper savings
Quantize Instead If:
- ✓ You have no sparse H/W (consumer GPUs, CPU)
- ✓ You need quick deployment (<1 hour)
- ✓ Model size (VRAM) is the bottleneck, not throughput
- ✓ 2-3x speedup is acceptable for your SLA
- ✓ Quality sensitivity is very high (<1% loss tolerance)
Failure Modes & Mitigation
Common pitfalls and how to avoid them.
Over-Pruning Causing Catastrophic Loss
Mitigation Strategies
- 1. Use progressive/iterative pruning instead of one-shot at high sparsity
- 2. Validate perplexity at each pruning step
- 3. Stay <70% sparsity unless using distillation
- 4. Use importance-aware scoring (WANDA, not magnitude alone)
Layer Collapse in Structured Pruning
```python
# Detect layer collapse during pruning
def detect_collapse(model, calibration_data):
    activations = model.get_layer_activations(calibration_data)
    for layer_name, act in activations.items():
        # A degenerate output distribution has very low relative variance
        mean = act.abs().mean()
        std = act.std()
        if std / mean < 0.1:
            print(f"WARNING: {layer_name} may be collapsing")
            return True
    return False
```
Mitigation Strategies
- 1. Monitor layer output statistics during pruning
- 2. Use activation-aware importance (WANDA uses this)
- 3. Retrain for 2–3 epochs after structured pruning
- 4. Preserve skip connections; don't prune residual paths
Sparsity Pattern Hardware Incompatibility
Mitigation Strategies
- 1. Use 2:4 structured sparsity for NVIDIA GPUs
- 2. Use DeepSparse Engine for CPU deployment
- 3. Verify speedup in your target hardware before production
- 4. Consider structured pruning instead if sparsity speedup fails
Retraining Instability After Aggressive Pruning
```python
# Stable retraining recipe
optimizer = AdamW(model.parameters(), lr=1e-5)   # 10x smaller LR than typical
scheduler = CosineAnnealingLR(optimizer, T_max=10)

for epoch in range(10):
    for batch in train_loader:
        optimizer.zero_grad()
        logits = model(batch)
        loss = F.cross_entropy(logits, batch["labels"])
        loss.backward()
        # Gradient clipping for stability
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    scheduler.step()  # step once per epoch (T_max counts epochs)
```
Mitigation Strategies
- 1. Use 10× smaller learning rate for post-pruning retraining
- 2. Apply gradient clipping (norm ≤ 1.0)
- 3. Warm up learning rate gradually
- 4. Use fewer epochs (2–5) but smaller batches
- 5. Consider layer-wise adaptive retraining rates
Common Debugging Checklist
- ☐ Is pruned model's speedup actually measured on target hardware?
- ☐ Are layer outputs collapsing to near-zero?
- ☐ Did you validate perplexity on calibration data?
- ☐ Are weights being zeroed correctly (check sparsity %)?
- ☐ Is retraining learning rate too high?
- ☐ Did you preserve batch norm statistics post-pruning?
- ☐ Are skip connections preserved in structured pruning?
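The sparsity-percentage check from the list above is a one-liner worth automating, since a pruning run that silently failed will report near-0% sparsity. A framework-agnostic sketch, with nested lists standing in for weight tensors:

```python
# Checklist helper: measure the fraction of exactly-zero weights.
# Nested Python lists stand in for a model's weight tensors.

def measure_sparsity(tensors):
    zeros = total = 0
    for t in tensors:
        for row in t:
            for v in row:
                zeros += (v == 0.0)
                total += 1
    return zeros / total

layers = [
    [[0.0, 0.5], [0.0, -0.2]],   # 2 of 4 weights zero
    [[1.0, 0.0], [0.0, 0.0]],    # 3 of 4 weights zero
]
print(measure_sparsity(layers))  # 0.625
```

With real tensors the same check is a sum over `(param == 0)` counts; compare the result against the sparsity you asked for.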
Tools & Frameworks
Software ecosystem for LLM pruning.
LLM-Pruning Collection (Jan 2026)
JAX-based pruning framework from Princeton Z-Lab. Supports WANDA, SparseGPT, structured pruning with unified API. Latest version (Jan 2026) includes N:M support.
Neural Magic SparseML & DeepSparse
SparseML
PyTorch-based toolkit for creating, training, and fine-tuning sparse models. Easy integration with Hugging Face transformers.
```python
from sparseml.transformers import SparseAutoModelForCausalLM

model = SparseAutoModelForCausalLM.from_pretrained(
    "zoo:nlp/question_answering/bert/pruned"
)

# Fine-tune the sparse model with a standard HF Trainer
trainer = Trainer(model=model, args=args, train_dataset=train_data)
trainer.train()
```
DeepSparse
CPU inference engine optimized for sparse models. Drop-in replacement for PyTorch that accelerates sparse weights.
NVIDIA NeMo & Minitron
NVIDIA's NeMo framework includes structured pruning utilities and distillation. Minitron models are reference implementations.
```python
from nemo.collections.nlp.models import MegatronGPTModel

# Load a Minitron model
model = MegatronGPTModel.from_pretrained("nvidia/Minitron-8B")

# Run in half precision
model = model.half()
outputs = model.forward(input_ids)
```
PyTorch Native torch.nn.utils.prune
Built-in PyTorch module for magnitude-based unstructured pruning. Good for simple use cases.
```python
import torch.nn.utils.prune as prune

# Prune 50% of the weights in a layer (pass the module, not the tensor)
prune.l1_unstructured(model.layer, name="weight", amount=0.5)

# Make the sparsity permanent (removes the reparameterization)
prune.remove(model.layer, "weight")
```
Open-Source Implementations
| Project | Method | Framework | GitHub |
|---|---|---|---|
| WANDA | Weights & Activations | PyTorch | locuslab/wanda |
| SparseGPT | Layer-wise reconstruction | PyTorch | IST-DASLab/sparsegpt |
| Sheared LLaMA | Structured + continued pretraining | PyTorch | princeton-nlp/LLM-Shearing |
| LLM-Pruner | Task-agnostic structured | PyTorch | horseee/LLM-Pruner |
Research Papers
Key publications in LLM pruning and sparsity.
Foundational Pruning
WANDA: Pruning by Weights and Activations (2024)
Sun et al., ICLR 2024
One-shot unstructured pruning using |w| × |a| importance scoring. No retraining, matches SparseGPT quality.
SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot (2023)
Frantar & Alistarh, ICML 2023
Layer-wise Hessian-based reconstruction. Handles 50–60% unstructured sparsity with high quality.
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (2019)
Frankle & Carbin, ICLR 2019
Random sparse networks can match dense accuracy. Foundation for understanding why pruning works.
Structured Pruning for LLMs
Minitron: Compact Language Models via Pruning and Knowledge Distillation (NVIDIA, 2024)
Structured pruning + distillation. Nemotron-15B → 8B/4B with superior quality.
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning (2024)
Xia et al.
Structured pruning of LLaMA2-7B into compact 1.3B/2.7B models, followed by continued pretraining; the resulting models match or exceed equal-size models trained from scratch.
LLM-Pruner: On the Structural Pruning of Large Language Models (2024)
Ma et al.
Task-agnostic structured pruning via gradient-based importance scoring. Works on any LLM architecture.
Related Topics
Quantization
QLoRA, GPTQ, AWQ – Often combined with pruning for maximum compression.
Knowledge Distillation
Minitron, DistilBERT – Best baseline for high-quality sparse models.
Reading List
- Lottery Ticket Hypothesis (foundational intuition)
- SparseGPT (understand layer-wise approach)
- WANDA (modern one-shot method)
- Minitron (structured + distillation)
- LLM-Pruner (task-agnostic structured)
Glossary of Pruning Terms
16 key technical terms used throughout this guide, organized alphabetically.
2
| Term | Definition |
|---|---|
| 2:4 Sparsity (N:M) | A semi-structured sparsity pattern where exactly 2 out of every 4 consecutive weights are zero. Supported natively by NVIDIA Ampere/Hopper tensor cores for 2× inference speedup. |
D
| Term | Definition |
|---|---|
| DeepSparse | Neural Magic's inference engine optimized for running sparse models on CPUs, achieving GPU-like performance for pruned and quantized models without requiring GPU hardware. |
G
| Term | Definition |
|---|---|
| Global Pruning | Ranking and removing weights across the entire model against a single global importance threshold. Often outperforms layer-wise pruning but requires scoring all weights simultaneously. |
I
| Term | Definition |
|---|---|
| Iterative Pruning | Gradually increasing sparsity over multiple training rounds (prune → retrain → prune → retrain). Achieves higher sparsity with less quality loss than one-shot methods but costs more compute. |
L
| Term | Definition |
|---|---|
| Layer-wise Pruning | Pruning each layer independently to a target sparsity ratio. Simpler than global pruning but may over-prune sensitive layers. Some methods (ShortGPT) identify and remove entire layers. |
| Lottery Ticket Hypothesis | The theory (Frankle & Carbin, 2019) that dense networks contain sparse subnetworks ("winning tickets") that can match the full network's performance when trained in isolation from their initial weights. |
M
| Term | Definition |
|---|---|
| Magnitude Pruning | The simplest pruning method: removing weights with the smallest absolute values. Fast but suboptimal — small weights may still be important for certain inputs. |
| Minitron | NVIDIA's approach combining structured pruning with knowledge distillation. Compressed Nemotron 15B to 8B and 4B models that match or exceed other models in their size class. |
O
| Term | Definition |
|---|---|
| One-Shot Pruning | Pruning to the target sparsity in a single step without retraining. Fast (minutes) but may have higher quality loss. WANDA and SparseGPT are the leading one-shot methods. |
P
| Term | Definition |
|---|---|
| Pruning | Removing redundant weights, neurons, attention heads, or entire layers from a neural network to reduce size and increase speed. Can be unstructured (individual weights) or structured (entire components). |
S
| Term | Definition |
|---|---|
| ShortGPT | A structured pruning method that identifies and removes entire Transformer layers based on their contribution to the output. Found that middle layers often contribute least. |
| SparseGPT | A one-shot pruning method using approximate second-order reconstruction to optimally adjust remaining weights after pruning. Achieves 50-60% sparsity on LLMs with minimal quality loss in ~1 GPU-hour. |
| Sparsity | The fraction of zero-valued weights in a model. 50% sparsity means half the weights are zero. Higher sparsity = more compression but more quality risk. |
| Structured Pruning | Removing entire structural components (attention heads, neurons, layers, channels) rather than individual weights. Produces actual speedups on standard hardware without sparse matrix support. |
U
| Term | Definition |
|---|---|
| Unstructured Pruning | Setting individual weights to zero based on importance scores. Achieves higher compression ratios than structured pruning but requires specialized sparse hardware/software for actual speedups. |
W
| Term | Definition |
|---|---|
| WANDA | Weights AND Activations — a one-shot pruning method that scores weights by (magnitude × input activation norm). Requires only 5 minutes and a small calibration set. No retraining needed. Published at ICLR 2024. |