LLM Quantization — Compression Techniques & Production Deployment
A comprehensive guide to quantization methods for large language models — from precision reduction through GPTQ, AWQ, and GGUF to production inference at 2-8x compression with 90-99% quality retention. Covers post-training and quantization-aware approaches and deployment patterns with vLLM, llama.cpp, and TensorRT-LLM.
What is Quantization?
Reducing Precision of Weights and Activations
Quantization reduces the precision of model weights and activations from floating-point (FP32/FP16) to lower-bit integer representations (INT8/INT4). This dramatically reduces model size and memory, enabling faster inference and lower costs while maintaining 90-99% of original quality.
Why Quantization Matters
- ✓ 2-8x memory reduction
- ✓ 2-10x inference speedup
- ✓ 50-80% cost savings
- ✓ Deploy to edge & mobile
- ✓ Enable real-time inference
Modern quantization methods (GPTQ, AWQ) preserve 90-99% of model quality while achieving 8x compression. This makes them essential for production LLM deployment where cost and latency are critical.
Quantization Fundamentals
Core Concepts & Approaches
Post-Training Quantization (PTQ)
Apply quantization after training is complete. No retraining needed. Calibrate on small dataset (100-1000 samples). Fast but may have quality loss.
Best for: Quick deployment, limited compute
Quantization-Aware Training (QAT)
Simulate quantization during training. Model learns to compensate for precision loss. Higher quality but requires full retraining.
Best for: High-quality models, custom training
Weight-Only Quantization
Quantize only weights; keep activations in FP32/FP16. Simpler to implement, minimal accuracy loss, still achieves 4-8x speedup.
Weight+Activation Quantization
Quantize both weights and activations. Maximum speedup, more quality loss. Requires careful calibration.
Quantization Types
Symmetric Quantization
Range is symmetric around zero. Formula: q = round(x / scale)
Asymmetric Quantization
Range may not be symmetric. Formula: q = round(x/scale + zero_point)
Per-Tensor Quantization
One scale factor per entire tensor. Fast but less accurate for heterogeneous values.
Per-Channel Quantization
Different scale per channel. Better quality but higher overhead. Popular for weights.
Core Formulas
Quantization: q = clamp(round(x / scale + zero_point), q_min, q_max)
Dequantization: x' = (q - zero_point) × scale
Scale: scale = (max - min) / (2^bits - 1)
Zero Point: zero_point = -round(min / scale)
Bits: 4 (INT4) or 8 (INT8). Lower bits = more compression but more quality loss. Calibration: Choose representative data. Group size: Per-group (e.g., 128) balances quality and speed. Outliers: Special handling for extreme values.
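The quantize/dequantize formulas above can be checked in a few lines of Python — a minimal per-tensor asymmetric sketch using plain floats (illustration only; production kernels pack these into real integer tensors):

```python
def quantize_tensor(values, bits=8):
    """Asymmetric quantization: q = clamp(round(x/scale + zero_point), q_min, q_max)."""
    q_min, q_max = 0, 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (q_max - q_min)            # scale = (max - min) / (2^bits - 1)
    zero_point = round(-lo / scale)                # maps FP zero onto the integer grid
    q = [max(q_min, min(q_max, round(x / scale + zero_point))) for x in values]
    return q, scale, zero_point

def dequantize_tensor(q, scale, zero_point):
    """Dequantization: x' = (q - zero_point) * scale."""
    return [(qi - zero_point) * scale for qi in q]

x = [-1.0, -0.5, 0.0, 0.7, 2.0]
q, scale, zp = quantize_tensor(x, bits=8)
x_hat = dequantize_tensor(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(x, x_hat))
# Reconstruction error is bounded by half a quantization step (scale / 2)
assert max_err <= scale / 2 + 1e-9
```

Note how the range endpoints (-1.0 and 2.0) reconstruct exactly: they sit on the ends of the integer grid, while interior values incur at most half a step of rounding error.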
GPTQ: Generative Pre-trained Transformer Quantization
Efficient 4-bit Quantization via Hessian Information
What is GPTQ?
One-shot weight quantization using approximate second-order info (Hessian). Quantizes one layer at a time with optimal rounding. Achieves 4-bit quantization with minimal accuracy loss (< 2%).
Key Advantages
- ✓ 4-bit quantization (8x compression)
- ✓ <2% accuracy loss
- ✓ Fast inference on GPUs
- ✓ No retraining needed
- ✓ Works with any model
How GPTQ Works
Step 1: Quantize the model layer by layer. Step 2: For each weight, find the optimal quantized value using approximate second-order (Hessian) information computed from calibration activations.
Step 3: Minimize the Hessian-weighted reconstruction error ||W - Q||_H^2 (equivalent to the layer output error ||WX - QX||^2). Step 4: Update the remaining unquantized weights, via the inverse Hessian, to compensate for the rounding error just introduced.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
# Configuration
quantize_config = BaseQuantizeConfig(
    bits=4,            # 4-bit quantization
    group_size=128,    # weights per quantization group
    desc_act=True,     # quantize columns in order of decreasing activation
    static_groups=False,
)
# Load model and tokenizer (model loads in full precision for calibration)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantize_config=quantize_config,
)
# Calibrate on data (list of tokenized examples)
calibration_dataset = [...]  # 100-500 samples
model.quantize(calibration_dataset)
# Inference
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
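Step 4 (error compensation) is the heart of GPTQ. A toy illustration — not the real Hessian-weighted algorithm — that quantizes a single row of weights greedily while pushing each weight's rounding error onto the next, not-yet-quantized weight:

```python
def round_to_grid(w, scale):
    """Naive round-to-nearest onto a uniform grid."""
    return round(w / scale) * scale

def quantize_row_with_compensation(row, scale=0.1):
    """Quantize weights left to right; fold each rounding residual into
    the next weight so the row total is (almost) preserved."""
    out, carry = [], 0.0
    for w in row:
        target = w + carry
        q = round_to_grid(target, scale)
        carry = target - q  # residual pushed to the next weight
        out.append(q)
    return out

row = [0.23, 0.14, 0.31, 0.08]
naive = [round_to_grid(w, 0.1) for w in row]
comp = quantize_row_with_compensation(row, 0.1)
# Compensation keeps the row sum closer to the original than naive rounding
print(abs(sum(row) - sum(comp)) <= abs(sum(row) - sum(naive)))  # prints: True
```

GPTQ does the same thing per layer but in the right metric: the residual is distributed across remaining weights using the inverse Hessian, so the compensation minimizes the layer's output error rather than the raw weight error.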
| Metric | GPTQ | AWQ | GGUF (Q4_K) |
|---|---|---|---|
| Compression | 8x | 8x | 8x |
| Quality Loss | <2% | <1% | 2-3% |
| GPU Speedup (fp16→q4) | 2.6x | 3.2x | N/A |
| Calibration Time | 1-5 hours | 10-30 min | 30-60 min |
| Best For | GPU inference | High-throughput serving | CPU/edge |
Calibration: Use 100-500 diverse samples from your domain. Group size: 128 is typical. Smaller (32) = better quality, larger (256) = faster. Hardware: Quantize on same hardware as inference (GPU type matters).
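Group size matters because each group gets its own scale, so a single outlier only coarsens its own group rather than the whole tensor. A hedged sketch of symmetric per-group quantization (toy values and tiny groups; real GPTQ/AWQ kernels use group_size=128 and pack the results as INT4):

```python
def quantize_per_group(weights, group_size=4, bits=4):
    """Symmetric per-group quantization: one scale per group of weights."""
    q_max = 2 ** (bits - 1) - 1  # e.g. 7 for INT4
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    quantized, scales = [], []
    for g in groups:
        scale = max(abs(w) for w in g) / q_max or 1.0  # guard all-zero groups
        scales.append(scale)
        quantized.append([max(-q_max - 1, min(q_max, round(w / scale))) for w in g])
    return quantized, scales

# An outlier (8.0) only coarsens its own group's scale, not the whole tensor
weights = [0.1, -0.2, 0.05, 0.15, 8.0, 0.1, -0.1, 0.2]
q, scales = quantize_per_group(weights, group_size=4)
print(scales[0] < scales[1])  # prints: True — the first group keeps a fine scale
```

With per-tensor quantization the 8.0 outlier would stretch a single shared scale and crush all the small weights to zero; per-group quantization contains the damage at the cost of storing one scale per group.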
AWQ: Activation-Aware Quantization
Protecting Salient Weights Based on Activation Distributions
What is AWQ?
Activation-aware weight quantization. Analyzes activation distributions to identify which weights are most important. Protects salient weights with higher precision, quantizes others aggressively.
Advantages over GPTQ
- ✓ <1% accuracy loss
- ✓ 10-30 min calibration
- ✓ Less calibration data needed
- ✓ 3.2x GPU speedup (better kernels)
- ✓ Best for vLLM serving
How AWQ Works
AWQ searches for weight scaling factors that minimize quantization error weighted by activation magnitudes. For weights with high activations, use finer quantization. For low-activation weights, quantize aggressively. Optimal scaling factors are computed per-channel via iterative optimization.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Quantization config
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}
# Calibration and quantization
calibration_dataset = [...]  # 32-256 samples of your data
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=calibration_dataset
)
# Save quantized model (keep the tokenizer alongside for serving)
model.save_quantized("./llama2-7b-awq")
tokenizer.save_pretrained("./llama2-7b-awq")
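To build intuition for what the scale search does, here is a heavily simplified toy: scale up a salient weight before quantizing, then fold the inverse scale into the activation side. The product is mathematically unchanged, but the rounding error on the important weight shrinks by the scale factor. (Real AWQ searches per-channel scales and must also account for the shared group scale, which this toy ignores.)

```python
def quant_dequant(x, step=0.1):
    """Round-trip through a fixed uniform quantization grid."""
    return round(x / step) * step

w, s = 0.13, 4.0          # salient weight, scaling factor
x = 2.0                   # large activation flowing through this channel
# Naive: quantize w directly; output error is |Q(w) - w| * |x|
err_plain = abs(quant_dequant(w) - w) * abs(x)
# AWQ trick: quantize w*s, fold 1/s into the activation side
err_scaled = abs(quant_dequant(w * s) / s - w) * abs(x)
print(err_scaled <= err_plain)  # prints: True
```

Intuitively, multiplying by s=4 lets the weight use four times as many grid points, so its effective rounding error drops from a half-step to an eighth of a step — exactly the channels where activations (and thus output error) are largest get this protection.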
Marlin-AWQ kernels: 741 tokens/sec (Llama-2-7B on H100), a 3.2x speedup over FP16. Best for high-throughput inference with vLLM. Combine with batching for maximum throughput.
GGUF: Format for Quantized Models
CPU & Apple Silicon Optimized Quantization Levels
What is GGUF?
GGUF (GGML Universal Format) is a file format for quantized models evolved from GGML. Supports multiple quantization levels (Q2_K to Q8_0) optimized for CPU and Apple Silicon. Used by llama.cpp and Ollama.
Key Features
- ✓ Multiple quantization levels
- ✓ CPU-optimized inference
- ✓ Apple Silicon M1-M4 support
- ✓ Edge & local deployment
- ✓ Ollama integration
GGUF Quantization Levels
| Level | Bits | Size (7B model) | Quality | Speed | Use Case |
|---|---|---|---|---|---|
| Q2_K | 2-3 | ~800MB | Poor (80-85%) | Very Fast | Extreme edge, IoT |
| Q3_K_S | 3 | ~1GB | Fair (85-90%) | Very Fast | Phone, edge |
| Q3_K_M | 3 | ~1.2GB | Good (88-92%) | Fast | Mobile edge |
| Q4_K_S | 4 | ~1.5GB | Very Good (92-95%) | Fast | Consumer laptops |
| Q4_K_M | 4 | ~1.8GB | Excellent (95-97%) | Medium | Recommended default |
| Q5_K_S | 5 | ~2.2GB | Excellent (97-98%) | Medium | Desktop |
| Q6_K | 6 | ~2.7GB | Excellent (98%+) | Slow | High quality local |
| Q8_0 | 8 | ~3.5GB | Near-lossless (99%+) | Slow | Reference |
Using GGUF with llama.cpp
# Download GGUF model (from HF or Ollama)
curl -o llama2-7b-q4_k_m.gguf \
https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/...
# Run inference with llama.cpp
./main -m llama2-7b-q4_k_m.gguf \
-p "The answer to life is" \
-n 32 \
-t 4 \
-ngl 0 # 0 = CPU only, 32+ = GPU layers
# Or use Ollama (one-liner; pulls a 4-bit build by default)
ollama run llama2:7b
Default choice: Q4_K_M balances quality (95-97%) and size. For high quality: Q5_K_S or Q6_K. For mobile: Q3_K_M or Q4_K_S. For edge IoT: Q2_K or Q3_K_S.
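The guidance above can be encoded as a small helper that picks a GGUF level from a memory budget. The sizes are the approximate 7B figures from the table above, and the function name and 1.5GB runtime headroom are illustrative assumptions — adjust both for your model and context length:

```python
# Approximate 7B file sizes (GB) per level, from the table above,
# ordered worst -> best quality
GGUF_LEVELS = [
    ("Q2_K", 0.8), ("Q3_K_M", 1.2), ("Q4_K_S", 1.5),
    ("Q4_K_M", 1.8), ("Q5_K_S", 2.2), ("Q6_K", 2.7), ("Q8_0", 3.5),
]

def pick_gguf_level(memory_budget_gb, headroom_gb=1.5):
    """Pick the highest-quality level whose file plus runtime headroom fits."""
    best = None
    for level, size_gb in GGUF_LEVELS:
        if size_gb + headroom_gb <= memory_budget_gb:
            best = level  # keep upgrading while it still fits
    return best

print(pick_gguf_level(3.5))   # laptop-class budget -> Q4_K_M
print(pick_gguf_level(16.0))  # desktop budget -> Q8_0
```

On a 3.5GB budget this lands on Q4_K_M — matching the "recommended default" row in the table — and on a roomy desktop it climbs to near-lossless Q8_0.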
QAT vs PTQ: Training & Inference
When to Quantize-Aware vs Post-Training
Quantization-Aware Training (QAT)
Simulate quantization during training. Model learns to minimize quantization noise. Higher quality but requires full retraining and more compute.
Pros: <1% quality loss, better for aggressive quantization
Cons: 5-100x more compute time
Post-Training Quantization (PTQ)
Apply quantization after training. Quick, no retraining. Calibrate on small dataset. May have 2-5% quality loss.
Pros: Fast (hours, not weeks), uses pretrained weights
Cons: Up to 5% quality loss for aggressive quantization
| Aspect | QAT | PTQ | Recommendation |
|---|---|---|---|
| Calibration Data | 10K-100K samples | 100-1000 samples | PTQ (less data) |
| Time | 1-7 days | 1-5 hours | PTQ (50-100x faster) |
| Quality Loss | <1% | 2-5% | QAT (better) |
| GPU Memory | 32-80GB | 8-16GB | PTQ (cheaper) |
| 4-bit Feasibility | Excellent | Good (8-bit safer) | QAT for aggressive bits |
| Production Use | Critical models | General deployment | PTQ standard |
PyTorch QAT Example
import torch
import torch.quantization as tq
# Prepare model for QAT (inserts fake-quantization observers)
model.train()
model.qconfig = tq.get_default_qat_qconfig('fbgemm')
tq.prepare_qat(model, inplace=True)
# Train with quantization simulation
for epoch in range(5):
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(batch["input_ids"])
        loss = criterion(outputs.logits, batch["labels"])
        loss.backward()
        optimizer.step()
# Convert to a real INT8 model (conversion expects eval mode)
model.eval()
tq.convert(model, inplace=True)
# Now weights are INT8, inference is fast
outputs = model(test_inputs)
Use PTQ if: Time is critical, model is general-purpose, acceptable 2-5% quality loss. Use QAT if: Model is mission-critical, willing to invest weeks, need <1% loss, doing aggressive 2-4 bit quantization.
Hardware Guide for Quantized Models
Choosing Quantization Methods by Platform
NVIDIA GPUs (A100/H100)
Best: GPTQ, AWQ (INT4/INT8)
Kernels: Marlin, CUTLASS
Speed: 2-10x over FP16
Use: High-throughput serving
CPU (x86)
Best: GGUF, INT8
Kernels: AVX-512, AMX (Intel)
Speed: 2-4x over FP32
Use: Local inference
Apple Silicon (M1-M4)
Best: GGUF, QAT
Kernels: Metal, ANE
Speed: 2-8x due to unified memory
Use: On-device macOS/iOS
Mobile (Snapdragon)
Best: INT4, Q3_K
Kernels: Hexagon DSP
Speed: 3-6x vs FP32
Use: On-device Android
ARM Cortex (Edge)
Best: GGUF Q2-Q4
Kernels: NEON, SVE
Speed: 2-4x over FP32
Use: IoT, edge devices
TPU (Google)
Best: INT8, INT4
Kernels: Native INT8
Speed: 4-8x advantage
Use: Google Cloud
Hardware × Quantization Compatibility Matrix
| Hardware | GPTQ | AWQ | GGUF | INT8 QAT | Recommended |
|---|---|---|---|---|---|
| A100/H100 (GPU) | ✓ Excellent | ✓ Best | ○ Slow | ✓ Good | AWQ |
| RTX 4090 (Gaming GPU) | ✓ Good | ✓ Good | ○ Slow | ✓ Fair | GPTQ |
| CPU (Intel/AMD) | ✗ Slow | ✗ Slow | ✓ Good | ✓ Good | GGUF |
| Apple Silicon | ✗ Not optimized | ✗ Not optimized | ✓ Excellent | ✓ Good | GGUF |
| Mobile (Snapdragon) | ✗ No | ✗ No | ✓ OK | ✓ Good | INT4 QAT |
| IoT/Edge (ARM) | ✗ No | ✗ No | ✓ Good | ✓ Good | GGUF |
GPU Server (NVIDIA): AWQ. CPU Local: GGUF Q4_K_M. Apple Mac/iPhone: GGUF. Mobile Android: INT4 QAT. Edge IoT: GGUF Q2-Q3.
Quantization Benchmarks
Quality, Speed & Memory Trade-offs
Perplexity: Quality Retention Across Methods
| Model | FP32 PPL | GPTQ (Q4) | AWQ (Q4) | GGUF (Q4_K) | QAT (INT8) |
|---|---|---|---|---|---|
| Llama-2-7B | 10.63 | 10.81 (+1.7%) | 10.68 (+0.5%) | 10.92 (+2.7%) | 10.65 (+0.2%) |
| Mistral-7B | 8.22 | 8.41 (+2.3%) | 8.27 (+0.6%) | 8.55 (+4%) | 8.24 (+0.2%) |
| Llama-3-8B | 9.14 | 9.35 (+2.3%) | 9.18 (+0.4%) | 9.48 (+3.7%) | 9.16 (+0.2%) |
| Key Finding: All methods stay within 4% of baseline. AWQ/QAT best (<1% loss). GGUF acceptable (2-4% loss) for edge. |||||
Throughput (tokens/sec) - Llama-2-7B on H100
| Method | Tokens/sec | vs FP16 Baseline | Batch Size | Notes |
|---|---|---|---|---|
| FP16 (baseline) | 230 | 1.0x | 32 | Full precision |
| GPTQ (Q4) | 590 | 2.6x | 64 | Standard GPTQ kernels |
| Marlin-GPTQ (Q4) | 710 | 3.1x | 128 | Optimized GPU kernels |
| AWQ (Q4) | 650 | 2.8x | 64 | Activation-aware |
| Marlin-AWQ (Q4) | 741 | 3.2x | 128 | Best production throughput |
| INT8 QAT | 520 | 2.3x | 32 | General hardware support |
| Key Finding: Marlin kernels provide the best throughput: 3.2x (AWQ) and 3.1x (GPTQ) over FP16. Best for high-throughput serving. ||||
Memory Usage & Inference Speed
| Format | Model Size (7B) | Memory Peak (Inference) | vs FP32 | CPU Inference (ms) |
|---|---|---|---|---|
| FP32 | 28GB | 32GB | 1.0x | 850 |
| FP16 | 14GB | 16GB | 0.5x | 500 |
| GPTQ INT4 | 3.5GB | 6GB | 0.19x | 120 |
| GGUF Q4_K_M | 1.8GB | 4GB | 0.13x | 200 |
| GGUF Q3_K_M | 1.2GB | 3GB | 0.09x | 280 |
Quality: All methods achieve 90-99% of baseline quality. Speed: GPU quantization 2.6-3.2x faster. Marlin kernels are best. Memory: INT4 is 8x smaller than FP32. GGUF optimal for CPU.
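Weight-memory numbers like these follow directly from parameter count × bits per weight (ignoring quantization metadata such as per-group scales, which add a few percent):

```python
def model_weight_size_gb(n_params, bits_per_weight):
    """Raw weight storage: params * bits / 8 bytes, reported in GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # 7B parameters
print(model_weight_size_gb(n, 32))  # FP32 -> 28.0 GB
print(model_weight_size_gb(n, 16))  # FP16 -> 14.0 GB
print(model_weight_size_gb(n, 4))   # INT4 -> 3.5 GB
```

These match the FP32/FP16/GPTQ-INT4 rows above; peak inference memory is higher because of the KV cache and activations.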
Quantized Model Directory
Popular Models & Where to Find Them
TheBloke Collections (HuggingFace)
| Model Name | GPTQ (4-bit) | AWQ (4-bit) | GGUF | Quality | Notes |
|---|---|---|---|---|---|
| Llama-2-7B | ✓ | ✓ | ✓ | 97-98% | Most popular, well-tested |
| Mistral-7B | ✓ | ✓ | ✓ | 96-97% | Sliding window, fast inference |
| Llama-3-8B | ✓ | ✓ | ✓ | 97%+ | Latest, strong performance |
| Qwen-2-7B | ✓ | ✓ | ✓ | 96%+ | Multilingual, strong reasoning |
| Mixtral-8x7B | ✓ | ✓ | ✓ | 98% | MoE, selective expert activation |
| Llama-3-70B | ✓ | ✓ | ✓ | 99%+ | Large, high quality |
GGUF Models on Ollama & HuggingFace
Popular GGUF Collections
- • llama.cpp - Original (on GitHub)
- • Ollama - Easiest (ollama.ai)
- • TheBloke - Comprehensive (HF)
- • xBITx - Compact models (HF)
- • mradermacher - Broad GGUF model coverage (HF)
Installation (Ollama)
ollama pull llama2:7b
ollama pull mistral
ollama pull neural-chat
ollama run llama2:7b
Done! Ready to use.
Model Quantization Quality Tiers
Local/Mobile (Q2-Q4)
Q2_K: 800MB, poor quality
Q3_K_M: 1.2GB, fair quality
Q4_K_M: 1.8GB, very good
Server/Desktop (Q4-Q8)
Q5_K_S: 2.2GB, excellent
Q6_K: 2.7GB, high quality
Q8_0: 3.5GB, near-lossless
Want to try now? ollama run llama2:7b (takes ~2 mins). GPU server? Use TheBloke GPTQ or AWQ. Local CPU? Download Q4_K_M GGUF. All models on HuggingFace.
Deployment Strategies
vLLM, llama.cpp, TensorRT-LLM & Triton
vLLM + AWQ/GPTQ
High-throughput serving on NVIDIA GPUs. Paged attention, continuous batching. Marlin kernels for INT4.
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-7B-GPTQ
llama.cpp + GGUF
CPU/edge inference. Cross-platform (macOS, Linux, Windows). Best for local & mobile deployment.
./main -m model.gguf -p "Hello" -t 4 -ngl 32
TensorRT-LLM
NVIDIA's optimized inference engine. Supports INT4, INT8, and FP8. Best latency on A100/H100.
Triton Inference Server
Multi-model server. Supports vLLM, TensorRT-LLM, llama.cpp backends. Advanced batching.
vLLM Deployment Example
from vllm import LLM, SamplingParams
# Load quantized model
llm = LLM(
model="TheBloke/Llama-2-7B-GPTQ",
tensor_parallel_size=1,
dtype="float16"
)
# Batch inference
prompts = ["Hello, how are you?", "What is AI?"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)
outputs = llm.generate(prompts, sampling_params)
# Print results (generate returns one output per prompt)
for output in outputs:
    print(f"Generated: {output.outputs[0].text}")
Deployment Matrix
| Scenario | Best Framework | Quantization | Throughput | Latency |
|---|---|---|---|---|
| Production GPU (multi-user) | vLLM | AWQ/GPTQ | 600+ tok/s | ~50ms |
| Local/CPU | llama.cpp | GGUF Q4_K | 50-100 tok/s | 50-100ms |
| Mobile/Edge | llama.cpp/Ollama | GGUF Q3 | 10-30 tok/s | 100-200ms |
| Multi-model serving | Triton + vLLM | AWQ | 800+ tok/s | ~40ms |
| Latency-critical apps | TensorRT-LLM | INT8/INT4 | Variable | ~20ms |
Choose framework: vLLM for GPU, llama.cpp for CPU. Quantization: AWQ for throughput, GPTQ for compatibility, GGUF for edge. Monitoring: Track latency, throughput, error rates. Scaling: Use tensor parallelism or multi-instance for large models.
Cost Analysis & ROI
When Quantization Pays for Itself
Memory & Cost Savings
| Format | Model Size (7B) | GPU Memory (A100) | GPU Cost/hour | Cost per 1M tokens | vs FP32 |
|---|---|---|---|---|---|
| FP32 | 28GB | 2x A100 (80GB) | $10.24 | $0.35 | 1.0x |
| FP16 | 14GB | 1x A100 | $5.12 | $0.17 | 0.5x |
| GPTQ INT4 | 3.5GB | 1x A100 (shared) | $2.56 | $0.07 | 0.2x |
| AWQ INT4 | 3.5GB | 1x A100 (shared) | $2.56 | $0.04 | 0.11x |
| GGUF (CPU) | 1.8GB | 32GB RAM + 4 vCPU | $1.20 | $0.02 | 0.06x |
Break-Even Analysis
Scenario: 10M tokens/day
FP16 cost: $1,700/month
GPTQ cost: $700/month
Savings: $1,000/month (59%)
ROI: Breaks even in 1 week
Scenario: 100M tokens/day
FP16 cost: $17,000/month
AWQ cost: $4,000/month
Savings: $13,000/month (76%)
ROI: Breaks even in 2 days
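The break-even arithmetic is simple: divide the one-time quantization effort by the daily savings. A sketch using the 10M-tokens/day scenario above; the $250 engineering/GPU cost for running GPTQ calibration is an assumed figure for illustration:

```python
def break_even_days(monthly_cost_before, monthly_cost_after, one_time_effort_cost):
    """Days until cumulative savings cover the one-time quantization effort."""
    daily_savings = (monthly_cost_before - monthly_cost_after) / 30
    return one_time_effort_cost / daily_savings

# 10M tokens/day scenario: FP16 $1,700/mo vs GPTQ $700/mo
days = break_even_days(1700, 700, 250)
print(round(days, 1))  # -> 7.5 days, i.e. roughly one week
```

That reproduces the "breaks even in 1 week" figure; at the 100M-tokens/day scale the savings are ~13x larger, so the same effort pays back in days.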
Quantization Effort vs Savings
| Method | Effort | Time | Quality Loss | Savings | Break-Even |
|---|---|---|---|---|---|
| GPTQ | Low | 1-5h | <2% | 50-70% | 1 week (10M/day) |
| AWQ | Low | 10-30m | <1% | 70-85% | 3 days (10M/day) |
| QAT | High | 1-7 days | <1% | 70-80% | 1 day (100M/day) |
| Recommendation: Use AWQ/GPTQ for most cases. Effort (hours) pays for itself within days for any reasonable production load. | |||||
Cloud GPU Pricing Reference (March 2026)
| GPU | VRAM | On-Demand $/hr | Spot $/hr | Best Quant Use |
|---|---|---|---|---|
| A100 80GB | 80GB | $2.00-3.00 | $1.00-1.80 | Run GPTQ/AWQ quantization; serve INT4 70B |
| A40 48GB | 48GB | $0.80-1.20 | $0.40-0.70 | Serve quantized 7-13B; cost-optimal inference |
| T4 16GB | 16GB | $0.35-0.76 | $0.12-0.30 | Serve quantized 3-7B; cheapest GPU option |
| CPU (32GB RAM) | N/A | $0.10-0.30 | N/A | GGUF models; edge/local deployment |
Quantized Self-Host vs API — Annual Cost (1M req/day)
| Approach | Monthly Cost | Annual Cost | Latency (P50) | vs GPT-4o API |
|---|---|---|---|---|
| GPT-4o API (500 tok/req) | $7,500 | $90,000 | 500-2000ms | Baseline |
| GPT-4o-mini API | $1,125 | $13,500 | 200-800ms | 85% cheaper |
| Llama-3.1-8B FP16 (A100) | $2,160 | $25,920 | 50-150ms | 71% cheaper |
| Llama-3.1-8B GPTQ-INT4 (A40) | $864 | $10,368 | 30-80ms | 88% cheaper |
| Llama-3.1-8B GGUF-Q4 (CPU) | $216 | $2,592 | 200-500ms | 97% cheaper |
Always (immediate ROI): Production serving (>50 req/sec), public APIs, mobile deployment. Usually (weeks payback): >1M tokens/day in API costs. Maybe: Internal tools, batch processing. Maximum savings: Quantize to INT4 + serve on A40/T4 = 88-97% cheaper than premium API, with 3-10x lower latency. Quantization effort is hours; payback is days.
Failure Modes & Mitigation
Common Pitfalls and Recovery Strategies
Outlier Channels
Some weight channels have extreme outliers. Quantizing them directly causes massive quality loss.
Fix: AWQ handles this. Or use per-group quantization. Or skip extreme outliers.
Catastrophic Quality Loss at Q2
2-bit quantization is very aggressive. May cause 10-20% quality loss or complete failure on some models.
Fix: Test empirically. Use Q3+ for safety. Monitor perplexity.
Calibration Data Mismatch
If calibration data is unrepresentative, quantization will be suboptimal for real data.
Fix: Use diverse, domain-representative data. 100-500 samples.
Hardware Mismatch
GPTQ quantized on GPU A may not work well on GPU B (different architectures, precision).
Fix: Quantize on target hardware. Test cross-hardware compatibility.
Kernel Not Available
GPTQ/AWQ require optimized kernels. Without them, inference is slow or impossible.
Fix: Use vLLM (handles kernels). Or use GGUF (universal).
Attention Score Overflow
INT8 attention can overflow during softmax (intermediate values exceed int range).
Fix: Use FP16 for attention. Or use INT4 with per-group scaling.
Before deploying: Test on target hardware. Measure perplexity & task performance. Use diverse calibration data. Monitor for OOM & overflow. Have FP16 fallback ready. Start with Q4 (safe), then try Q3 if needed.
If quality drops: Increase bits (Q4→Q5), use per-group quantization, re-calibrate. If crashes: Use universal format (GGUF), fallback to FP16, reduce batch size. If slow: Check kernels available, use optimized framework (vLLM).
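Monitoring perplexity before and after quantization is the cheapest quality gate. Given per-token negative log-likelihoods from any framework, perplexity is just exp of the mean; a relative threshold flags regressions. (The 3% default threshold here is an assumption — pick one appropriate for your domain.)

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def quantization_regressed(nlls_fp16, nlls_quant, max_rel_increase=0.03):
    """True if quantized perplexity exceeds the FP16 baseline by the threshold."""
    ppl_base, ppl_q = perplexity(nlls_fp16), perplexity(nlls_quant)
    return (ppl_q - ppl_base) / ppl_base > max_rel_increase

base = [2.36, 2.40, 2.31]   # toy per-token NLLs from the FP16 model
quant = [2.38, 2.43, 2.33]  # toy per-token NLLs from the INT4 model
print(quantization_regressed(base, quant))  # -> False: ~2.4% drift, under threshold
```

Run this on the same held-out text for both models; if it trips, apply the recovery ladder above (increase bits, switch to per-group quantization, or re-calibrate).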
Tools & Frameworks
Software Ecosystem for Quantization
auto_gptq
GPTQ quantization library. Easy API, good docs. Supports 4-bit INT4 quantization.
pip install auto-gptq
autoawq
AWQ quantization library. Faster calibration, better quality. Best for production.
pip install autoawq
llama.cpp
CPU inference for GGUF models. Cross-platform, production-ready. No GPU needed.
git clone https://github.com/ggerganov/llama.cpp
vLLM
GPU inference engine. Supports GPTQ, AWQ. Best throughput on NVIDIA.
pip install vllm
Ollama
Easy GGUF model management. One-liner setup. Great for beginners.
brew install ollama
TensorRT-LLM
NVIDIA optimized engine. Best latency. Requires CUDA expertise.
pip install tensorrt-llm
bitsandbytes
HuggingFace INT8 quantization. Easy HF integration. Good for training.
pip install bitsandbytes
Intel Neural Compressor
Intel-optimized quantization. Great for CPU inference on x86.
pip install neural-compressor
ONNX Runtime
Cross-platform quantization and inference. Hardware-agnostic.
pip install onnxruntime
Tool Comparison Matrix
| Tool | Purpose | Main Hardware | Ease of Use | Performance |
|---|---|---|---|---|
| auto_gptq | GPTQ quantization | GPU | Easy | Excellent |
| autoawq | AWQ quantization | GPU | Easy | Excellent |
| vLLM | GPU inference | NVIDIA GPU | Medium | Excellent |
| llama.cpp | CPU inference | CPU | Easy | Very Good |
| Ollama | Easy GGUF | CPU/Mac | Very Easy | Good |
| TensorRT-LLM | NVIDIA optimization | NVIDIA GPU | Hard | Best |
GPU server: vLLM + autoawq. Local CPU: Ollama. Custom training: auto_gptq + bitsandbytes. Edge/mobile: llama.cpp + GGUF.
Research Papers & References
Key Publications on LLM Quantization
Foundational Methods
GPTQ (2023)
Accurate Post-Training Quantization for Generative Pre-Trained Transformers
Frantar et al. (IST Austria & ETH Zurich)
One-shot 4-bit quantization using Hessian. Foundation for modern quantization.
arXiv:2210.17323
AWQ (2024)
Activation-aware Weight Quantization for LLM Compression and Acceleration
Lin et al. (MIT, MIT-IBM Watson AI Lab)
Activation-aware scaling. <1% quality loss at Q4. Better than GPTQ.
arXiv:2306.00978
Advanced Methods
SqueezeLLM (2024)
Sensitive weight identification for 3-bit quantization. Extreme compression with minimal loss. Good for mobile.
QuIP# (2024)
2-bit quantization via vector quantization. Better than naive 2-bit. For extreme edge cases.
AQLM (2024)
Additive Quantization of Language Models. Product quantization approach for better quality at low bits.
BitNet (2024)
1-bit LLMs. Extreme compression, training from scratch. Vision of ultra-efficient models.
Supporting Research
| Paper | Year | Focus | Key Contribution |
|---|---|---|---|
| Quantization and Training of NNs for Efficient Integer-Arithmetic Only Inference | 2018 | QAT | Foundation for fake quantization during training |
| LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | 2022 | INT8 | Row/column-wise INT8. Mixed-precision for outliers |
| ZipLM: Inference-Aware Structured Pruning of Language Models | 2023 | Pruning + Quantization | Combine quantization with structured sparsity |
| OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization | 2023 | Outlier Quantization | Hardware-friendly outlier-victim pair encoding |
Benchmarks & Leaderboards
Open LLM Leaderboard
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
MTEB Benchmark
https://huggingface.co/spaces/mteb/leaderboard
Trend: Lower bits possible with better methods (GPTQ→AWQ→SqueezeLLM). Quality: All modern methods preserve 90-99% quality. Future: 1-bit and 2-bit quantization becoming practical. BitNet shows potential for ultra-efficient models.
Glossary of Quantization Terms
21 key technical terms used throughout this guide, organized alphabetically.
A
| Term | Definition |
|---|---|
| AWQ (Activation-Aware Weight Quantization) | A weight-only quantization method that protects salient weights by observing activation distributions rather than just weights. No backpropagation needed, less calibration data than GPTQ. Best throughput with Marlin kernels. |
B
| Term | Definition |
|---|---|
| BF16 (Brain Float 16) | 16-bit format with 8 exponent bits (same range as FP32) and 7 mantissa bits. Better training stability than FP16 because it handles the same numeric range. Preferred for training. |
| Bitsandbytes | A HuggingFace-integrated library for 8-bit and 4-bit quantization. Provides load_in_8bit and load_in_4bit options for instant quantization of any model. Powers QLoRA. |
C
| Term | Definition |
|---|---|
| Calibration Data | A small representative dataset (128-1024 samples) used by PTQ methods to determine optimal quantization parameters. Quality of calibration data directly impacts quantized model quality. |
D
| Term | Definition |
|---|---|
| Dequantization | Converting quantized (low-precision) values back to higher precision for computation. Happens dynamically during inference for weight-only quantization methods. |
F
| Term | Definition |
|---|---|
| FP16 (Half Precision) | 16-bit floating point format with 5 exponent bits and 10 mantissa bits. Halves memory vs FP32 with minimal quality loss. Standard for inference; requires loss scaling for training. |
G
| Term | Definition |
|---|---|
| GGUF | A file format for storing quantized models, optimized for CPU and Apple Silicon inference. Supports multiple quantization levels (Q2_K through Q8_0). Used by llama.cpp and Ollama. |
| GPTQ (Generative Pre-trained Transformer Quantization) | A one-shot weight quantization method using approximate second-order information (Hessian) for optimal rounding. First method to achieve 4-bit LLM quantization with minimal accuracy loss. |
| Group Quantization | Quantizing weights in groups (e.g., 128 elements share one scale/zero-point) rather than per-tensor or per-channel. Provides finer granularity and better quality than per-tensor quantization. |
I
| Term | Definition |
|---|---|
| INT4 / INT8 | 4-bit and 8-bit integer representations. INT8 provides ~4× compression with 1-2% quality loss. INT4 provides ~8× compression but may need QAT or careful calibration to maintain quality. |
M
| Term | Definition |
|---|---|
| Marlin Kernels | Optimized GPU kernels for quantized matrix multiplication, delivering roughly 3× end-to-end throughput over FP16 for GPTQ and AWQ models in this guide's benchmarks. The key to production quantized inference performance. |
| Mixed-Precision Quantization | Using different precision levels for different layers or components (e.g., INT8 for most layers, FP16 for sensitive attention layers). Balances compression and quality. |
O
| Term | Definition |
|---|---|
| Outlier Channels | Weight channels with extreme values that cause disproportionate quality degradation when quantized. AWQ specifically addresses this by protecting salient weights based on activation patterns. |
P
| Term | Definition |
|---|---|
| Per-Channel Quantization | Computing separate scale and zero-point values for each output channel of a weight matrix. More accurate than per-tensor but requires more storage for quantization parameters. |
| Post-Training Quantization (PTQ) | Quantizing a trained model without retraining. Fast (minutes to hours) but may have higher quality loss than QAT. Methods: GPTQ, AWQ, Round-To-Nearest. |
Q
| Term | Definition |
|---|---|
| QAT (Quantization-Aware Training) | Simulating quantization during training so the model learns to compensate for precision loss. Better quality than PTQ but requires full training infrastructure and compute. |
| Quantization | Reducing the numerical precision of model weights and/or activations to decrease memory footprint and increase inference speed. The most impactful single optimization for production LLM deployment. |
S
| Term | Definition |
|---|---|
| Scale Factor | A floating-point multiplier used to map between quantized integer values and their original floating-point range. Computed as (max - min) / (2^bits - 1). |
| SmoothQuant | A quantization technique that migrates quantization difficulty from activations to weights by applying mathematically equivalent per-channel scaling. Enables effective activation quantization. |
W
| Term | Definition |
|---|---|
| Weight-Only Quantization | Quantizing only model weights while keeping activations in higher precision (FP16). Simpler than weight+activation quantization and preserves most quality. Used by GPTQ and AWQ. |
Z
| Term | Definition |
|---|---|
| Zero-Point | An integer value representing the quantized equivalent of floating-point zero. Used in asymmetric quantization: q = round(x/scale + zero_point). |