LLM Quantization — Compression Techniques & Production Deployment

Comprehensive guide to quantization methods for large language models — from precision reduction through GPTQ, AWQ, GGUF to production inference at 2-8x compression with 90-99% quality retention.

Topics: Quantization · Model Compression · Inference Optimization · Production Deployment · Cost Reduction

At a glance: 2-8x compression ratio · 90-99% quality retention · 2-10x speedup · 50-80% cost reduction

A comprehensive guide to quantization techniques for large language models. Learn precision reduction, weight quantization, post-training & quantization-aware approaches, and production deployment patterns with vLLM, llama.cpp, and TensorRT-LLM.

What is Quantization?

Reducing Precision of Weights and Activations

FP32: 32-bit float, ~4GB per 1B params (baseline)
FP16: 16-bit float, ~2GB per 1B params (2x compression)
INT8: 8-bit integer, ~1GB per 1B params (4x compression)
INT4: 4-bit integer, ~500MB per 1B params (8x compression)
Quality vs size: FP32 retains 100% quality; INT4 retains 90-99% at 8x smaller.

Quantization reduces the precision of model weights and activations from floating-point (FP32/FP16) to lower-bit integer representations (INT8/INT4). This dramatically reduces model size and memory, enabling faster inference and lower costs while maintaining 90-99% of original quality.

Why Quantization Matters

  • ✓ 2-8x memory reduction
  • ✓ 2-10x inference speedup
  • ✓ 50-80% cost savings
  • ✓ Deploy to edge & mobile
  • ✓ Enable real-time inference
Key Insight

Modern quantization methods (GPTQ, AWQ) preserve 90-99% of model quality while achieving 8x compression. This makes them essential for production LLM deployment where cost and latency are critical.

Quantization Fundamentals

Core Concepts & Approaches

Post-Training Quantization (PTQ)

Apply quantization after training is complete. No retraining needed. Calibrate on small dataset (100-1000 samples). Fast but may have quality loss.

Best for: Quick deployment, limited compute

Quantization-Aware Training (QAT)

Simulate quantization during training. Model learns to compensate for precision loss. Higher quality but requires full retraining.

Best for: High-quality models, custom training

Weight-Only Quantization

Quantize only weights; keep activations in FP32/FP16. Simpler to implement, minimal accuracy loss, still achieves 4-8x speedup.
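As a toy illustration of weight-only quantization (all values here are made up), the weights are stored as integers with one scale factor and dequantized on the fly inside the matmul, while activations stay in floating point:

```python
def quantize_weights(w, bits=8):
    """Weight-only: store ints plus one scale; activations stay float."""
    scale = max(abs(v) for v in w) / (2 ** (bits - 1) - 1)
    q = [round(v / scale) for v in w]
    return q, scale

def linear(q_weights, scale, x):
    # Dequantize on the fly; the multiply-accumulate runs in float
    return sum((qi * scale) * xi for qi, xi in zip(q_weights, x))

w = [0.5, -0.25, 0.125, 1.0]   # toy weights
x = [1.0, 2.0, 3.0, 4.0]       # toy activations (kept in float)
q, s = quantize_weights(w)

exact = sum(wi * xi for wi, xi in zip(w, x))
approx = linear(q, s, x)
assert abs(exact - approx) < 0.05   # small error, 4x less weight memory
```

The stored tensor is the integer list plus a single float scale, which is where the memory saving comes from.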

Weight+Activation Quantization

Quantize both weights and activations. Maximum speedup, more quality loss. Requires careful calibration.

Quantization Types

Symmetric Quantization

Range is symmetric around zero. Formula: q = round(x / scale)

Asymmetric Quantization

Range may not be symmetric. Formula: q = round(x/scale + zero_point)

Per-Tensor Quantization

One scale factor per entire tensor. Fast but less accurate for heterogeneous values.

Per-Channel Quantization

Different scale per channel. Better quality but higher overhead. Popular for weights.
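A small self-contained sketch (toy numbers, not from any real model) shows why per-channel scales help when channel magnitudes differ:

```python
def quantize_dequantize(row, bits=8):
    """Symmetric round-trip: q = round(x / scale), x' = q * scale."""
    scale = max(abs(v) for v in row) / (2 ** (bits - 1) - 1)
    return [round(v / scale) * scale for v in row]

# Two channels with very different magnitudes (heterogeneous values)
weights = [
    [0.011, -0.007, 0.003],   # small-magnitude channel
    [5.0, -3.2, 4.1],         # large-magnitude channel
]

# Per-tensor: one scale for everything; the small channel loses precision
flat = [v for row in weights for v in row]
per_tensor = quantize_dequantize(flat, bits=4)

# Per-channel: each row gets its own scale
per_channel = [quantize_dequantize(row, bits=4) for row in weights]

err_tensor = sum(abs(a - b) for a, b in zip(flat, per_tensor))
err_channel = sum(abs(a - b) for row, qrow in zip(weights, per_channel)
                  for a, b in zip(row, qrow))
assert err_channel < err_tensor   # finer scales preserve the small channel
```

With a single per-tensor scale set by the 5.0 outlier, the small channel rounds entirely to zero; per-channel scales avoid this.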

Core Formulas

Quantization: q = clamp(round(x / scale + zero_point), q_min, q_max)
Dequantization: x' = (q - zero_point) × scale
Scale: scale = (max - min) / (2^bits - 1)
Zero Point: zero_point = -round(min / scale)
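These four formulas can be exercised end-to-end in plain Python; the sketch below quantizes a toy tensor to unsigned 8-bit and back (values are illustrative):

```python
def quantize_params(values, bits=8):
    """Derive scale and zero_point from the min/max range (asymmetric)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2 ** bits - 1)
    zero_point = -round(lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, bits=8):
    q = round(x / scale + zero_point)
    return max(0, min(2 ** bits - 1, q))      # clamp to [q_min, q_max]

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

values = [-1.2, -0.3, 0.0, 0.7, 2.5]
scale, zp = quantize_params(values)
restored = [dequantize(quantize(v, scale, zp), scale, zp) for v in values]

# Round-trip error is bounded by half a quantization step (scale / 2)
assert all(abs(v - r) <= scale / 2 + 1e-9 for v, r in zip(values, restored))
```

Note that 0.0 round-trips exactly, which is the point of the zero-point: real zeros (e.g., from padding or ReLU) introduce no error.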

Key Parameters

  • Bits: 4 (INT4) or 8 (INT8). Lower bits give more compression but more quality loss.
  • Calibration: choose representative data from your target domain.
  • Group size: per-group scales (e.g., 128 weights per group) balance quality and speed.
  • Outliers: extreme values need special handling.
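A toy sketch of why group size matters (illustrative numbers): with one scale per row, a single outlier wrecks the resolution of every other weight, while per-group scales confine the damage to the outlier's own group:

```python
def qdq(vals, bits=4):
    """Symmetric quantize->dequantize with one shared scale."""
    scale = max(abs(v) for v in vals) / (2 ** (bits - 1) - 1)
    return [round(v / scale) * scale for v in vals]

# One row of weights with an outlier at the end
row = [0.02, -0.01, 0.03, -0.02, 0.01, 0.02, -0.03, 8.0]

whole_row = qdq(row)                       # one scale, dominated by 8.0
group_size = 4
grouped = []
for i in range(0, len(row), group_size):   # one scale per group of 4
    grouped += qdq(row[i:i + group_size])

err_row = sum(abs(a - b) for a, b in zip(row, whole_row))
err_grp = sum(abs(a - b) for a, b in zip(row, grouped))
assert err_grp < err_row   # the outlier only hurts its own group
```

Smaller groups mean more scale factors to store, which is the quality/size trade-off the text describes.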

GPTQ: GPU-Quantized LLMs

Efficient 4-bit Quantization via Hessian Information

What is GPTQ?

One-shot weight quantization using approximate second-order info (Hessian). Quantizes one layer at a time with optimal rounding. Achieves 4-bit quantization with minimal accuracy loss (< 2%).

Key Advantages

  • ✓ 4-bit quantization (8x compression)
  • ✓ <2% accuracy loss
  • ✓ Fast inference on GPUs
  • ✓ No retraining needed
  • ✓ Works with any model

How GPTQ Works

Step 1: Quantize the model one layer at a time. Step 2: For each weight, find the optimal quantized value using approximate second-order information (the inverse Hessian of the layer-wise reconstruction error). Step 3: Minimize the Hessian-weighted reconstruction error ||W - Q||_H^2. Step 4: Update the not-yet-quantized weights to compensate for the error just introduced.
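Step 4 is the heart of GPTQ. Below is a deliberately minimal sketch of the compensation idea, in the degenerate case of two perfectly correlated inputs; real GPTQ uses the inverse Hessian to spread each rounding error across all remaining weights:

```python
def round_to_grid(w, scale=0.5):
    """Round a weight to a coarse quantization grid."""
    return round(w / scale) * scale

# Toy layer: y = x1*w1 + x2*w2, with perfectly correlated inputs x1 == x2
x = [1.0, 2.0, 3.0]          # calibration activations (shared by both inputs)
w1, w2 = 0.30, 0.80

# Naive rounding: quantize both weights independently
q1n, q2n = round_to_grid(w1), round_to_grid(w2)

# GPTQ-style: quantize w1, shift its rounding error onto the
# not-yet-quantized w2 (exact here because the inputs are identical),
# then quantize w2
q1 = round_to_grid(w1)
w2_adj = w2 + (w1 - q1)      # compensation step
q2 = round_to_grid(w2_adj)

def output_err(a, b):
    """Total output error versus the full-precision layer."""
    return sum(abs((w1 * xi + w2 * xi) - (a * xi + b * xi)) for xi in x)

assert output_err(q1, q2) < output_err(q1n, q2n)
```

Both paths use the same grid; only the compensated path accounts for the error already committed, which is why its layer output is closer to full precision.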

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Configuration
quantize_config = BaseQuantizeConfig(
    bits=4,              # 4-bit quantization
    group_size=128,      # weights per quantization group
    desc_act=True,       # quantize in descending activation order (better accuracy)
    static_groups=False,
)

# Load and quantize
model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantize_config=quantize_config,
    device_map="auto",
)

# Calibrate on data
calibration_dataset = [...]  # 100-500 samples
model.quantize(calibration_dataset)

# Inference
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
Metric | GPTQ | AWQ | GGUF (Q4_K)
Compression | 8x | 8x | 8x
Quality Loss | <2% | <1% | 2-3%
GPU Speedup (fp16→q4) | 2.6x | 3.2x | N/A
Calibration Time | 1-5 hours | 10-30 min | 30-60 min
Best For | GPU inference | High-throughput serving | CPU/edge
Best Practices

Calibration: Use 100-500 diverse samples from your domain. Group size: 128 is typical. Smaller (32) = better quality, larger (256) = faster. Hardware: Quantize on same hardware as inference (GPU type matters).

AWQ: Activation-Aware Quantization

Protecting Salient Weights Based on Activation Distributions

What is AWQ?

Activation-aware weight quantization. Analyzes activation distributions to identify which weights are most important. Protects salient weights with higher precision, quantizes others aggressively.

Advantages over GPTQ

  • ✓ <1% accuracy loss
  • ✓ 10-30 min calibration
  • ✓ Less calibration data needed
  • ✓ 3.2x GPU speedup (better kernels)
  • ✓ Best for vLLM serving

How AWQ Works

AWQ searches for weight scaling factors that minimize quantization error weighted by activation magnitudes. For weights with high activations, use finer quantization. For low-activation weights, quantize aggressively. Optimal scaling factors are computed per-channel via iterative optimization.
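A toy sketch of the scaling trick (illustrative numbers, one output neuron): scaling a salient weight channel up before quantization, and folding the inverse scale into its activations, is mathematically equivalent but shrinks that channel's quantization error:

```python
def qdq(vals, bits=4):
    """Symmetric quantize->dequantize with one shared scale."""
    scale = max(abs(v) for v in vals) / (2 ** (bits - 1) - 1)
    return [round(v / scale) * scale for v in vals]

# One output neuron: y = sum(w_i * x_i). Channel 0 sees large activations,
# so its weight matters most even though all weights are similar in size.
w = [0.10, 0.11, -0.12, 0.09]
x = [10.0, 0.5, 0.4, 0.3]            # channel 0 is "salient"

s = [4.0, 1.0, 1.0, 1.0]             # scale up the salient weight channel
w_scaled = [wi * si for wi, si in zip(w, s)]
q_scaled = qdq(w_scaled)
# Fold the scale into the activations (x_i / s_i): mathematically equivalent
y_awq = sum(qi * xi / si for qi, xi, si in zip(q_scaled, x, s))

y_naive = sum(qi * xi for qi, xi in zip(qdq(w), x))
y_true = sum(wi * xi for wi, xi in zip(w, x))
assert abs(y_awq - y_true) < abs(y_naive - y_true)
```

The scaled path quantizes the salient channel more finely at the cost of the low-activation channels, a net win because the salient channel dominates the output. Real AWQ searches for these per-channel scales over calibration data.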

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Load model
model_path = "meta-llama/Llama-2-7b-hf"
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize config
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

# Calibration and quantization
calibration_dataset = [...]  # 32-256 samples of your data
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=calibration_dataset,
)

# Save quantized model
model.save_quantized("./llama2-7b-awq")
Marlin Kernels: Best Throughput

Marlin-AWQ kernels: 741 tokens/sec (Llama-2-7B on H100), a 3.2x speedup vs FP16. Best for high-throughput inference with vLLM. Combine with batching for maximum throughput.

GGUF: Format for Quantized Models

CPU & Apple Silicon Optimized Quantization Levels

What is GGUF?

GGUF (GGML Universal Format) is a file format for quantized models evolved from GGML. Supports multiple quantization levels (Q2_K to Q8_0) optimized for CPU and Apple Silicon. Used by llama.cpp and Ollama.

Key Features

  • ✓ Multiple quantization levels
  • ✓ CPU-optimized inference
  • ✓ Apple Silicon M1-M4 support
  • ✓ Edge & local deployment
  • ✓ Ollama integration

GGUF Quantization Levels

Level | Bits | Size (7B model) | Quality | Speed | Use Case
Q2_K | 2-3 | ~800MB | Poor (80-85%) | Very Fast | Extreme edge, IoT
Q3_K_S | 3 | ~1GB | Fair (85-90%) | Very Fast | Phone, edge
Q3_K_M | 3 | ~1.2GB | Good (88-92%) | Fast | Mobile edge
Q4_K_S | 4 | ~1.5GB | Very Good (92-95%) | Fast | Consumer laptops
Q4_K_M | 4 | ~1.8GB | Excellent (95-97%) | Medium | Recommended default
Q5_K_S | 5 | ~2.2GB | Excellent (97-98%) | Medium | Desktop
Q6_K | 6 | ~2.7GB | Excellent (98%+) | Slow | High quality local
Q8_0 | 8 | ~3.5GB | Near-lossless (99%+) | Slow | Reference

Using GGUF with llama.cpp

# Download a GGUF model (from Hugging Face)
curl -L -o llama2-7b-q4_k_m.gguf \
  https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/...

# Run inference with llama.cpp
./main -m llama2-7b-q4_k_m.gguf \
  -p "The answer to life is" \
  -n 32 \
  -t 4 \
  -ngl 0      # GPU layers: 0 = CPU only, 32+ offloads to GPU

# Or use Ollama (one-liner)
ollama run llama2:7b
Recommendation

Default choice: Q4_K_M balances quality (95-97%) and size. For high quality: Q5_K_S or Q6_K. For mobile: Q3_K_M or Q4_K_S. For edge IoT: Q2_K or Q3_K_S.

QAT vs PTQ: Training & Inference

When to Quantize-Aware vs Post-Training

Quantization-Aware Training (QAT)

Simulate quantization during training. Model learns to minimize quantization noise. Higher quality but requires full retraining and more compute.

Pros: <1% quality loss, better for aggressive quantization

Cons: 5-100x more compute time
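The "simulated quantization" is a fake-quantize op in the forward pass: quantize, clamp, dequantize, so the loss sees quantization noise while the straight-through estimator lets gradients flow. A minimal sketch (toy scale and zero-point):

```python
def fake_quantize(x, scale, zero_point, q_min=0, q_max=255):
    """Forward pass of QAT's fake-quant: quantize, clamp, dequantize.
    The model trains against x' (with quantization noise) instead of x."""
    q = max(q_min, min(q_max, round(x / scale + zero_point)))
    return (q - zero_point) * scale

scale, zp = 0.02, 128          # illustrative parameters
x = 0.537
x_fq = fake_quantize(x, scale, zp)

# The simulated value differs from x by at most half a step (scale / 2)
assert abs(x - x_fq) <= scale / 2
# In the backward pass, the straight-through estimator treats
# d(x_fq)/dx as 1 inside the clamp range, so gradients pass through
# the non-differentiable round().
```

Because every forward pass routes weights and activations through this op, the optimizer nudges parameters toward values that survive rounding.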

Post-Training Quantization (PTQ)

Apply quantization after training. Quick, no retraining. Calibrate on small dataset. May have 2-5% quality loss.

Pros: Fast (hours, not weeks), uses pretrained weights

Cons: Up to 5% quality loss for aggressive quantization

Aspect | QAT | PTQ | Recommendation
Calibration Data | 10K-100K samples | 100-1000 samples | PTQ (less data)
Time | 1-7 days | 1-5 hours | PTQ (50-100x faster)
Quality Loss | <1% | 2-5% | QAT (better)
GPU Memory | 32-80GB | 8-16GB | PTQ (cheaper)
4-bit Feasibility | Good | Good (8-bit better) | Both work
Production Use | Critical models | General deployment | PTQ standard

PyTorch QAT Example

import torch
import torch.quantization as tq

# Prepare model for QAT (inserts fake-quant observers)
model.train()
model.qconfig = tq.get_default_qat_qconfig('fbgemm')
tq.prepare_qat(model, inplace=True)

# Train with quantization simulation
for epoch in range(5):
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(batch["input_ids"])
        loss = criterion(outputs.logits, batch["labels"])
        loss.backward()
        optimizer.step()

# Convert to a true quantized model (weights become INT8)
model.eval()
tq.convert(model, inplace=True)
outputs = model(test_inputs)
Decision Framework

Use PTQ if: Time is critical, model is general-purpose, acceptable 2-5% quality loss. Use QAT if: Model is mission-critical, willing to invest weeks, need <1% loss, doing aggressive 2-4 bit quantization.

Hardware Guide for Quantized Models

Choosing Quantization Methods by Platform

NVIDIA GPUs (A100/H100)

Best: GPTQ, AWQ (INT4/INT8)
Kernels: Marlin, CUTLASS
Speed: 2-10x over FP16
Use: High-throughput serving

CPU (x86)

Best: GGUF, INT8
Kernels: AVX-512, AMX (Intel)
Speed: 2-4x over FP32
Use: Local inference

Apple Silicon (M1-M4)

Best: GGUF, QAT
Kernels: Metal, ANE
Speed: 2-8x due to unified memory
Use: On-device macOS/iOS

Mobile (Snapdragon)

Best: INT4, Q3_K
Kernels: Hexagon DSP
Speed: 3-6x vs FP32
Use: On-device Android

ARM Cortex (Edge)

Best: GGUF Q2-Q4
Kernels: NEON, SVE
Speed: 2-4x over FP32
Use: IoT, edge devices

TPU (Google)

Best: INT8, INT4
Kernels: Native INT8
Speed: 4-8x advantage
Use: Google Cloud

Hardware × Quantization Compatibility Matrix

Hardware | GPTQ | AWQ | GGUF | INT8 QAT | Recommended
A100/H100 (GPU) | ✓ Excellent | ✓ Best | ○ Slow | ✓ Good | AWQ
RTX 4090 (Gaming GPU) | ✓ Good | ✓ Good | ○ Slow | ✓ Fair | GPTQ
CPU (Intel/AMD) | ✗ Slow | ✗ Slow | ✓ Good | ✓ Good | GGUF
Apple Silicon | ✗ Not optimized | ✗ Not optimized | ✓ Excellent | ✓ Good | GGUF
Mobile (Snapdragon) | ✗ No | ✗ No | ✓ OK | ✓ Good | INT4 QAT
IoT/Edge (ARM) | ✗ No | ✗ No | ✓ Good | ✓ Good | GGUF
Quick Decision Guide

GPU Server (NVIDIA): AWQ. CPU Local: GGUF Q4_K_M. Apple Mac/iPhone: GGUF. Mobile Android: INT4 QAT. Edge IoT: GGUF Q2-Q3.

Quantization Benchmarks

Quality, Speed & Memory Trade-offs

Perplexity: Quality Retention Across Methods

Model | FP32 PPL | GPTQ (Q4) | AWQ (Q4) | GGUF (Q4_K) | QAT (INT8)
Llama-2-7B | 10.63 | 10.81 (+1.7%) | 10.68 (+0.5%) | 10.92 (+2.7%) | 10.65 (+0.2%)
Mistral-7B | 8.22 | 8.41 (+2.3%) | 8.27 (+0.6%) | 8.55 (+4%) | 8.24 (+0.2%)
Llama-3-8B | 9.14 | 9.35 (+2.3%) | 9.18 (+0.4%) | 9.48 (+3.7%) | 9.16 (+0.2%)
Key Finding: All methods stay within 4% of baseline. AWQ/QAT best (<1% loss). GGUF acceptable (2-4% loss) for edge.

Throughput (tokens/sec) - Llama-2-7B on H100

Method | Tokens/sec | vs FP16 Baseline | Batch Size | Notes
FP16 (baseline) | 230 | 1.0x | 32 | Full precision
GPTQ (Q4) | 590 | 2.6x | 64 | Standard GPTQ kernels
Marlin-GPTQ (Q4) | 710 | 3.1x | 128 | Optimized GPU kernels
AWQ (Q4) | 650 | 2.8x | 64 | Activation-aware
Marlin-AWQ (Q4) | 741 | 3.2x | 128 | Best production throughput
INT8 QAT | 520 | 2.3x | 32 | General hardware support
Key Finding: Marlin kernels provide 3.1x (GPTQ) and 3.2x (AWQ) speedups over FP16. Best for high-throughput serving.

Memory Usage & Inference Speed

Format | Model Size (7B) | Memory Peak (Inference) | vs FP32 | CPU Inference (ms)
FP32 | 28GB | 32GB | 1.0x | 850
FP16 | 14GB | 16GB | 0.5x | 500
GPTQ INT4 | 3.5GB | 6GB | 0.19x | 120
GGUF Q4_K_M | 1.8GB | 4GB | 0.13x | 200
GGUF Q3_K_M | 1.2GB | 3GB | 0.09x | 280
Benchmark Summary

Quality: All methods achieve 90-99% of baseline quality. Speed: GPU quantization 2.6-3.2x faster. Marlin kernels are best. Memory: INT4 is 8x smaller than FP32. GGUF optimal for CPU.

Quantized Model Directory

Popular Models & Where to Find Them

TheBloke Collections (HuggingFace)

All models below are available in GPTQ (4-bit), AWQ (4-bit), and GGUF formats:

Model Name | Quality | Notes
Llama-2-7B | 97-98% | Most popular, well-tested
Mistral-7B | 96-97% | Sliding window attention, fast inference
Llama-3-8B | 97%+ | Latest, strong performance
Qwen-2-7B | 96%+ | Multilingual, strong reasoning
Mixtral-8x7B | 98% | MoE, selective expert activation
Llama-3-70B | 99%+ | Large, high quality

GGUF Models on Ollama & HuggingFace

Popular GGUF Collections

  • llama.cpp - Original (on GitHub)
  • Ollama - Easiest (ollama.ai)
  • TheBloke - Comprehensive (HF)
  • xBITx - Compact models (HF)
  • mradermacher - Audio/multimodal

Installation (Ollama)

ollama pull llama2:7b
ollama pull mistral
ollama pull neural-chat
ollama run llama2:7b
Done! Ready to use.

Model Quantization Quality Tiers

Local/Mobile (Q2-Q4)

Q2_K: 800MB, poor quality
Q3_K_M: 1.2GB, fair quality
Q4_K_M: 1.8GB, very good

Server/Desktop (Q4-Q8)

Q5_K_S: 2.2GB, excellent
Q6_K: 2.7GB, high quality
Q8_0: 3.5GB, near-lossless

Quick Start

Want to try now? ollama run llama2:7b (takes 2 mins). GPU server? Use TheBloke GPTQ or AWQ. Local CPU? Download Q4_K_M GGUF. All models on HuggingFace.

Deployment Strategies

vLLM, llama.cpp, TensorRT-LLM & Triton

vLLM + AWQ/GPTQ

High-throughput serving on NVIDIA GPUs. Paged attention, continuous batching. Marlin kernels for INT4.

pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-7B-GPTQ

llama.cpp + GGUF

CPU/edge inference. Cross-platform (macOS, Linux, Windows). Best for local & mobile deployment.

./main -m model.gguf -p "Hello" -t 4 -ngl 32

TensorRT-LLM

NVIDIA optimized inference engine. INT4/INT8, fp8. Best latency on A100/H100.

Triton Inference Server

Multi-model server. Supports vLLM, TensorRT-LLM, llama.cpp backends. Advanced batching.

vLLM Deployment Example

from vllm import LLM, SamplingParams

# Load quantized model
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    tensor_parallel_size=1,
    dtype="float16",
)

# Batch inference
prompts = ["Hello, how are you?", "What is AI?"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)
outputs = llm.generate(prompts, sampling_params)

# Print each completed request
for output in outputs:
    print(f"Generated: {output.outputs[0].text}")

Deployment Matrix

Scenario | Best Framework | Quantization | Throughput | Latency
Production GPU (multi-user) | vLLM | AWQ/GPTQ | 600+ tok/s | ~50ms
Local/CPU | llama.cpp | GGUF Q4_K | 50-100 tok/s | 50-100ms
Mobile/Edge | llama.cpp/Ollama | GGUF Q3 | 10-30 tok/s | 100-200ms
Multi-model serving | Triton + vLLM | AWQ | 800+ tok/s | ~40ms
Low-latency apps | TensorRT-LLM | INT8/INT4 | Variable | ~20ms
Production Checklist

Choose framework: vLLM for GPU, llama.cpp for CPU. Quantization: AWQ for throughput, GPTQ for compatibility, GGUF for edge. Monitoring: Track latency, throughput, error rates. Scaling: Use tensor parallelism or multi-instance for large models.

Cost Analysis & ROI

When Quantization Pays for Itself

Memory & Cost Savings

Format | Model Size (7B) | GPU Memory (A100) | GPU Cost/hour | Cost per 1M tokens | Cost vs FP32
FP32 | 28GB | 2x A100 (80GB) | $10.24 | $0.35 | 1.0x
FP16 | 14GB | 1x A100 | $5.12 | $0.17 | 0.5x
GPTQ INT4 | 3.5GB | 1x A100 (shared) | $2.56 | $0.07 | 0.2x
AWQ INT4 | 3.5GB | 1x A100 (shared) | $2.56 | $0.04 | 0.11x
GGUF (CPU) | 1.8GB | 32GB RAM + 4 vCPU | $1.20 | $0.02 | 0.06x

Break-Even Analysis

Scenario: 10M tokens/day

FP16 cost: $1,700/month
GPTQ cost: $700/month
Savings: $1,000/month (59%)
ROI: Breaks even in 1 week

Scenario: 100M tokens/day

FP16 cost: $17,000/month
AWQ cost: $4,000/month
Savings: $13,000/month (76%)
ROI: Breaks even in 2 days
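The percentages in the two scenarios are straightforward arithmetic; a quick check using the numbers above:

```python
def monthly_savings(baseline_cost, quantized_cost):
    """Absolute and relative savings from serving quantized models."""
    saved = baseline_cost - quantized_cost
    return saved, saved / baseline_cost

# 10M tokens/day scenario: FP16 $1,700/mo vs GPTQ $700/mo
saved_10m, pct_10m = monthly_savings(1700, 700)
assert saved_10m == 1000 and round(pct_10m * 100) == 59

# 100M tokens/day scenario: FP16 $17,000/mo vs AWQ $4,000/mo
saved_100m, pct_100m = monthly_savings(17000, 4000)
assert saved_100m == 13000 and round(pct_100m * 100) == 76
```

Since the one-time quantization effort is measured in hours of engineering time, monthly savings at either scale recover it within days.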

Quantization Effort vs Savings

Method | Effort | Time | Quality Loss | Savings | Break-Even
GPTQ | Low | 1-5h | <2% | 50-70% | 1 week (10M/day)
AWQ | Low | 10-30m | <1% | 70-85% | 3 days (10M/day)
QAT | High | 1-7 days | <1% | 70-80% | 1 day (100M/day)
Recommendation: Use AWQ/GPTQ for most cases. Effort (hours) pays for itself within days for any reasonable production load.

Cloud GPU Pricing Reference (March 2026)

GPU | VRAM | On-Demand $/hr | Spot $/hr | Best Quant Use
A100 80GB | 80GB | $2.00-3.00 | $1.00-1.80 | Run GPTQ/AWQ quantization; serve INT4 70B
A40 48GB | 48GB | $0.80-1.20 | $0.40-0.70 | Serve quantized 7-13B; cost-optimal inference
T4 16GB | 16GB | $0.35-0.76 | $0.12-0.30 | Serve quantized 3-7B; cheapest GPU option
CPU (32GB RAM) | N/A | $0.10-0.30 | N/A | GGUF models; edge/local deployment

Quantized Self-Host vs API — Annual Cost (1M req/day)

Approach | Monthly Cost | Annual Cost | Latency (P50) | vs GPT-4o API
GPT-4o API (500 tok/req) | $7,500 | $90,000 | 500-2000ms | Baseline
GPT-4o-mini API | $1,125 | $13,500 | 200-800ms | 85% cheaper
Llama-3.1-8B FP16 (A100) | $2,160 | $25,920 | 50-150ms | 71% cheaper
Llama-3.1-8B GPTQ-INT4 (A40) | $864 | $10,368 | 30-80ms | 88% cheaper
Llama-3.1-8B GGUF-Q4 (CPU) | $216 | $2,592 | 200-500ms | 97% cheaper
When Quantization Makes Sense

Always (immediate ROI): Production serving (>50 req/sec), public APIs, mobile deployment. Usually (weeks payback): >1M tokens/day in API costs. Maybe: Internal tools, batch processing. Maximum savings: Quantize to INT4 + serve on A40/T4 = 88-97% cheaper than premium API, with 3-10x lower latency. Quantization effort is hours; payback is days.

Failure Modes & Mitigation

Common Pitfalls and Recovery Strategies

Outlier Channels

Some weight channels have extreme outliers. Quantizing them directly causes massive quality loss.

Fix: AWQ handles this. Or use per-group quantization. Or skip extreme outliers.

Catastrophic Quality Loss at Q2

2-bit quantization is very aggressive. May cause 10-20% quality loss or complete failure on some models.

Fix: Test empirically. Use Q3+ for safety. Monitor perplexity.

Calibration Data Mismatch

If calibration data is unrepresentative, quantization will be suboptimal for real data.

Fix: Use diverse, domain-representative data. 100-500 samples.

Hardware Mismatch

GPTQ quantized on GPU A may not work well on GPU B (different architectures, precision).

Fix: Quantize on target hardware. Test cross-hardware compatibility.

Kernel Not Available

GPTQ/AWQ require optimized kernels. Without them, inference is slow or impossible.

Fix: Use vLLM (handles kernels). Or use GGUF (universal).

Attention Score Overflow

INT8 attention can overflow during softmax (intermediate values exceed int range).

Fix: Use FP16 for attention. Or use INT4 with per-group scaling.

Mitigation Checklist

Before deploying: Test on target hardware. Measure perplexity & task performance. Use diverse calibration data. Monitor for OOM & overflow. Have FP16 fallback ready. Start with Q4 (safe), then try Q3 if needed.

Recovery Strategies

If quality drops: Increase bits (Q4→Q5), use per-group quantization, re-calibrate. If crashes: Use universal format (GGUF), fallback to FP16, reduce batch size. If slow: Check kernels available, use optimized framework (vLLM).

Tools & Frameworks

Software Ecosystem for Quantization

auto_gptq

GPTQ quantization library. Easy API, good docs. Supports 4-bit INT4 quantization.

pip install auto-gptq

autoawq

AWQ quantization library. Faster calibration, better quality. Best for production.

pip install autoawq

llama.cpp

CPU inference for GGUF models. Cross-platform, production-ready. No GPU needed.

git clone https://github.com/ggerganov/llama.cpp

vLLM

GPU inference engine. Supports GPTQ, AWQ. Best throughput on NVIDIA.

pip install vllm

Ollama

Easy GGUF model management. One-liner setup. Great for beginners.

brew install ollama

TensorRT-LLM

NVIDIA optimized engine. Best latency. Requires CUDA expertise.

pip install tensorrt-llm

bitsandbytes

HuggingFace INT8 quantization. Easy HF integration. Good for training.

pip install bitsandbytes

Intel Neural Compressor

Intel-optimized quantization. Great for CPU inference on x86.

pip install neural-compressor

ONNX Runtime

Cross-platform quantization and inference. Hardware-agnostic.

pip install onnxruntime

Tool Comparison Matrix

Tool | Purpose | Main Hardware | Ease of Use | Performance
auto_gptq | GPTQ quantization | GPU | Easy | Excellent
autoawq | AWQ quantization | GPU | Easy | Excellent
vLLM | GPU inference | NVIDIA GPU | Medium | Excellent
llama.cpp | CPU inference | CPU | Easy | Very Good
Ollama | Easy GGUF | CPU/Mac | Very Easy | Good
TensorRT-LLM | NVIDIA optimization | NVIDIA GPU | Hard | Best
Quick Start Stack

GPU server: vLLM + autoawq. Local CPU: Ollama. Custom training: auto_gptq + bitsandbytes. Edge/mobile: llama.cpp + GGUF.

Research Papers & References

Key Publications on LLM Quantization

Foundational Methods

GPTQ (2023)

Accurate Post-Training Quantization for Generative Pre-Trained Transformers
Frantar et al. (IST Austria & ETH Zurich)
One-shot 4-bit quantization using Hessian. Foundation for modern quantization.

arXiv:2210.17323

AWQ (2023)

Activation-aware Weight Quantization for LLM Compression and Acceleration
Lin et al. (MIT)
Activation-aware scaling. <1% quality loss at Q4. Better quality retention than GPTQ.

arXiv:2306.00978

Advanced Methods

SqueezeLLM (2024)

Sensitive weight identification for 3-bit quantization. Extreme compression with minimal loss. Good for mobile.

QuIP# (2024)

2-bit quantization via vector quantization. Better than naive 2-bit. For extreme edge cases.

AQLM (2024)

Additive Quantization of Language Models. Product quantization approach for better quality at low bits.

BitNet (2024)

1-bit LLMs. Extreme compression, training from scratch. Vision of ultra-efficient models.

Supporting Research

Paper | Year | Focus | Key Contribution
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference | 2018 | QAT | Foundation for fake quantization during training
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | 2022 | INT8 | Row/column-wise INT8; mixed precision for outliers
ZipLM: Inference-Aware Structured Pruning of Language Models | 2023 | Pruning + Quantization | Combines quantization with structured sparsity
OliVe: Hardware-friendly Outlier-Victim Pair Quantization | 2023 | Outlier Handling | Hardware-efficient quantization of outlier values

Benchmarks & Leaderboards

Open LLM Leaderboard

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

MTEB Benchmark

https://huggingface.co/spaces/mteb/leaderboard

Key Takeaways from Research

Trend: Lower bits possible with better methods (GPTQ→AWQ→SqueezeLLM). Quality: All modern methods preserve 90-99% quality. Future: 1-bit and 2-bit quantization becoming practical. BitNet shows potential for ultra-efficient models.

Glossary of Quantization Terms

21 key technical terms used throughout this guide, organized alphabetically.

A

AWQ (Activation-Aware Weight Quantization): A weight-only quantization method that protects salient weights by observing activation distributions rather than just weights. No backpropagation needed, less calibration data than GPTQ. Best throughput with Marlin kernels.

B

BF16 (Brain Float 16): 16-bit format with 8 exponent bits (same range as FP32) and 7 mantissa bits. Better training stability than FP16 because it handles the same numeric range. Preferred for training.

bitsandbytes: A HuggingFace-integrated library for 8-bit and 4-bit quantization. Provides load_in_8bit and load_in_4bit options for instant quantization of any model. Powers QLoRA.

C

Calibration Data: A small representative dataset (128-1024 samples) used by PTQ methods to determine optimal quantization parameters. Quality of calibration data directly impacts quantized model quality.

D

Dequantization: Converting quantized (low-precision) values back to higher precision for computation. Happens dynamically during inference for weight-only quantization methods.

F

FP16 (Half Precision): 16-bit floating point format with 5 exponent bits and 10 mantissa bits. Halves memory vs FP32 with minimal quality loss. Standard for inference; requires loss scaling for training.

G

GGUF: A file format for storing quantized models, optimized for CPU and Apple Silicon inference. Supports multiple quantization levels (Q2_K through Q8_0). Used by llama.cpp and Ollama.

GPTQ (Generative Pre-trained Transformer Quantization): A one-shot weight quantization method using approximate second-order information (Hessian) for optimal rounding. First method to achieve 4-bit LLM quantization with minimal accuracy loss.

Group Quantization: Quantizing weights in groups (e.g., 128 elements share one scale/zero-point) rather than per-tensor or per-channel. Provides finer granularity and better quality than per-tensor quantization.

I

INT4 / INT8: 4-bit and 8-bit integer representations. INT8 provides ~4× compression with 1-2% quality loss. INT4 provides ~8× compression but may need QAT or careful calibration to maintain quality.

M

Marlin Kernels: Optimized GPU kernels for quantized matrix multiplication, providing roughly 3.1× (GPTQ) and 3.2× (AWQ) speedups over FP16 baselines. The key to production quantized inference performance.

Mixed-Precision Quantization: Using different precision levels for different layers or components (e.g., INT8 for most layers, FP16 for sensitive attention layers). Balances compression and quality.

O

Outlier Channels: Weight channels with extreme values that cause disproportionate quality degradation when quantized. AWQ specifically addresses this by protecting salient weights based on activation patterns.

P

Per-Channel Quantization: Computing separate scale and zero-point values for each output channel of a weight matrix. More accurate than per-tensor but requires more storage for quantization parameters.

Post-Training Quantization (PTQ): Quantizing a trained model without retraining. Fast (minutes to hours) but may have higher quality loss than QAT. Methods: GPTQ, AWQ, Round-To-Nearest.

Q

QAT (Quantization-Aware Training): Simulating quantization during training so the model learns to compensate for precision loss. Better quality than PTQ but requires full training infrastructure and compute.

Quantization: Reducing the numerical precision of model weights and/or activations to decrease memory footprint and increase inference speed. The most impactful single optimization for production LLM deployment.

S

Scale Factor: A floating-point multiplier used to map between quantized integer values and their original floating-point range. Computed as (max - min) / (2^bits - 1).

SmoothQuant: A quantization technique that migrates quantization difficulty from activations to weights by applying mathematically equivalent per-channel scaling. Enables effective activation quantization.

W

Weight-Only Quantization: Quantizing only model weights while keeping activations in higher precision (FP16). Simpler than weight+activation quantization and preserves most quality. Used by GPTQ and AWQ.

Z

Zero-Point: An integer value representing the quantized equivalent of floating-point zero. Used in asymmetric quantization: q = round(x/scale + zero_point).
Full Reference: For a comprehensive glossary covering ALL LLM topics across all documents, see the unified LLM Glossary with 140+ terms.