LLM Quantization — Compression Techniques & Production Deployment
A comprehensive guide to quantization methods for large language models — from precision reduction through GPTQ, AWQ, and GGUF to production inference at 2-8x compression with 90-99% quality retention. Covers post-training and quantization-aware approaches and deployment patterns with vLLM, llama.cpp, and TensorRT-LLM.
What is Quantization?
Reducing Precision of Weights and Activations
Quantization reduces the precision of model weights and activations from floating-point (FP32/FP16) to lower-bit integer representations (INT8/INT4). This dramatically reduces model size and memory, enabling faster inference and lower costs while maintaining 90-99% of original quality.
Why Quantization Matters
- ✓ 2-8x memory reduction
- ✓ 2-10x inference speedup
- ✓ 50-80% cost savings
- ✓ Deploy to edge & mobile
- ✓ Enable real-time inference
Modern quantization methods (GPTQ, AWQ) preserve 90-99% of model quality while achieving 8x compression. This makes them essential for production LLM deployment where cost and latency are critical.
Quantization Fundamentals
Core Concepts & Approaches
Post-Training Quantization (PTQ)
Apply quantization after training is complete. No retraining needed. Calibrate on small dataset (100-1000 samples). Fast but may have quality loss.
Best for: Quick deployment, limited compute
Quantization-Aware Training (QAT)
Simulate quantization during training. Model learns to compensate for precision loss. Higher quality but requires full retraining.
Best for: High-quality models, custom training
Weight-Only Quantization
Quantize only weights; keep activations in FP32/FP16. Simpler to implement, minimal accuracy loss, still achieves 4-8x speedup.
Weight+Activation Quantization
Quantize both weights and activations. Maximum speedup, more quality loss. Requires careful calibration.
Quantization Types
Symmetric Quantization
Range is symmetric around zero. Formula: q = round(x / scale)
Asymmetric Quantization
Range may not be symmetric. Formula: q = round(x/scale + zero_point)
Per-Tensor Quantization
One scale factor per entire tensor. Fast but less accurate for heterogeneous values.
Per-Channel Quantization
Different scale per channel. Better quality but higher overhead. Popular for weights.
Core Formulas
Quantization: q = clamp(round(x / scale + zero_point), q_min, q_max)
Dequantization: x' = (q - zero_point) × scale
Scale: scale = (max - min) / (2^bits - 1)
Zero Point: zero_point = -round(min / scale)
Bits: 4 (INT4) or 8 (INT8). Lower bits = more compression but more quality loss. Calibration: Choose representative data. Group size: Per-group (e.g., 128) balances quality and speed. Outliers: Special handling for extreme values.
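The quantize/dequantize formulas above can be checked in a few lines of Python — a minimal per-tensor asymmetric sketch using plain floats (illustration only; production kernels pack these into real integer tensors):

```python
def quantize_tensor(values, bits=8):
    """Asymmetric quantization: q = clamp(round(x/scale + zero_point), q_min, q_max)."""
    q_min, q_max = 0, 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (q_max - q_min)            # scale = (max - min) / (2^bits - 1)
    zero_point = round(-lo / scale)                # maps FP zero onto the integer grid
    q = [max(q_min, min(q_max, round(x / scale + zero_point))) for x in values]
    return q, scale, zero_point

def dequantize_tensor(q, scale, zero_point):
    """Dequantization: x' = (q - zero_point) * scale."""
    return [(qi - zero_point) * scale for qi in q]

x = [-1.0, -0.5, 0.0, 0.7, 2.0]
q, scale, zp = quantize_tensor(x, bits=8)
x_hat = dequantize_tensor(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(x, x_hat))
# Reconstruction error is bounded by half a quantization step (scale / 2)
assert max_err <= scale / 2 + 1e-9
```

Note how the range endpoints (-1.0 and 2.0) reconstruct exactly: they sit on the ends of the integer grid, while interior values incur at most half a step of rounding error.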
GPTQ: Generative Pre-trained Transformer Quantization
Efficient 4-bit Quantization via Hessian Information
What is GPTQ?
One-shot weight quantization using approximate second-order info (Hessian). Quantizes one layer at a time with optimal rounding. Achieves 4-bit quantization with minimal accuracy loss (< 2%).
Key Advantages
- ✓ 4-bit quantization (8x compression)
- ✓ <2% accuracy loss
- ✓ Fast inference on GPUs
- ✓ No retraining needed
- ✓ Works with any model
How GPTQ Works
Step 1: Quantize the model layer by layer. Step 2: For each weight, find the optimal quantized value using approximate second-order (Hessian) information computed from calibration activations.
Step 3: Minimize the Hessian-weighted reconstruction error ||W - Q||_H^2 (equivalent to the layer output error ||WX - QX||^2). Step 4: Update the remaining unquantized weights, via the inverse Hessian, to compensate for the rounding error just introduced.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
# Configuration
quantize_config = BaseQuantizeConfig(
    bits=4,            # 4-bit quantization
    group_size=128,    # weights per quantization group
    desc_act=True,     # quantize columns in order of decreasing activation
    static_groups=False,
)
# Load model and tokenizer (model loads in full precision for calibration)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantize_config=quantize_config,
)
# Calibrate on data (list of tokenized examples)
calibration_dataset = [...]  # 100-500 samples
model.quantize(calibration_dataset)
# Inference
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
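Step 4 (error compensation) is the heart of GPTQ. A toy illustration — not the real Hessian-weighted algorithm — that quantizes a single row of weights greedily while pushing each weight's rounding error onto the next, not-yet-quantized weight:

```python
def round_to_grid(w, scale):
    """Naive round-to-nearest onto a uniform grid."""
    return round(w / scale) * scale

def quantize_row_with_compensation(row, scale=0.1):
    """Quantize weights left to right; fold each rounding residual into
    the next weight so the row total is (almost) preserved."""
    out, carry = [], 0.0
    for w in row:
        target = w + carry
        q = round_to_grid(target, scale)
        carry = target - q  # residual pushed to the next weight
        out.append(q)
    return out

row = [0.23, 0.14, 0.31, 0.08]
naive = [round_to_grid(w, 0.1) for w in row]
comp = quantize_row_with_compensation(row, 0.1)
# Compensation keeps the row sum closer to the original than naive rounding
print(abs(sum(row) - sum(comp)) <= abs(sum(row) - sum(naive)))  # prints: True
```

GPTQ does the same thing per layer but in the right metric: the residual is distributed across remaining weights using the inverse Hessian, so the compensation minimizes the layer's output error rather than the raw weight error.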
| Metric | GPTQ | AWQ | GGUF (Q4_K) |
|---|---|---|---|
| Compression | 8x | 8x | 8x |
| Quality Loss | <2% | <1% | 2-3% |
| GPU Speedup (fp16→q4) | 2.6x | 3.2x | N/A |
| Calibration Time | 1-5 hours | 10-30 min | 30-60 min |
| Best For | GPU inference | High-throughput serving | CPU/edge |
Calibration: Use 100-500 diverse samples from your domain. Group size: 128 is typical. Smaller (32) = better quality, larger (256) = faster. Hardware: Quantize on same hardware as inference (GPU type matters).
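Group size matters because each group gets its own scale, so a single outlier only coarsens its own group rather than the whole tensor. A hedged sketch of symmetric per-group quantization (toy values and tiny groups; real GPTQ/AWQ kernels use group_size=128 and pack the results as INT4):

```python
def quantize_per_group(weights, group_size=4, bits=4):
    """Symmetric per-group quantization: one scale per group of weights."""
    q_max = 2 ** (bits - 1) - 1  # e.g. 7 for INT4
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    quantized, scales = [], []
    for g in groups:
        scale = max(abs(w) for w in g) / q_max or 1.0  # guard all-zero groups
        scales.append(scale)
        quantized.append([max(-q_max - 1, min(q_max, round(w / scale))) for w in g])
    return quantized, scales

# An outlier (8.0) only coarsens its own group's scale, not the whole tensor
weights = [0.1, -0.2, 0.05, 0.15, 8.0, 0.1, -0.1, 0.2]
q, scales = quantize_per_group(weights, group_size=4)
print(scales[0] < scales[1])  # prints: True — the first group keeps a fine scale
```

With per-tensor quantization the 8.0 outlier would stretch a single shared scale and crush all the small weights to zero; per-group quantization contains the damage at the cost of storing one scale per group.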
AWQ: Activation-Aware Quantization
Protecting Salient Weights Based on Activation Distributions
What is AWQ?
Activation-aware weight quantization. Analyzes activation distributions to identify which weights are most important. Protects salient weights with higher precision, quantizes others aggressively.
Advantages over GPTQ
- ✓ <1% accuracy loss
- ✓ 10-30 min calibration
- ✓ Less calibration data needed
- ✓ 3.2x GPU speedup (better kernels)
- ✓ Best for vLLM serving
How AWQ Works
AWQ searches for weight scaling factors that minimize quantization error weighted by activation magnitudes. For weights with high activations, use finer quantization. For low-activation weights, quantize aggressively. Optimal scaling factors are computed per-channel via iterative optimization.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Quantization config
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}
# Calibration and quantization
calibration_dataset = [...]  # 32-256 samples of your data
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=calibration_dataset
)
# Save quantized model (keep the tokenizer alongside for serving)
model.save_quantized("./llama2-7b-awq")
tokenizer.save_pretrained("./llama2-7b-awq")
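To build intuition for what the scale search does, here is a heavily simplified toy: scale up a salient weight before quantizing, then fold the inverse scale into the activation side. The product is mathematically unchanged, but the rounding error on the important weight shrinks by the scale factor. (Real AWQ searches per-channel scales and must also account for the shared group scale, which this toy ignores.)

```python
def quant_dequant(x, step=0.1):
    """Round-trip through a fixed uniform quantization grid."""
    return round(x / step) * step

w, s = 0.13, 4.0          # salient weight, scaling factor
x = 2.0                   # large activation flowing through this channel
# Naive: quantize w directly; output error is |Q(w) - w| * |x|
err_plain = abs(quant_dequant(w) - w) * abs(x)
# AWQ trick: quantize w*s, fold 1/s into the activation side
err_scaled = abs(quant_dequant(w * s) / s - w) * abs(x)
print(err_scaled <= err_plain)  # prints: True
```

Intuitively, multiplying by s=4 lets the weight use four times as many grid points, so its effective rounding error drops from a half-step to an eighth of a step — exactly the channels where activations (and thus output error) are largest get this protection.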
Marlin-AWQ kernels: 741 tokens/sec (Llama-2-7B on H100), a 3.2x speedup over FP16. Best for high-throughput inference with vLLM. Combine with batching for maximum throughput.
GGUF: Format for Quantized Models
CPU & Apple Silicon Optimized Quantization Levels
What is GGUF?
GGUF (GGML Universal Format) is a file format for quantized models evolved from GGML. Supports multiple quantization levels (Q2_K to Q8_0) optimized for CPU and Apple Silicon. Used by llama.cpp and Ollama.
Key Features
- ✓ Multiple quantization levels
- ✓ CPU-optimized inference
- ✓ Apple Silicon M1-M4 support
- ✓ Edge & local deployment
- ✓ Ollama integration
GGUF Quantization Levels
| Level | Bits | Size (7B model) | Quality | Speed | Use Case |
|---|---|---|---|---|---|
| Q2_K | 2-3 | ~800MB | Poor (80-85%) | Very Fast | Extreme edge, IoT |
| Q3_K_S | 3 | ~1GB | Fair (85-90%) | Very Fast | Phone, edge |
| Q3_K_M | 3 | ~1.2GB | Good (88-92%) | Fast | Mobile edge |
| Q4_K_S | 4 | ~1.5GB | Very Good (92-95%) | Fast | Consumer laptops |
| Q4_K_M | 4 | ~1.8GB | Excellent (95-97%) | Medium | Recommended default |
| Q5_K_S | 5 | ~2.2GB | Excellent (97-98%) | Medium | Desktop |
| Q6_K | 6 | ~2.7GB | Excellent (98%+) | Slow | High quality local |
| Q8_0 | 8 | ~3.5GB | Near-lossless (99%+) | Slow | Reference |
Using GGUF with llama.cpp
# Download GGUF model (from HF or Ollama)
curl -o llama2-7b-q4_k_m.gguf \
https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/...
# Run inference with llama.cpp
./main -m llama2-7b-q4_k_m.gguf \
-p "The answer to life is" \
-n 32 \
-t 4 \
-ngl 0 # 0 = CPU only, 32+ = GPU layers
# Or use Ollama (one-liner; pulls a 4-bit build by default)
ollama run llama2:7b
Default choice: Q4_K_M balances quality (95-97%) and size. For high quality: Q5_K_S or Q6_K. For mobile: Q3_K_M or Q4_K_S. For edge IoT: Q2_K or Q3_K_S.
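The guidance above can be encoded as a small helper that picks a GGUF level from a memory budget. The sizes are the approximate 7B figures from the table above, and the function name and 1.5GB runtime headroom are illustrative assumptions — adjust both for your model and context length:

```python
# Approximate 7B file sizes (GB) per level, from the table above,
# ordered worst -> best quality
GGUF_LEVELS = [
    ("Q2_K", 0.8), ("Q3_K_M", 1.2), ("Q4_K_S", 1.5),
    ("Q4_K_M", 1.8), ("Q5_K_S", 2.2), ("Q6_K", 2.7), ("Q8_0", 3.5),
]

def pick_gguf_level(memory_budget_gb, headroom_gb=1.5):
    """Pick the highest-quality level whose file plus runtime headroom fits."""
    best = None
    for level, size_gb in GGUF_LEVELS:
        if size_gb + headroom_gb <= memory_budget_gb:
            best = level  # keep upgrading while it still fits
    return best

print(pick_gguf_level(3.5))   # laptop-class budget -> Q4_K_M
print(pick_gguf_level(16.0))  # desktop budget -> Q8_0
```

On a 3.5GB budget this lands on Q4_K_M — matching the "recommended default" row in the table — and on a roomy desktop it climbs to near-lossless Q8_0.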
QAT vs PTQ: Training & Inference
When to Quantize-Aware vs Post-Training
Quantization-Aware Training (QAT)
Simulate quantization during training. Model learns to minimize quantization noise. Higher quality but requires full retraining and more compute.
Pros: <1% quality loss, better for aggressive quantization
Cons: 5-100x more compute time
Post-Training Quantization (PTQ)
Apply quantization after training. Quick, no retraining. Calibrate on small dataset. May have 2-5% quality loss.
Pros: Fast (hours, not weeks), uses pretrained weights
Cons: Up to 5% quality loss for aggressive quantization
| Aspect | QAT | PTQ | Recommendation |
|---|---|---|---|
| Calibration Data | 10K-100K samples | 100-1000 samples | PTQ (less data) |
| Time | 1-7 days | 1-5 hours | PTQ (50-100x faster) |
| Quality Loss | <1% | 2-5% | QAT (better) |
| GPU Memory | 32-80GB | 8-16GB | PTQ (cheaper) |
| 4-bit Feasibility | Excellent | Good (8-bit safer) | QAT for aggressive bits |
| Production Use | Critical models | General deployment | PTQ standard |
PyTorch QAT Example
import torch
import torch.quantization as tq
# Prepare model for QAT (inserts fake-quantization observers)
model.train()
model.qconfig = tq.get_default_qat_qconfig('fbgemm')
tq.prepare_qat(model, inplace=True)
# Train with quantization simulation
for epoch in range(5):
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(batch["input_ids"])
        loss = criterion(outputs.logits, batch["labels"])
        loss.backward()
        optimizer.step()
# Convert to a real INT8 model (conversion expects eval mode)
model.eval()
tq.convert(model, inplace=True)
# Now weights are INT8, inference is fast
outputs = model(test_inputs)
Use PTQ if: Time is critical, model is general-purpose, acceptable 2-5% quality loss. Use QAT if: Model is mission-critical, willing to invest weeks, need <1% loss, doing aggressive 2-4 bit quantization.
Hardware Guide for Quantized Models
Choosing Quantization Methods by Platform
NVIDIA GPUs (A100/H100)
Best: GPTQ, AWQ (INT4/INT8)
Kernels: Marlin, CUTLASS
Speed: 2-10x over FP16
Use: High-throughput serving
CPU (x86)
Best: GGUF, INT8
Kernels: AVX-512, AMX (Intel)
Speed: 2-4x over FP32
Use: Local inference
Apple Silicon (M1-M4)
Best: GGUF, QAT
Kernels: Metal, ANE
Speed: 2-8x due to unified memory
Use: On-device macOS/iOS
Mobile (Snapdragon)
Best: INT4, Q3_K
Kernels: Hexagon DSP
Speed: 3-6x vs FP32
Use: On-device Android
ARM Cortex (Edge)
Best: GGUF Q2-Q4
Kernels: NEON, SVE
Speed: 2-4x over FP32
Use: IoT, edge devices
TPU (Google)
Best: INT8, INT4
Kernels: Native INT8
Speed: 4-8x advantage
Use: Google Cloud
Hardware × Quantization Compatibility Matrix
| Hardware | GPTQ | AWQ | GGUF | INT8 QAT | Recommended |
|---|---|---|---|---|---|
| A100/H100 (GPU) | ✓ Excellent | ✓ Best | ○ Slow | ✓ Good | AWQ |
| RTX 4090 (Gaming GPU) | ✓ Good | ✓ Good | ○ Slow | ✓ Fair | GPTQ |
| CPU (Intel/AMD) | ✗ Slow | ✗ Slow | ✓ Good | ✓ Good | GGUF |
| Apple Silicon | ✗ Not optimized | ✗ Not optimized | ✓ Excellent | ✓ Good | GGUF |
| Mobile (Snapdragon) | ✗ No | ✗ No | ✓ OK | ✓ Good | INT4 QAT |
| IoT/Edge (ARM) | ✗ No | ✗ No | ✓ Good | ✓ Good | GGUF |
GPU Server (NVIDIA): AWQ. CPU Local: GGUF Q4_K_M. Apple Mac/iPhone: GGUF. Mobile Android: INT4 QAT. Edge IoT: GGUF Q2-Q3.
Quantization Benchmarks
Quality, Speed & Memory Trade-offs
Perplexity: Quality Retention Across Methods
| Model | FP32 PPL | GPTQ (Q4) | AWQ (Q4) | GGUF (Q4_K) | QAT (INT8) |
|---|---|---|---|---|---|
| Llama-2-7B | 10.63 | 10.81 (+1.7%) | 10.68 (+0.5%) | 10.92 (+2.7%) | 10.65 (+0.2%) |
| Mistral-7B | 8.22 | 8.41 (+2.3%) | 8.27 (+0.6%) | 8.55 (+4%) | 8.24 (+0.2%) |
| Llama-3-8B | 9.14 | 9.35 (+2.3%) | 9.18 (+0.4%) | 9.48 (+3.7%) | 9.16 (+0.2%) |
| Key Finding: All methods stay within 4% of baseline. AWQ/QAT best (<1% loss). GGUF acceptable (2-4% loss) for edge. |||||
Throughput (tokens/sec) - Llama-2-7B on H100
| Method | Tokens/sec | vs FP16 Baseline | Batch Size | Notes |
|---|---|---|---|---|
| FP16 (baseline) | 230 | 1.0x | 32 | Full precision |
| GPTQ (Q4) | 590 | 2.6x | 64 | Standard GPTQ kernels |
| Marlin-GPTQ (Q4) | 710 | 3.1x | 128 | Optimized GPU kernels |
| AWQ (Q4) | 650 | 2.8x | 64 | Activation-aware |
| Marlin-AWQ (Q4) | 741 | 3.2x | 128 | Best production throughput |
| INT8 QAT | 520 | 2.3x | 32 | General hardware support |
| Key Finding: Marlin kernels provide the best throughput: 3.2x (AWQ) and 3.1x (GPTQ) over FP16. Best for high-throughput serving. ||||
Memory Usage & Inference Speed
| Format | Model Size (7B) | Memory Peak (Inference) | vs FP32 | CPU Inference (ms) |
|---|---|---|---|---|
| FP32 | 28GB | 32GB | 1.0x | 850 |
| FP16 | 14GB | 16GB | 0.5x | 500 |
| GPTQ INT4 | 3.5GB | 6GB | 0.19x | 120 |
| GGUF Q4_K_M | 1.8GB | 4GB | 0.13x | 200 |
| GGUF Q3_K_M | 1.2GB | 3GB | 0.09x | 280 |
Quality: All methods achieve 90-99% of baseline quality. Speed: GPU quantization 2.6-3.2x faster. Marlin kernels are best. Memory: INT4 is 8x smaller than FP32. GGUF optimal for CPU.
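Weight-memory numbers like these follow directly from parameter count × bits per weight (ignoring quantization metadata such as per-group scales, which add a few percent):

```python
def model_weight_size_gb(n_params, bits_per_weight):
    """Raw weight storage: params * bits / 8 bytes, reported in GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # 7B parameters
print(model_weight_size_gb(n, 32))  # FP32 -> 28.0 GB
print(model_weight_size_gb(n, 16))  # FP16 -> 14.0 GB
print(model_weight_size_gb(n, 4))   # INT4 -> 3.5 GB
```

These match the FP32/FP16/GPTQ-INT4 rows above; peak inference memory is higher because of the KV cache and activations.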
Quantized Model Directory
Popular Models & Where to Find Them
TheBloke Collections (HuggingFace)
| Model Name | GPTQ (4-bit) | AWQ (4-bit) | GGUF | Quality | Notes |
|---|---|---|---|---|---|
| Llama-2-7B | ✓ | ✓ | ✓ | 97-98% | Most popular, well-tested |
| Mistral-7B | ✓ | ✓ | ✓ | 96-97% | Sliding window, fast inference |
| Llama-3-8B | ✓ | ✓ | ✓ | 97%+ | Latest, strong performance |
| Qwen-2-7B | ✓ | ✓ | ✓ | 96%+ | Multilingual, strong reasoning |
| Mixtral-8x7B | ✓ | ✓ | ✓ | 98% | MoE, selective expert activation |
| Llama-3-70B | ✓ | ✓ | ✓ | 99%+ | Large, high quality |
GGUF Models on Ollama & HuggingFace
Popular GGUF Collections
- • llama.cpp - Original (on GitHub)
- • Ollama - Easiest (ollama.ai)
- • TheBloke - Comprehensive (HF)
- • xBITx - Compact models (HF)
- • mradermacher - Broad GGUF model coverage (HF)
Installation (Ollama)
ollama pull llama2:7b
ollama pull mistral
ollama pull neural-chat
ollama run llama2:7b
Done! Ready to use.
Model Quantization Quality Tiers
Local/Mobile (Q2-Q4)
Q2_K: 800MB, poor quality
Q3_K_M: 1.2GB, fair quality
Q4_K_M: 1.8GB, very good
Server/Desktop (Q4-Q8)
Q5_K_S: 2.2GB, excellent
Q6_K: 2.7GB, high quality
Q8_0: 3.5GB, near-lossless
Want to try now? ollama run llama2:7b (takes ~2 mins). GPU server? Use TheBloke GPTQ or AWQ. Local CPU? Download Q4_K_M GGUF. All models on HuggingFace.
Deployment Strategies
vLLM, llama.cpp, TensorRT-LLM & Triton
vLLM + AWQ/GPTQ
High-throughput serving on NVIDIA GPUs. Paged attention, continuous batching. Marlin kernels for INT4.
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-7B-GPTQ
llama.cpp + GGUF
CPU/edge inference. Cross-platform (macOS, Linux, Windows). Best for local & mobile deployment.
./main -m model.gguf -p "Hello" -t 4 -ngl 32
TensorRT-LLM
NVIDIA's optimized inference engine. Supports INT4, INT8, and FP8. Best latency on A100/H100.
Triton Inference Server
Multi-model server. Supports vLLM, TensorRT-LLM, llama.cpp backends. Advanced batching.
vLLM Deployment Example
from vllm import LLM, SamplingParams
# Load quantized model
llm = LLM(
model="TheBloke/Llama-2-7B-GPTQ",
tensor_parallel_size=1,
dtype="float16"
)
# Batch inference
prompts = ["Hello, how are you?", "What is AI?"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)
outputs = llm.generate(prompts, sampling_params)
# Print results (generate returns one output per prompt)
for output in outputs:
    print(f"Generated: {output.outputs[0].text}")
Deployment Matrix
| Scenario | Best Framework | Quantization | Throughput | Latency |
|---|---|---|---|---|
| Production GPU (multi-user) | vLLM | AWQ/GPTQ | 600+ tok/s | ~50ms |
| Local/CPU | llama.cpp | GGUF Q4_K | 50-100 tok/s | 50-100ms |
| Mobile/Edge | llama.cpp/Ollama | GGUF Q3 | 10-30 tok/s | 100-200ms |
| Multi-model serving | Triton + vLLM | AWQ | 800+ tok/s | ~40ms |
| Latency-critical apps | TensorRT-LLM | INT8/INT4 | Variable | ~20ms |
Choose framework: vLLM for GPU, llama.cpp for CPU. Quantization: AWQ for throughput, GPTQ for compatibility, GGUF for edge. Monitoring: Track latency, throughput, error rates. Scaling: Use tensor parallelism or multi-instance for large models.
Cost Analysis & ROI
When Quantization Pays for Itself
Memory & Cost Savings
| Format | Model Size (7B) | GPU Memory (A100) | GPU Cost/hour | Cost per 1M tokens | vs FP32 |
|---|---|---|---|---|---|
| FP32 | 28GB | 2x A100 (80GB) | $10.24 | $0.35 | 1.0x |
| FP16 | 14GB | 1x A100 | $5.12 | $0.17 | 0.5x |
| GPTQ INT4 | 3.5GB | 1x A100 (shared) | $2.56 | $0.07 | 0.2x |
| AWQ INT4 | 3.5GB | 1x A100 (shared) | $2.56 | $0.04 | 0.11x |
| GGUF (CPU) | 1.8GB | 32GB RAM + 4 vCPU | $1.20 | $0.02 | 0.06x |
Break-Even Analysis
Scenario: 10M tokens/day
FP16 cost: $1,700/month
GPTQ cost: $700/month
Savings: $1,000/month (59%)
ROI: Breaks even in 1 week
Scenario: 100M tokens/day
FP16 cost: $17,000/month
AWQ cost: $4,000/month
Savings: $13,000/month (76%)
ROI: Breaks even in 2 days
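The break-even arithmetic is simple: divide the one-time quantization effort by the daily savings. A sketch using the 10M-tokens/day scenario above; the $250 engineering/GPU cost for running GPTQ calibration is an assumed figure for illustration:

```python
def break_even_days(monthly_cost_before, monthly_cost_after, one_time_effort_cost):
    """Days until cumulative savings cover the one-time quantization effort."""
    daily_savings = (monthly_cost_before - monthly_cost_after) / 30
    return one_time_effort_cost / daily_savings

# 10M tokens/day scenario: FP16 $1,700/mo vs GPTQ $700/mo
days = break_even_days(1700, 700, 250)
print(round(days, 1))  # -> 7.5 days, i.e. roughly one week
```

That reproduces the "breaks even in 1 week" figure; at the 100M-tokens/day scale the savings are ~13x larger, so the same effort pays back in days.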
Quantization Effort vs Savings
| Method | Effort | Time | Quality Loss | Savings | Break-Even |
|---|---|---|---|---|---|
| GPTQ | Low | 1-5h | <2% | 50-70% | 1 week (10M/day) |
| AWQ | Low | 10-30m | <1% | 70-85% | 3 days (10M/day) |
| QAT | High | 1-7 days | <1% | 70-80% | 1 day (100M/day) |
| Recommendation: Use AWQ/GPTQ for most cases. Effort (hours) pays for itself within days for any reasonable production load. | |||||
Cloud GPU Pricing Reference (March 2026)
| GPU | VRAM | On-Demand $/hr | Spot $/hr | Best Quant Use |
|---|---|---|---|---|
| A100 80GB | 80GB | $2.00-3.00 | $1.00-1.80 | Run GPTQ/AWQ quantization; serve INT4 70B |
| A40 48GB | 48GB | $0.80-1.20 | $0.40-0.70 | Serve quantized 7-13B; cost-optimal inference |
| T4 16GB | 16GB | $0.35-0.76 | $0.12-0.30 | Serve quantized 3-7B; cheapest GPU option |
| CPU (32GB RAM) | N/A | $0.10-0.30 | N/A | GGUF models; edge/local deployment |
Quantized Self-Host vs API — Annual Cost (1M req/day)
| Approach | Monthly Cost | Annual Cost | Latency (P50) | vs GPT-4o API |
|---|---|---|---|---|
| GPT-4o API (500 tok/req) | $7,500 | $90,000 | 500-2000ms | Baseline |
| GPT-4o-mini API | $1,125 | $13,500 | 200-800ms | 85% cheaper |
| Llama-3.1-8B FP16 (A100) | $2,160 | $25,920 | 50-150ms | 71% cheaper |
| Llama-3.1-8B GPTQ-INT4 (A40) | $864 | $10,368 | 30-80ms | 88% cheaper |
| Llama-3.1-8B GGUF-Q4 (CPU) | $216 | $2,592 | 200-500ms | 97% cheaper |
Always (immediate ROI): Production serving (>50 req/sec), public APIs, mobile deployment. Usually (weeks payback): >1M tokens/day in API costs. Maybe: Internal tools, batch processing. Maximum savings: Quantize to INT4 + serve on A40/T4 = 88-97% cheaper than premium API, with 3-10x lower latency. Quantization effort is hours; payback is days.
Failure Modes & Mitigation
Common Pitfalls and Recovery Strategies
Outlier Channels
Some weight channels have extreme outliers. Quantizing them directly causes massive quality loss.
Fix: AWQ handles this. Or use per-group quantization. Or skip extreme outliers.
Catastrophic Quality Loss at Q2
2-bit quantization is very aggressive. May cause 10-20% quality loss or complete failure on some models.
Fix: Test empirically. Use Q3+ for safety. Monitor perplexity.
Calibration Data Mismatch
If calibration data is unrepresentative, quantization will be suboptimal for real data.
Fix: Use diverse, domain-representative data. 100-500 samples.
Hardware Mismatch
GPTQ quantized on GPU A may not work well on GPU B (different architectures, precision).
Fix: Quantize on target hardware. Test cross-hardware compatibility.
Kernel Not Available
GPTQ/AWQ require optimized kernels. Without them, inference is slow or impossible.
Fix: Use vLLM (handles kernels). Or use GGUF (universal).
Attention Score Overflow
INT8 attention can overflow during softmax (intermediate values exceed int range).
Fix: Use FP16 for attention. Or use INT4 with per-group scaling.
Before deploying: Test on target hardware. Measure perplexity & task performance. Use diverse calibration data. Monitor for OOM & overflow. Have FP16 fallback ready. Start with Q4 (safe), then try Q3 if needed.
If quality drops: Increase bits (Q4→Q5), use per-group quantization, re-calibrate. If crashes: Use universal format (GGUF), fallback to FP16, reduce batch size. If slow: Check kernels available, use optimized framework (vLLM).
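Monitoring perplexity before and after quantization is the cheapest quality gate. Given per-token negative log-likelihoods from any framework, perplexity is just exp of the mean; a relative threshold flags regressions. (The 3% default threshold here is an assumption — pick one appropriate for your domain.)

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def quantization_regressed(nlls_fp16, nlls_quant, max_rel_increase=0.03):
    """True if quantized perplexity exceeds the FP16 baseline by the threshold."""
    ppl_base, ppl_q = perplexity(nlls_fp16), perplexity(nlls_quant)
    return (ppl_q - ppl_base) / ppl_base > max_rel_increase

base = [2.36, 2.40, 2.31]   # toy per-token NLLs from the FP16 model
quant = [2.38, 2.43, 2.33]  # toy per-token NLLs from the INT4 model
print(quantization_regressed(base, quant))  # -> False: ~2.4% drift, under threshold
```

Run this on the same held-out text for both models; if it trips, apply the recovery ladder above (increase bits, switch to per-group quantization, or re-calibrate).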
Tools & Frameworks
Software Ecosystem for Quantization
auto_gptq
GPTQ quantization library. Easy API, good docs. Supports 4-bit INT4 quantization.
pip install auto-gptq
autoawq
AWQ quantization library. Faster calibration, better quality. Best for production.
pip install autoawq
llama.cpp
CPU inference for GGUF models. Cross-platform, production-ready. No GPU needed.
git clone https://github.com/ggerganov/llama.cpp
vLLM
GPU inference engine. Supports GPTQ, AWQ. Best throughput on NVIDIA.
pip install vllm
Ollama
Easy GGUF model management. One-liner setup. Great for beginners.
brew install ollama
TensorRT-LLM
NVIDIA optimized engine. Best latency. Requires CUDA expertise.
pip install tensorrt-llm
bitsandbytes
HuggingFace INT8 quantization. Easy HF integration. Good for training.
pip install bitsandbytes
Intel Neural Compressor
Intel-optimized quantization. Great for CPU inference on x86.
pip install neural-compressor
ONNX Runtime
Cross-platform quantization and inference. Hardware-agnostic.
pip install onnxruntime
Tool Comparison Matrix
| Tool | Purpose | Main Hardware | Ease of Use | Performance |
|---|---|---|---|---|
| auto_gptq | GPTQ quantization | GPU | Easy | Excellent |
| autoawq | AWQ quantization | GPU | Easy | Excellent |
| vLLM | GPU inference | NVIDIA GPU | Medium | Excellent |
| llama.cpp | CPU inference | CPU | Easy | Very Good |
| Ollama | Easy GGUF | CPU/Mac | Very Easy | Good |
| TensorRT-LLM | NVIDIA optimization | NVIDIA GPU | Hard | Best |
GPU server: vLLM + autoawq. Local CPU: Ollama. Custom training: auto_gptq + bitsandbytes. Edge/mobile: llama.cpp + GGUF.
Research Papers & References
Key Publications on LLM Quantization
Foundational Methods
GPTQ (2023)
Accurate Post-Training Quantization for Generative Pre-Trained Transformers
Frantar et al. (IST Austria & ETH Zurich)
One-shot 4-bit quantization using Hessian. Foundation for modern quantization.
arXiv:2210.17323
AWQ (2024)
Activation-aware Weight Quantization for LLM Compression and Acceleration
Lin et al. (MIT, MIT-IBM Watson AI Lab)
Activation-aware scaling. <1% quality loss at Q4. Better than GPTQ.
arXiv:2306.00978
Advanced Methods
SqueezeLLM (2024)
Sensitive weight identification for 3-bit quantization. Extreme compression with minimal loss. Good for mobile.
QuIP# (2024)
2-bit quantization via vector quantization. Better than naive 2-bit. For extreme edge cases.
AQLM (2024)
Additive Quantization of Language Models. Product quantization approach for better quality at low bits.
BitNet (2024)
1-bit LLMs. Extreme compression, training from scratch. Vision of ultra-efficient models.
Supporting Research
| Paper | Year | Focus | Key Contribution |
|---|---|---|---|
| Quantization and Training of NNs for Efficient Integer-Arithmetic Only Inference | 2018 | QAT | Foundation for fake quantization during training |
| LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | 2022 | INT8 | Row/column-wise INT8. Mixed-precision for outliers |
| ZipLM: Inference-Aware Structured Pruning of Language Models | 2023 | Pruning + Quantization | Combine quantization with structured sparsity |
| OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization | 2023 | Outlier Quantization | Hardware-friendly outlier-victim pair encoding |
Benchmarks & Leaderboards
Open LLM Leaderboard
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
MTEB Benchmark
https://huggingface.co/spaces/mteb/leaderboard
Trend: Lower bits possible with better methods (GPTQ→AWQ→SqueezeLLM). Quality: All modern methods preserve 90-99% quality. Future: 1-bit and 2-bit quantization becoming practical. BitNet shows potential for ultra-efficient models.
Glossary of Quantization Terms
21 key technical terms used throughout this guide, organized alphabetically.
A
| Term | Definition |
|---|---|
| AWQ (Activation-Aware Weight Quantization) | A weight-only quantization method that protects salient weights by observing activation distributions rather than just weights. No backpropagation needed, less calibration data than GPTQ. Best throughput with Marlin kernels. |
B
| Term | Definition |
|---|---|
| BF16 (Brain Float 16) | 16-bit format with 8 exponent bits (same range as FP32) and 7 mantissa bits. Better training stability than FP16 because it handles the same numeric range. Preferred for training. |
| Bitsandbytes | A HuggingFace-integrated library for 8-bit and 4-bit quantization. Provides load_in_8bit and load_in_4bit options for instant quantization of any model. Powers QLoRA. |
C
| Term | Definition |
|---|---|
| Calibration Data | A small representative dataset (128-1024 samples) used by PTQ methods to determine optimal quantization parameters. Quality of calibration data directly impacts quantized model quality. |
D
| Term | Definition |
|---|---|
| Dequantization | Converting quantized (low-precision) values back to higher precision for computation. Happens dynamically during inference for weight-only quantization methods. |
F
| Term | Definition |
|---|---|
| FP16 (Half Precision) | 16-bit floating point format with 5 exponent bits and 10 mantissa bits. Halves memory vs FP32 with minimal quality loss. Standard for inference; requires loss scaling for training. |
G
| Term | Definition |
|---|---|
| GGUF | A file format for storing quantized models, optimized for CPU and Apple Silicon inference. Supports multiple quantization levels (Q2_K through Q8_0). Used by llama.cpp and Ollama. |
| GPTQ (Generative Pre-trained Transformer Quantization) | A one-shot weight quantization method using approximate second-order information (Hessian) for optimal rounding. First method to achieve 4-bit LLM quantization with minimal accuracy loss. |
| Group Quantization | Quantizing weights in groups (e.g., 128 elements share one scale/zero-point) rather than per-tensor or per-channel. Provides finer granularity and better quality than per-tensor quantization. |
I
| Term | Definition |
|---|---|
| INT4 / INT8 | 4-bit and 8-bit integer representations. INT8 provides ~4× compression with 1-2% quality loss. INT4 provides ~8× compression but may need QAT or careful calibration to maintain quality. |
M
| Term | Definition |
|---|---|
| Marlin Kernels | Optimized GPU kernels for quantized matrix multiplication, delivering roughly 3× end-to-end throughput over FP16 for GPTQ and AWQ models in this guide's benchmarks. The key to production quantized inference performance. |
| Mixed-Precision Quantization | Using different precision levels for different layers or components (e.g., INT8 for most layers, FP16 for sensitive attention layers). Balances compression and quality. |
O
| Term | Definition |
|---|---|
| Outlier Channels | Weight channels with extreme values that cause disproportionate quality degradation when quantized. AWQ specifically addresses this by protecting salient weights based on activation patterns. |
P
| Term | Definition |
|---|---|
| Per-Channel Quantization | Computing separate scale and zero-point values for each output channel of a weight matrix. More accurate than per-tensor but requires more storage for quantization parameters. |
| Post-Training Quantization (PTQ) | Quantizing a trained model without retraining. Fast (minutes to hours) but may have higher quality loss than QAT. Methods: GPTQ, AWQ, Round-To-Nearest. |
Q
| Term | Definition |
|---|---|
| QAT (Quantization-Aware Training) | Simulating quantization during training so the model learns to compensate for precision loss. Better quality than PTQ but requires full training infrastructure and compute. |
| Quantization | Reducing the numerical precision of model weights and/or activations to decrease memory footprint and increase inference speed. The most impactful single optimization for production LLM deployment. |
S
| Term | Definition |
|---|---|
| Scale Factor | A floating-point multiplier used to map between quantized integer values and their original floating-point range. Computed as (max - min) / (2^bits - 1). |
| SmoothQuant | A quantization technique that migrates quantization difficulty from activations to weights by applying mathematically equivalent per-channel scaling. Enables effective activation quantization. |
W
| Term | Definition |
|---|---|
| Weight-Only Quantization | Quantizing only model weights while keeping activations in higher precision (FP16). Simpler than weight+activation quantization and preserves most quality. Used by GPTQ and AWQ. |
Z
| Term | Definition |
|---|---|
| Zero-Point | An integer value representing the quantized equivalent of floating-point zero. Used in asymmetric quantization: q = round(x/scale + zero_point). |