1. Executive Overview
AI agents repeatedly send large, structurally similar prompts to language models: system instructions, developer policy, tool schemas, user profile, project state, retrieved documents, conversation summaries, and task plans. A prompt caching layer identifies stable prompt segments, stores them under deterministic cache keys, and reuses them across agent turns when the semantic and security conditions still hold.
2. Problem Statement
Agent prompts become expensive because each step reconstructs a large context window. The same stable blocks are often sent to the same model many times: tool specifications, repository map, coding conventions, persona, safety policy, structured output schema, and summarized long-term memory.
Common symptoms
- High median and p95 latency from repeated large context uploads.
- Unstable costs because every agent loop resends static instructions and tool schemas.
- Prompt drift caused by ad hoc assembly logic spread across services.
- Weak observability: teams cannot explain which context blocks dominated spend.
- Low cache hit rates from naive keys that include volatile timestamps, trace IDs, or message ordering noise.
3. Scope
| Layer | Examples | Cacheability |
|---|---|---|
| Prompt prefix | System prompt, developer prompt, policy text, tool schemas | High |
| Context bundle | Repo summary, product docs, user profile, memory summary | Medium to high |
| Retrieval result | Top-k documents, reranked chunks, metadata filters | Medium |
| Tool output | Search result, static API response, code analysis result | Policy dependent |
| Model response | Classification, extraction, deterministic planning output | Low to medium |
| Streaming response | Free-form conversational answer | Usually low |
The layer should sit between the agent orchestrator and model gateway. It can also expose helper APIs to retrieval, memory, and tool services so those services can produce canonical cacheable artifacts.
4. Architecture
The caching layer sits on the request path between the agent orchestrator and the model gateway. The request path is fast and decision heavy; supporting services (policy, metadata, blob storage, invalidation bus) sit off the critical path so that a slow auxiliary system cannot block a model request.
Core components
- Prompt Builder: splits requests into stable prefix blocks and volatile task blocks.
- Canonicalizer: removes non-semantic noise and serializes prompt blocks deterministically.
- Cache Router: applies eligibility rules, chooses cache tier, and emits hit or miss decisions.
- Metadata Store: tracks cache keys, owners, schema versions, TTLs, invalidation tags, and usage metrics.
- Blob Store: stores large canonical prompt artifacts, retrieval bundles, and tool results.
- Invalidation Bus: fans out tag events from source-of-truth services so dependent entries can expire precisely.
- Provider Adapter: maps internal cache decisions to provider-specific prompt caching features where available.
- Policy Service: single source of truth for cacheability, scope, PII class, and TTL ranges per entry type.
5. Cache Taxonomy
L1: In-process hot cache
Small, short-lived cache for repeated operations inside a single agent run. It is useful for tool schemas, compiled prompt templates, and recently fetched memory summaries.
L2: Distributed cache
Shared cache across agent workers, usually backed by Redis, KeyDB, or a managed low-latency store. This layer handles canonical prompt blocks, retrieval responses, tool outputs, and model-response candidates.
L3: Durable artifact store
Object storage for large context bundles and long-lived artifacts. The distributed cache stores pointers, hashes, and metadata instead of copying large payloads into every entry.
Provider-side prompt cache
Some LLM providers can cache repeated prompt prefixes internally. The platform should still maintain its own eligibility, keying, and telemetry layer because provider semantics, retention windows, and billing behavior differ.
Latency budget by tier
Tier choice is driven mostly by latency. Each rung up the stack adds a network or storage hop. Treat per-tier latency figures as order-of-magnitude bands for a co-located deployment, and use them as design budgets, not SLAs.
6. Keying Strategy
Cache keys must represent the exact semantic contract of a prompt block. Include all fields that can change the model output, and exclude fields that only exist for tracing or transport.
cache_key = sha256(
tenant_id + ":" +
security_scope + ":" +
model_family + ":" +
model_version + ":" +
tokenizer_version + ":" +
prompt_schema_version + ":" +
canonical_block_hash + ":" +
policy_version + ":" +
tool_schema_version + ":" +
retrieval_index_version
)
Canonicalization rules
- Serialize JSON with sorted keys and stable whitespace.
- Normalize line endings, Unicode form, and markdown heading spacing.
- Strip volatile values such as timestamps, request IDs, trace IDs, and ephemeral nonce values.
- Preserve tool order if order affects model behavior; otherwise sort by stable tool name and version.
- Hash large document bundles by content plus access-control scope, not by URL alone.
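A minimal TypeScript sketch of the key derivation and canonicalization rules above, assuming Node's built-in crypto module; the field names mirror the key formula, but the helper shapes are illustrative, not a fixed API.
import { createHash } from "node:crypto";

// Illustrative input shape; real blocks carry more metadata (see the data model in section 11).
// Volatile fields (timestamps, trace IDs, nonces) are assumed to be stripped before this step.
interface KeyInput {
  tenantId: string;
  securityScope: string;
  modelFamily: string;
  modelVersion: string;
  tokenizerVersion: string;
  promptSchemaVersion: string;
  policyVersion: string;
  toolSchemaVersion: string;
  retrievalIndexVersion: string;
  blockContent: unknown; // JSON-serializable prompt block
}

// Deterministic serialization: sorted keys, stable whitespace, normalized strings.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
    return `{${entries.join(",")}}`;
  }
  if (typeof value === "string") {
    // Normalize line endings and Unicode form before hashing.
    return JSON.stringify(value.replace(/\r\n/g, "\n").normalize("NFC"));
  }
  return JSON.stringify(value);
}

function cacheKey(input: KeyInput): string {
  const canonicalBlockHash = createHash("sha256")
    .update(canonicalize(input.blockContent))
    .digest("hex");
  // Same ordering as the cache_key formula above, joined with ":" and hashed once more.
  const material = [
    input.tenantId, input.securityScope, input.modelFamily, input.modelVersion,
    input.tokenizerVersion, input.promptSchemaVersion, canonicalBlockHash,
    input.policyVersion, input.toolSchemaVersion, input.retrievalIndexVersion,
  ].join(":");
  return createHash("sha256").update(material).digest("hex");
}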
7. Policies
Eligibility policy
- Allow: static system prompts, developer prompts, safety policies, tool schemas, public documentation, deterministic extraction outputs.
- Conditionally allow: retrieval results, user memory summaries, codebase maps, tool outputs with explicit freshness metadata.
- Deny: secrets, credentials, payment data, raw sensitive user content, high-risk regulated data, and final answers for open-ended tasks.
Model response caching policy
Cache final model responses only when the request is deterministic, side-effect free, and stable under a low temperature. Examples include classification, schema extraction, routing, policy evaluation, and idempotent summarization.
{
"cacheable": true,
"entry_type": "model_response",
"requires": {
"temperature_lte": 0.2,
"no_tool_side_effects": true,
"output_schema_version": "agent.plan.v4",
"pii_classification": "none"
},
"ttl_seconds": 3600
}
8. Freshness and Invalidation
Cache freshness should be expressed with TTLs and invalidation tags. Tags allow precise invalidation when a project, tool schema, prompt template, memory profile, retrieval index, or policy changes.
| Artifact | Suggested TTL | Invalidation Tags |
|---|---|---|
| Tool schema block | 7-30 days | tool:name:version, agent-runtime |
| System prompt prefix | 7-30 days | prompt-template, policy-version |
| Codebase summary | 1-24 hours | repo, branch, commit |
| Retrieval result | 5-60 minutes | index-version, doc-collection |
| User memory summary | 5-30 minutes | user, memory-version |
| Model response | 1-60 minutes | model, output-schema, task-policy |
Stale-while-revalidate
For non-critical context such as public documentation snippets or repository maps, return a slightly stale entry immediately and refresh it asynchronously. Do not use stale entries for safety policy, access control, payment, legal, medical, or security-sensitive decisions.
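A sketch of the stale-while-revalidate path under these rules; the getEntry, refreshAsync, and fetchFresh parameters are assumed helpers, not part of any defined API.
// Hypothetical entry shape: payload plus expiry and a grace window for stale reuse.
interface CachedEntry<T> {
  value: T;
  expiresAt: number;      // epoch ms
  staleGraceMs: number;   // how long past expiry a stale read is tolerable
  securitySensitive: boolean;
}

async function readWithSwr<T>(
  key: string,
  getEntry: (k: string) => Promise<CachedEntry<T> | null>,
  refreshAsync: (k: string) => void,   // fire-and-forget revalidation off the hot path
  fetchFresh: () => Promise<T>,
): Promise<T> {
  const entry = await getEntry(key);
  const now = Date.now();
  if (entry && now < entry.expiresAt) return entry.value;   // fresh hit
  if (entry && !entry.securitySensitive && now < entry.expiresAt + entry.staleGraceMs) {
    refreshAsync(key);                                      // serve stale, refresh in background
    return entry.value;
  }
  return fetchFresh();                                      // miss, or policy forbids stale reuse
}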
9. Security and Privacy
| Shareable across tenants | Tenant- or workspace-scoped | Never cache |
|---|---|---|
| Published docs | Workspace memory | Secrets and credentials |
| Generic tool schemas | Repository summaries | Raw sensitive data |
| Shared system templates | Tenant policy overlays | Untrusted content as instructions |
- Tenant isolation: namespace every key by tenant, workspace, and security scope.
- Access control: recheck authorization before returning cached context, even on a cache hit.
- Encryption: encrypt durable cache payloads at rest and use TLS for all cache transport.
- PII handling: classify payloads before caching; deny or shorten TTLs for sensitive classes.
- Prompt injection: cache retrieved content separately from trusted instructions and preserve trust labels.
- Auditability: log cache decisions, policy versions, and invalidation events without logging raw secrets.
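A sketch of the access-control recheck on a hit, assuming a hypothetical authorize call against the Policy Service; the point is that a cache hit never bypasses the authorization decision.
interface ScopedEntry {
  cacheKey: string;
  tenantId: string;
  securityScope: "public" | "tenant_private" | "workspace_private" | "user_private";
  payloadRef: string;
}

interface Caller {
  tenantId: string;
  workspaceId?: string;
  userId: string;
}

// Hypothetical policy check; in this design the call goes to the Policy Service.
declare function authorize(caller: Caller, entry: ScopedEntry): Promise<boolean>;

async function readCached(caller: Caller, entry: ScopedEntry | null): Promise<string | null> {
  if (!entry) return null;                                  // miss
  if (entry.securityScope !== "public" && entry.tenantId !== caller.tenantId) {
    return null;                                            // never serve across tenant namespaces
  }
  if (!(await authorize(caller, entry))) return null;       // recheck authorization even on a hit
  return entry.payloadRef;                                  // caller resolves the blob separately
}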
10. API Design
Prompt cache lookup
POST /v1/prompt-cache/lookup
{
"tenant_id": "t_123",
"agent_id": "research-agent",
"model": "provider/model-version",
"blocks": [
{
"type": "system_prompt",
"schema_version": "sys.v8",
"content_hash": "sha256:...",
"security_scope": "tenant_private",
"tags": ["policy:v12", "tools:v5"]
}
]
}
Response
{
"request_id": "pc_req_01",
"blocks": [
{
"cache_key": "pc_abc",
"status": "hit",
"provider_cache_hint": "prefix_reusable",
"expires_at": "2026-04-25T19:00:00Z"
}
]
}
Write-through update
PUT /v1/prompt-cache/entries/{cache_key}
{
"payload_ref": "s3://prompt-cache/tenant/hash",
"payload_hash": "sha256:...",
"entry_type": "context_bundle",
"ttl_seconds": 3600,
"tags": ["repo:abc", "commit:def", "branch:main"],
"policy": {
"pii_classification": "none",
"reuse_scope": "tenant_private"
}
}
11. Data Model
interface PromptCacheEntry {
cache_key: string;
tenant_id: string;
workspace_id?: string;
entry_type:
| "prompt_prefix"
| "context_bundle"
| "retrieval_result"
| "tool_output"
| "model_response";
model_family?: string;
model_version?: string;
tokenizer_version?: string;
content_hash: string;
payload_ref?: string;
payload_bytes?: number;
security_scope: "public" | "tenant_private" | "workspace_private" | "user_private";
pii_classification: "none" | "low" | "moderate" | "high";
tags: string[];
created_at: string;
expires_at: string;
last_accessed_at: string;
hit_count: number;
policy_version: string;
}
12. Agent Flow
- Agent runtime receives a task and loads agent policy, user context, memory, tools, and retrieval requirements.
- Prompt Builder separates stable blocks from dynamic task instructions.
- Canonicalizer computes semantic hashes for each block.
- Cache Router checks policy, access scope, TTL, and invalidation tags.
- On hit, the model request uses cached block references or provider-specific cache hints.
- On miss, the full block is sent and eligible artifacts are written back after success.
- Telemetry records latency, token usage, hit rate, avoided tokens, and correctness feedback.
Hit vs miss timeline
The savings of a cache hit come from two compounding effects: lookups are an order of magnitude faster than the provider call, and the avoided tokens shorten the model's prefill phase, which is the dominant contributor to time-to-first-token for long contexts.
Failure behavior
Cache failures must degrade to a normal uncached model request. A cache outage should raise latency and cost, not break the agent workflow. The only exception is a policy service failure; when policy cannot be evaluated, use a deny-by-default posture for cache reads involving private data.
| Failure | Behavior | Telemetry |
|---|---|---|
| L1/L2 unavailable | Skip tier; proceed to next tier or full request | cache.tier_skipped |
| Metadata stale | Treat as miss; revalidate write-back | cache.stale_meta |
| Policy service down | Deny private-scope reads; allow public reads | policy.fail_closed |
| Provider cache rejected | Resend full prefix; log hint mismatch | provider.hint_rejected |
| Write-back failure | Drop write; do not retry on hot path | cache.writeback_dropped |
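A sketch of the degradation rules in the table above, assuming hypothetical lookupTier, evaluatePolicy, and fullRequest helpers: cache and tier failures fall through to a normal uncached request, while a policy-evaluation failure fails closed for private scopes.
type Scope = "public" | "tenant_private" | "workspace_private" | "user_private";

async function resolveBlock(
  key: string,
  scope: Scope,
  lookupTier: (tier: "l1" | "l2", k: string) => Promise<string | null>,
  evaluatePolicy: (k: string) => Promise<{ cacheable: boolean }>,
  fullRequest: () => Promise<string>,
  emit: (metric: string) => void,
): Promise<string> {
  let policy: { cacheable: boolean };
  try {
    policy = await evaluatePolicy(key);
  } catch {
    emit("policy.fail_closed");
    // Deny-by-default: private scopes skip the cache entirely when policy cannot be evaluated.
    if (scope !== "public") return fullRequest();
    policy = { cacheable: true };
  }
  if (!policy.cacheable) return fullRequest();

  for (const tier of ["l1", "l2"] as const) {
    try {
      const hit = await lookupTier(tier, key);
      if (hit !== null) return hit;
    } catch {
      emit("cache.tier_skipped");          // tier outage degrades to the next tier
    }
  }
  return fullRequest();                     // miss everywhere: normal uncached request
}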
13. Observability
Metrics
- prompt_cache.hit_rate by entry type, model, tenant, and agent.
- prompt_cache.tokens_avoided and prompt_cache.cost_avoided.
- prompt_cache.lookup_latency_ms and prompt_cache.write_latency_ms.
- prompt_cache.policy_denied_total by denial reason.
- prompt_cache.stale_served_total by artifact type.
- prompt_cache.correctness_regression_rate from evaluation feedback.
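A sketch of how a request could emit these counters, assuming a hypothetical metrics client; prompt_cache.hit_rate is derived downstream from the hit and miss counts.
// Hypothetical metrics client; substitute your StatsD or OpenTelemetry wrapper.
interface Metrics {
  increment(name: string, value: number, tags?: Record<string, string>): void;
  timing(name: string, ms: number, tags?: Record<string, string>): void;
}

function recordLookup(
  metrics: Metrics,
  result: { status: "hit" | "miss"; entryType: string; tokensAvoided: number; lookupMs: number },
  dims: { model: string; tenantId: string; agentId: string },
): void {
  const tags = {
    entry_type: result.entryType,
    model: dims.model,
    tenant: dims.tenantId,
    agent: dims.agentId,
  };
  metrics.increment(`prompt_cache.${result.status}`, 1, tags);
  metrics.timing("prompt_cache.lookup_latency_ms", result.lookupMs, tags);
  if (result.status === "hit") {
    metrics.increment("prompt_cache.tokens_avoided", result.tokensAvoided, tags);
  }
}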
Logs and traces
Each model request should include trace spans for prompt assembly, canonicalization, lookup, provider request, response validation, and write-back. Logs should include hashes and metadata, not raw prompt content by default.
14. Evaluation
Prompt caching changes system behavior indirectly by altering prompt construction and freshness. Evaluation must compare cached and uncached runs for both cost and answer quality.
| Test | Purpose | Pass Criteria |
|---|---|---|
| Golden task replay | Compare cached vs uncached outputs for known agent tasks | No material quality regression |
| Freshness test | Modify source document or repo file and verify invalidation | Stale entry is not reused past policy |
| Tenant isolation test | Attempt cross-tenant key reuse | All reads denied |
| Prompt injection test | Cache malicious retrieved content | Trust labels preserved; no instruction promotion |
| Load test | Measure cache overhead under production-like traffic | Lookup p95 below target budget |
15. Cost Model: Worked Example
The economics of prompt caching depend on how much of each request is reused, the price ratio between cached and uncached input tokens, and the agent's loop depth. The example below uses a realistic coding agent profile to illustrate where the savings actually come from.
Baseline workload
- Agent makes 8 model calls per task on average (planner + tool-call loops).
- Per call: 14,000 tokens of stable context (system + tools + repo summary + memory) and 2,000 dynamic tokens (current task and turn-specific messages).
- Output tokens: 500 per call.
- Provider charges $3.00 per 1M input tokens, $0.30 per 1M cached input tokens, $15.00 per 1M output tokens.
- Tasks per day: 10,000.
Per-call cost
| Component | Tokens | Uncached cost | Cached cost |
|---|---|---|---|
| Stable context (input) | 14,000 | $0.0420 | $0.0042 |
| Dynamic input | 2,000 | $0.0060 | $0.0060 |
| Output | 500 | $0.0075 | $0.0075 |
| Total per call | 16,500 | $0.0555 | $0.0177 |
Savings at scale
At 10,000 tasks per day, the daily LLM bill drops from $4,440 to $1,416. That assumes a near-100% cache hit on the stable prefix; real deployments at this profile commonly reach 85-95% on the prefix block, scaling the savings proportionally.
Hit-rate threshold for break-even
Caching has a non-zero overhead: lookups, write-backs, storage, and bookkeeping. Let S be the per-call savings from a hit and O be the average per-call overhead added by the caching layer (read plus occasional write). The minimum hit rate required for the layer to pay for itself is:
min_hit_rate = O / S
With S = $0.0378 (the difference above) and an O of about $0.0005 (CPU + Redis + storage amortized), the layer pays for itself once the hit rate clears ~1.3%. For latency, the equivalent break-even is even lower because L2 lookups are roughly 100x faster than the model TTFT they remove.
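The same arithmetic as a small sketch; the constants are the assumptions from the worked example above, not measured figures.
// Worked-example assumptions (USD per call).
const uncachedPerCall = 0.0555;        // 14k stable + 2k dynamic input + 500 output tokens
const cachedPerCall = 0.0177;          // stable prefix billed at the cached input rate
const savingsPerHit = uncachedPerCall - cachedPerCall;    // S = $0.0378
const overheadPerCall = 0.0005;        // O: lookup, write-back, storage, bookkeeping

const minHitRate = overheadPerCall / savingsPerHit;       // ≈ 0.013, i.e. ~1.3%

const callsPerDay = 8 * 10_000;        // 8 calls per task, 10,000 tasks per day
function dailyCost(hitRate: number): number {
  return callsPerDay * (
    hitRate * (cachedPerCall + overheadPerCall) +
    (1 - hitRate) * (uncachedPerCall + overheadPerCall)
  );
}
// dailyCost(0) ≈ $4,480 including overhead; dailyCost(0.9) ≈ $1,758; dailyCost(1.0) ≈ $1,456.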
What does not save money
- Caching output tokens of free-form chat - you pay full price every time and risk reusing a stale answer.
- Caching tiny prefixes (under ~1k tokens) - the metadata round-trip costs more than the avoided prefill.
- Caching with high prefix variance (request IDs, timestamps in the system prompt) - hit rate collapses to zero.
16. Provider Integration
Most major LLM providers now expose some form of prompt prefix caching. The semantics are not identical, so the Provider Adapter must translate the platform's internal cache decisions into provider-specific hints and reconcile the telemetry that comes back. The table below summarizes the key differences as of early 2026.
| Provider feature | Trigger | Min prefix | TTL | Discount on cached input |
|---|---|---|---|---|
| Anthropic prompt caching | Explicit cache_control markers on blocks | ~1k tokens (model dependent) | 5 min default; 1 hr extended | ~90% off input |
| OpenAI automatic prompt caching | Implicit on identical prefixes | ~1k tokens | ~5-10 min idle | ~50% off input |
| Google Gemini context caching | Explicit cached-content handle | Higher minimum (~32k tokens) | Configurable (1 hr default) | Per-token storage fee + reduced input price |
| Self-hosted vLLM / TGI | Automatic radix-tree prefix cache | Any (page-aligned) | Until evicted | No billing impact; latency only |
Adapter responsibilities
- Block ordering: place reusable blocks at the very start of the prompt; provider caches are prefix-anchored.
- Marker injection: add explicit cache markers (Anthropic) or rely on identical-prefix detection (OpenAI) per provider.
- Handle lifecycle: create, refresh, and dispose Gemini cached-content handles; track storage cost.
- Telemetry reconciliation: read provider cache_read_input_tokens / cache_creation_input_tokens and write them into platform metrics.
- Hint failure handling: if a provider rejects a cache hint, fall back to a full prefix and emit provider.hint_rejected for observability.
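A sketch of the telemetry-reconciliation responsibility, assuming the usage fields that Anthropic-style responses report; the mapping into platform counters is illustrative, and other providers name these fields differently.
// Usage fields as reported by Anthropic-style responses; other providers differ.
interface ProviderUsage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

interface PlatformCounters {
  tokensAvoided: number;        // feeds prompt_cache.tokens_avoided
  cacheCreationTokens: number;  // first write of a prefix; some providers bill this at a premium
  uncachedInputTokens: number;
}

function reconcile(usage: ProviderUsage): PlatformCounters {
  const read = usage.cache_read_input_tokens ?? 0;
  const created = usage.cache_creation_input_tokens ?? 0;
  return {
    tokensAvoided: read,
    cacheCreationTokens: created,
    uncachedInputTokens: usage.input_tokens,   // non-cached input is reported separately
  };
}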
Example: Anthropic block markers
{
"model": "claude-sonnet-4-6",
"system": [
{
"type": "text",
"text": "{system_prompt}",
"cache_control": { "type": "ephemeral" }
},
{
"type": "text",
"text": "{tool_schemas}",
"cache_control": { "type": "ephemeral" }
}
],
"messages": [
{ "role": "user", "content": "{dynamic_task_input}" }
]
}
17. Rollout Plan
Roll out in phases that progressively widen the eligibility surface. Each phase should ship behind a flag, run for at least a week, and clear quality gates before the next phase is enabled. Treat phase 1 as mandatory: shadow telemetry is the only honest way to size the project before committing to serve from cache.
Phase 1: Instrumentation only
Add canonicalization, key calculation, and would-hit telemetry without serving from cache. Use this phase to identify high-value prompt blocks and estimate savings.
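A sketch of would-hit shadow telemetry for this phase, with hypothetical computeCacheKey and emit helpers: keys are computed and counted, but nothing is ever served from cache.
// Phase 1: compute keys and record would-hit counts without changing the request path.
const seenKeys = new Map<string, number>();   // in practice a distributed counter, not process memory

function recordShadowLookup(
  block: { type: string; canonicalHash: string },
  computeCacheKey: (b: { type: string; canonicalHash: string }) => string,
  emit: (metric: string, tags: Record<string, string>) => void,
): void {
  const key = computeCacheKey(block);
  const priorSightings = seenKeys.get(key) ?? 0;
  seenKeys.set(key, priorSightings + 1);
  // A repeat sighting within the window is a "would hit": the entry could have been reused.
  emit(priorSightings > 0 ? "prompt_cache.would_hit" : "prompt_cache.would_miss", {
    entry_type: block.type,
  });
}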
Phase 2: Safe prompt-prefix caching
Enable static system prompt, developer prompt, tool schema, and structured output schema caching. Keep response caching disabled.
Phase 3: Context and retrieval caching
Add TTL-bound cache entries for repository summaries, document retrieval results, and memory summaries with strict invalidation tags.
Phase 4: Selective response caching
Enable final response caching only for deterministic subtasks such as classification, routing, extraction, and policy checks.
Phase 5: Optimization loop
Tune TTLs, admission policy, eviction policy, and provider-specific cache hints based on actual hit rates, cost avoided, stale-entry incidents, and quality evaluation results.
18. Anti-patterns and Pitfalls
Most prompt-cache failures are not infrastructure failures. They show up as silent quality regressions, runaway storage growth, or cache hit rates that look great in dashboards but never translate into latency or cost wins. The patterns below are the ones that cost teams the most time to diagnose.
Caching side-effecting tool outputs. A cached send_email or create_ticket response means the agent thinks the action succeeded without re-executing it. Side-effecting calls must be excluded by entry type, not by hand-maintained allowlists.
Fix: declare side-effect class on every tool schema. Eligibility policy reads it directly; the cache router never sees the option.
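A sketch of that fix, assuming a side_effect_class field declared on each tool schema; the eligibility check reads the declaration directly and never consults a hand-maintained list.
type SideEffectClass = "none" | "read_only" | "mutating";

interface ToolSchema {
  name: string;
  version: string;
  side_effect_class: SideEffectClass;   // declared by the tool owner, enforced by policy
}

// Eligibility policy: only side-effect-free tool outputs may enter the cache.
function toolOutputCacheable(tool: ToolSchema): boolean {
  return tool.side_effect_class === "none" || tool.side_effect_class === "read_only";
}

// Example declarations; send_email can never be cached, code_search can.
const sendEmail: ToolSchema = { name: "send_email", version: "v3", side_effect_class: "mutating" };
const codeSearch: ToolSchema = { name: "code_search", version: "v1", side_effect_class: "read_only" };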