
AI Agent Prompt Caching Layer

A production architecture for reducing LLM latency and cost by caching reusable prompt prefixes, context bundles, retrieval results, tool outputs, and deterministic agent responses without weakening correctness or tenant isolation.

Version: 1.0 | Date: April 2026 | Audience: AI platform, agent runtime, and LLMOps teams

1. Executive Overview

AI agents repeatedly send large, structurally similar prompts to language models: system instructions, developer policy, tool schemas, user profile, project state, retrieved documents, conversation summaries, and task plans. A prompt caching layer identifies stable prompt segments, stores them under deterministic cache keys, and reuses them across agent turns when the semantic and security conditions still hold.

Primary goal: Lower time-to-first-token and LLM spend while preserving correctness.
Secondary goal: Make prompt construction observable, testable, and policy driven.
Non-goal: Do not blindly cache every model response. Agent state changes too often for that to be safe.
Design principle: cache prompt components aggressively, cache final responses selectively, and treat every cache hit as a policy decision rather than a storage lookup.

2. Problem Statement

Agent prompts become expensive because each step reconstructs a large context window. The same stable blocks are often sent to the same model many times: tool specifications, repository map, coding conventions, persona, safety policy, structured output schema, and summarized long-term memory.

Common symptoms

  • High median and p95 latency from repeated large context uploads.
  • Unstable costs because every agent loop resends static instructions and tool schemas.
  • Prompt drift caused by ad hoc assembly logic spread across services.
  • Weak observability: teams cannot explain which context blocks dominated spend.
  • Low cache hit rates from naive keys that include volatile timestamps, trace IDs, or message ordering noise.

3. Scope

Layer | Examples | Cacheability
Prompt prefix | System prompt, developer prompt, policy text, tool schemas | High
Context bundle | Repo summary, product docs, user profile, memory summary | Medium to high
Retrieval result | Top-k documents, reranked chunks, metadata filters | Medium
Tool output | Search result, static API response, code analysis result | Policy dependent
Model response | Classification, extraction, deterministic planning output | Low to medium
Streaming response | Free-form conversational answer | Usually low

The layer should sit between the agent orchestrator and model gateway. It can also expose helper APIs to retrieval, memory, and tool services so those services can produce canonical cacheable artifacts.

4. Architecture

The caching layer sits on the request path between the agent orchestrator and the model gateway. The request path is fast and decision heavy; supporting services (policy, metadata, blob storage, invalidation bus) sit off the critical path so that a slow auxiliary system cannot block a model request.

Request path (left to right): User / Caller (task input) → Agent Runtime (state, plan, tools) → Prompt Builder (stable + dynamic blocks) → Canonicalizer (hash + normalize) → Cache Router (policy + tier select) → L1 hot in-process / L2 shared Redis or KV / L3 blob S3 or GCS → Model Gateway (cache hints + retries) → Provider Adapter (prefix reuse hints) → LLM Provider (tokens + telemetry). The response flows back along the same path, and eligible blocks are written back to the cache tiers.
Control and storage plane (off the critical path): Policy Service (eligibility + ACL), Metadata Store (keys, tags, TTLs), Invalidation Bus (tag events), Telemetry & Audit (hit rate, tokens avoided, policy denials, quality regressions).

Core components

  • Prompt Builder: splits requests into stable prefix blocks and volatile task blocks.
  • Canonicalizer: removes non-semantic noise and serializes prompt blocks deterministically.
  • Cache Router: applies eligibility rules, chooses cache tier, and emits hit or miss decisions.
  • Metadata Store: tracks cache keys, owners, schema versions, TTLs, invalidation tags, and usage metrics.
  • Blob Store: stores large canonical prompt artifacts, retrieval bundles, and tool results.
  • Invalidation Bus: fans out tag events from source-of-truth services so dependent entries can expire precisely.
  • Provider Adapter: maps internal cache decisions to provider-specific prompt caching features where available.
  • Policy Service: single source of truth for cacheability, scope, PII class, and TTL ranges per entry type.
Why a control plane: separating policy and invalidation from the hot read path means a stale policy cannot stall a model request, and a slow metadata write cannot block returning a cache hit. Reads are eventually policy-consistent within the TTL window, and writes are best-effort with retry.
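
The boundary between these components can be captured with a few narrow interfaces. The sketch below is illustrative only: the type and method names (PromptBlock, CacheDecision, lookup, and so on) are assumptions for this document, not a prescribed API.

type SecurityScope = "public" | "tenant_private" | "workspace_private" | "user_private";

// A stable or volatile slice of the prompt, as produced by the Prompt Builder.
interface PromptBlock {
  type: "prompt_prefix" | "context_bundle" | "retrieval_result" | "tool_output";
  schemaVersion: string;
  content: string;              // canonical serialized content
  securityScope: SecurityScope;
  tags: string[];               // invalidation tags, e.g. "repo:abc"
}

// The Cache Router's answer for one block: a policy decision, not just a lookup.
interface CacheDecision {
  cacheKey: string;
  status: "hit" | "miss" | "denied";
  tier?: "l1" | "l2" | "l3" | "provider";
  expiresAt?: string;
}

interface PromptBuilder {
  // Split a request into stable prefix blocks and volatile task text.
  build(taskInput: unknown): { stable: PromptBlock[]; dynamic: string };
}

interface Canonicalizer {
  // Normalize a block and hash its semantic contract.
  canonicalize(block: PromptBlock): { canonical: string; contentHash: string };
}

interface CacheRouter {
  // Apply eligibility policy and choose a tier; never blocks on control-plane writes.
  lookup(tenantId: string, blocks: PromptBlock[]): Promise<CacheDecision[]>;
}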

5. Cache Taxonomy

L1 Hot Cache: lives inside the agent worker and catches repeated blocks within one run (fastest, smallest).
L2 Shared Cache: distributed store for reusable prompt blocks, retrieval results, and safe tool outputs (shared across workers).
L3 Artifact Store: durable object storage for large context bundles referenced by hash and metadata (large, durable).
Provider Cache: LLM provider-side prefix reuse when the same stable prefix is sent repeatedly (billing dependent).

L1: In-process hot cache

Small, short-lived cache for repeated operations inside a single agent run. It is useful for tool schemas, compiled prompt templates, and recently fetched memory summaries.

L2: Distributed cache

Shared cache across agent workers, usually backed by Redis, KeyDB, or a managed low-latency store. This layer handles canonical prompt blocks, retrieval responses, tool outputs, and model-response candidates.

L3: Durable artifact store

Object storage for large context bundles and long-lived artifacts. The distributed cache stores pointers, hashes, and metadata instead of copying large payloads into every entry.
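
Reading across the three platform tiers is a simple promotion chain: check L1, then L2, then resolve the L3 pointer, promoting the payload on the way back. This is a minimal sketch under assumed interfaces; the in-process Map, the key-value client, and the blob fetcher are placeholders rather than a specific library.

// Minimal tiered read with promotion; client types are assumed placeholders.
interface KvClient { get(key: string): Promise<string | null>; }
interface BlobStore { fetch(ref: string): Promise<string | null>; }

const l1 = new Map<string, string>();                // in-process hot cache, per worker

async function readThroughTiers(cacheKey: string, l2: KvClient, l3: BlobStore): Promise<string | null> {
  // L1: fastest, scoped to this worker and agent run.
  const hot = l1.get(cacheKey);
  if (hot !== undefined) return hot;

  // L2: shared distributed cache; the value is either a payload or an L3 pointer.
  const shared = await l2.get(cacheKey);
  if (shared !== null) {
    const payload = shared.startsWith("ref:") ? await l3.fetch(shared.slice(4)) : shared;
    if (payload !== null) l1.set(cacheKey, payload); // promote to L1 for the rest of the run
    return payload;
  }

  // Miss on every tier: the caller sends the full block and writes back after success.
  return null;
}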

Provider-side prompt cache

Some LLM providers can cache repeated prompt prefixes internally. The platform should still maintain its own eligibility, keying, and telemetry layer because provider semantics, retention windows, and billing behavior differ.

Latency budget by tier

Tier choice is driven mostly by latency. Each rung up the stack adds a network or storage hop. The numbers below are order-of-magnitude bands that can be expected on a co-located deployment; treat them as design budgets, not SLAs.

Typical lookup / serve latency (order of magnitude):
L1 Hot: ~10 µs - 0.5 ms (in-process)
L2 Shared: 0.5 - 3 ms (Redis / KV)
L3 Blob: 5 - 40 ms (object storage)
Provider cache hit: 80 - 300 ms TTFT
Uncached: 600 - 3,000 ms TTFT
Each tier should be roughly 10x faster than the next; use that ratio to decide where to promote hot artifacts.
A cache tier only earns its complexity if it removes an order of magnitude of latency from the next tier.

6. Keying Strategy

Cache keys must represent the exact semantic contract of a prompt block. Include all fields that can change the model output, and exclude fields that only exist for tracing or transport.

Tenant: tenant + workspace
Security: scope + ACL class
Model: family + version
Prompt: schema + block hash
Policy: tools + retrieval index
sha256(canonical semantic contract) = prompt_cache_key
cache_key = sha256(
  tenant_id + ":" +
  security_scope + ":" +
  model_family + ":" +
  model_version + ":" +
  tokenizer_version + ":" +
  prompt_schema_version + ":" +
  canonical_block_hash + ":" +
  policy_version + ":" +
  tool_schema_version + ":" +
  retrieval_index_version
)

Canonicalization rules

  • Serialize JSON with sorted keys and stable whitespace.
  • Normalize line endings, Unicode form, and markdown heading spacing.
  • Strip volatile values such as timestamps, request IDs, trace IDs, and ephemeral nonce values.
  • Preserve tool order if order affects model behavior; otherwise sort by stable tool name and version.
  • Hash large document bundles by content plus access-control scope, not by URL alone.
Do not share cache entries across tenants unless the entry is explicitly public, contains no private data, and has a separate public security scope.
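
A minimal sketch of the canonicalization and keying step using Node's built-in crypto module. The sorted-key JSON serializer stands in for a full canonicalizer, and the field list mirrors the contract above; the names are illustrative assumptions.

import { createHash } from "crypto";

// Serialize JSON with sorted keys so semantically identical blocks hash identically.
function canonicalJson(value: unknown): string {
  if (value === undefined) return "null";
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj).sort().map((k) => `${JSON.stringify(k)}:${canonicalJson(obj[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function sha256(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Compose the cache key from every field that can change model output.
function promptCacheKey(parts: {
  tenantId: string; securityScope: string; modelFamily: string; modelVersion: string;
  tokenizerVersion: string; promptSchemaVersion: string; canonicalBlock: unknown;
  policyVersion: string; toolSchemaVersion: string; retrievalIndexVersion: string;
}): string {
  const blockHash = sha256(canonicalJson(parts.canonicalBlock));
  return sha256([
    parts.tenantId, parts.securityScope, parts.modelFamily, parts.modelVersion,
    parts.tokenizerVersion, parts.promptSchemaVersion, blockHash,
    parts.policyVersion, parts.toolSchemaVersion, parts.retrievalIndexVersion,
  ].join(":"));
}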

7. Policies

Eligibility policy

  • Allow: static system prompts, developer prompts, safety policies, tool schemas, public documentation, deterministic extraction outputs.
  • Conditionally allow: retrieval results, user memory summaries, codebase maps, tool outputs with explicit freshness metadata.
  • Deny: secrets, credentials, payment data, raw sensitive user content, high-risk regulated data, and final answers for open-ended tasks.

Model response caching policy

Cache final model responses only when the request is deterministic, side-effect free, and stable under a low temperature. Examples include classification, schema extraction, routing, policy evaluation, and idempotent summarization.

{
  "cacheable": true,
  "entry_type": "model_response",
  "requires": {
    "temperature_lte": 0.2,
    "no_tool_side_effects": true,
    "output_schema_version": "agent.plan.v4",
    "pii_classification": "none"
  },
  "ttl_seconds": 3600
}
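
Before write-back, the policy above can be enforced with a small guard. The request shape and field names below are assumptions that mirror the policy document; a real implementation would load the policy from the Policy Service.

interface ResponseCachePolicy {
  cacheable: boolean;
  requires: {
    temperature_lte: number;
    no_tool_side_effects: boolean;
    output_schema_version: string;
    pii_classification: "none" | "low" | "moderate" | "high";
  };
  ttl_seconds: number;
}

interface CompletedCall {
  temperature: number;
  usedSideEffectingTools: boolean;
  outputSchemaVersion: string;
  piiClassification: "none" | "low" | "moderate" | "high";
}

// Returns the TTL to use if the response may be cached, or null if it must not be.
function responseCacheTtl(call: CompletedCall, policy: ResponseCachePolicy): number | null {
  if (!policy.cacheable) return null;
  const r = policy.requires;
  const eligible =
    call.temperature <= r.temperature_lte &&
    (!r.no_tool_side_effects || !call.usedSideEffectingTools) &&
    call.outputSchemaVersion === r.output_schema_version &&
    call.piiClassification === r.pii_classification;
  return eligible ? policy.ttl_seconds : null;
}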

8. Freshness and Invalidation

Cache freshness should be expressed with TTLs and invalidation tags. Tags allow precise invalidation when a project, tool schema, prompt template, memory profile, retrieval index, or policy changes.

1. Write Entry: Store payload hash, TTL, policy version, and invalidation tags.
2. Serve Hit: Return only if ACL, policy, model, and freshness checks pass.
3. Source Changes: Repo, tool schema, memory, document index, or policy emits a tag event.
4. Rebuild: Invalidate or refresh the affected artifact before future agent turns.
Artifact | Suggested TTL | Invalidation Tags
Tool schema block | 7-30 days | tool:name:version, agent-runtime
System prompt prefix | 7-30 days | prompt-template, policy-version
Codebase summary | 1-24 hours | repo, branch, commit
Retrieval result | 5-60 minutes | index-version, doc-collection
User memory summary | 5-30 minutes | user, memory-version
Model response | 1-60 minutes | model, output-schema, task-policy
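
The tag mechanism keeps invalidation off the hot path: a consumer on the Invalidation Bus expires matching entries in the metadata store, and the next agent turn simply misses. A sketch with assumed event and store shapes.

interface TagEvent { tag: string; source: string; emittedAt: string; }   // e.g. { tag: "repo:abc", ... }

interface MetadataStoreClient {
  findKeysByTag(tag: string): Promise<string[]>;
  expireNow(cacheKeys: string[]): Promise<void>;
}

// Invalidation-bus consumer; slow or failed handling never blocks a model request.
async function onTagEvent(event: TagEvent, meta: MetadataStoreClient): Promise<void> {
  const keys = await meta.findKeysByTag(event.tag);
  if (keys.length === 0) return;
  await meta.expireNow(keys);              // affected entries become misses on the next turn
  console.info("prompt_cache.invalidated", { tag: event.tag, count: keys.length, source: event.source });
}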

Stale-while-revalidate

For non-critical context such as public documentation snippets or repository maps, return a slightly stale entry immediately and refresh it asynchronously. Do not use stale entries for safety policy, access control, payment, legal, medical, or security-sensitive decisions.
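
A hedged sketch of that pattern; the entry shape, refresh callback, and staleness ceiling are illustrative design choices rather than fixed rules.

interface CachedEntry<T> { value: T; expiresAt: number; }   // expiresAt in epoch milliseconds

async function getWithStaleWhileRevalidate<T>(
  entry: CachedEntry<T> | null,
  refresh: () => Promise<CachedEntry<T>>,
  maxStaleMs = 5 * 60_000                                   // hard ceiling on served staleness
): Promise<T> {
  const now = Date.now();
  if (entry && now < entry.expiresAt) return entry.value;   // fresh hit
  if (entry && now - entry.expiresAt < maxStaleMs) {
    void refresh().catch(() => { /* background refresh failure is tolerated */ });
    return entry.value;                                     // serve stale, refresh asynchronously
  }
  return (await refresh()).value;                           // missing or too stale: block on refresh
}

Reserve this path for low-risk artifacts such as public documentation snippets and repository maps; the sensitive categories listed above always take the blocking refresh.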

9. Security and Privacy

Public Scope
  • Published docs
  • Generic tool schemas
  • Shared system templates
Tenant Private Scope
  • Workspace memory
  • Repository summaries
  • Tenant policy overlays
Never Cache
  • Secrets and credentials
  • Raw sensitive data
  • Untrusted content as instructions
  • Tenant isolation: namespace every key by tenant, workspace, and security scope.
  • Access control: recheck authorization before returning cached context, even on a cache hit.
  • Encryption: encrypt durable cache payloads at rest and use TLS for all cache transport.
  • PII handling: classify payloads before caching; deny or shorten TTLs for sensitive classes.
  • Prompt injection: cache retrieved content separately from trusted instructions and preserve trust labels.
  • Auditability: log cache decisions, policy versions, and invalidation events without logging raw secrets.
Trust boundary: cached retrieved documents remain untrusted content. Reusing them does not make them instructions.
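
The access-control rule above can be made concrete with a guard in front of every read: a hit is never, by itself, authorization to release the payload. A sketch with assumed types that mirror the scopes in the data model.

interface CacheHitRecord {
  cacheKey: string;
  tenantId: string;
  workspaceId?: string;
  userId?: string;
  securityScope: "public" | "tenant_private" | "workspace_private" | "user_private";
}

interface Caller { tenantId: string; workspaceId: string; userId: string; }

// Recheck scope on every hit; a matching cache key alone never authorizes release.
function mayServe(hit: CacheHitRecord, caller: Caller): boolean {
  switch (hit.securityScope) {
    case "public": return true;
    case "tenant_private": return hit.tenantId === caller.tenantId;
    case "workspace_private": return hit.tenantId === caller.tenantId && hit.workspaceId === caller.workspaceId;
    case "user_private": return hit.tenantId === caller.tenantId && hit.userId === caller.userId;
    default: return false;
  }
}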

10. API Design

Prompt cache lookup

POST /v1/prompt-cache/lookup
{
  "tenant_id": "t_123",
  "agent_id": "research-agent",
  "model": "provider/model-version",
  "blocks": [
    {
      "type": "system_prompt",
      "schema_version": "sys.v8",
      "content_hash": "sha256:...",
      "security_scope": "tenant_private",
      "tags": ["policy:v12", "tools:v5"]
    }
  ]
}

Response

{
  "request_id": "pc_req_01",
  "blocks": [
    {
      "cache_key": "pc_abc",
      "status": "hit",
      "provider_cache_hint": "prefix_reusable",
      "expires_at": "2026-04-25T19:00:00Z"
    }
  ]
}

Write-through update

PUT /v1/prompt-cache/entries/{cache_key}
{
  "payload_ref": "s3://prompt-cache/tenant/hash",
  "payload_hash": "sha256:...",
  "entry_type": "context_bundle",
  "ttl_seconds": 3600,
  "tags": ["repo:abc", "commit:def", "branch:main"],
  "policy": {
    "pii_classification": "none",
    "reuse_scope": "tenant_private"
  }
}

11. Data Model

interface PromptCacheEntry {
  cache_key: string;
  tenant_id: string;
  workspace_id?: string;
  entry_type:
    | "prompt_prefix"
    | "context_bundle"
    | "retrieval_result"
    | "tool_output"
    | "model_response";
  model_family?: string;
  model_version?: string;
  tokenizer_version?: string;
  content_hash: string;
  payload_ref?: string;
  payload_bytes?: number;
  security_scope: "public" | "tenant_private" | "workspace_private" | "user_private";
  pii_classification: "none" | "low" | "moderate" | "high";
  tags: string[];
  created_at: string;
  expires_at: string;
  last_accessed_at: string;
  hit_count: number;
  policy_version: string;
}

12. Agent Flow

  1. Agent runtime receives a task and loads agent policy, user context, memory, tools, and retrieval requirements.
  2. Prompt Builder separates stable blocks from dynamic task instructions.
  3. Canonicalizer computes semantic hashes for each block.
  4. Cache Router checks policy, access scope, TTL, and invalidation tags.
  5. On hit, the model request uses cached block references or provider-specific cache hints.
  6. On miss, the full block is sent and eligible artifacts are written back after success.
  7. Telemetry records latency, token usage, hit rate, avoided tokens, and correctness feedback.

Hit vs miss timeline

The savings of a cache hit come from two compounding effects: lookups are an order of magnitude faster than the provider call, and the avoided tokens shorten the model's prefill phase, which is the dominant contributor to time-to-first-token for long contexts.

Cache hit: lookup(keys[]) resolves in about 1 ms against the metadata store, policy authorizes the scope, and the request goes to the provider with a prefix-reuse hint that avoids roughly 14k tokens; the first token arrives in about 280 ms. Cache miss: the lookup misses, the full prompt of roughly 16k tokens is uploaded, the first token arrives in about 1.4 s, and eligible blocks are written back asynchronously after the call succeeds.
Hit path completes the round trip while the miss path is still uploading the prompt prefix.

Failure behavior

Cache failures must degrade to a normal uncached model request. A cache outage should raise latency and cost, not break the agent workflow. The only exception is a policy service failure; when policy cannot be evaluated, use a deny-by-default posture for cache reads involving private data.

Failure | Behavior | Telemetry
L1/L2 unavailable | Skip tier; proceed to next tier or full request | cache.tier_skipped
Metadata stale | Treat as miss; revalidate write-back | cache.stale_meta
Policy service down | Deny private-scope reads; allow public reads | policy.fail_closed
Provider cache rejected | Resend full prefix; log hint mismatch | provider.hint_rejected
Write-back failure | Drop write; do not retry on hot path | cache.writeback_dropped
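
In code, the table reduces to one rule: every cache-layer failure except a policy outage degrades to a full uncached request, and a policy outage fails closed only for private scopes. A sketch under assumed client interfaces.

type Scope = "public" | "tenant_private" | "workspace_private" | "user_private";

interface PolicyClient { authorize(scope: Scope, tenantId: string): Promise<boolean>; }
interface CacheClient { lookup(cacheKey: string): Promise<string | null>; }

async function lookupWithDegradation(
  cacheKey: string, scope: Scope, tenantId: string,
  cache: CacheClient, policy: PolicyClient, emit: (metric: string) => void
): Promise<string | null> {
  let allowed: boolean;
  try {
    allowed = await policy.authorize(scope, tenantId);
  } catch {
    emit("policy.fail_closed");
    allowed = scope === "public";      // deny-by-default for private data when policy is down
  }
  if (!allowed) return null;           // treated as a miss; the agent sends the full prompt

  try {
    return await cache.lookup(cacheKey);
  } catch {
    emit("cache.tier_skipped");        // a cache outage raises cost but never breaks the workflow
    return null;
  }
}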

13. Observability

Metrics

  • prompt_cache.hit_rate by entry type, model, tenant, and agent.
  • prompt_cache.tokens_avoided and prompt_cache.cost_avoided.
  • prompt_cache.lookup_latency_ms and prompt_cache.write_latency_ms.
  • prompt_cache.policy_denied_total by denial reason.
  • prompt_cache.stale_served_total by artifact type.
  • prompt_cache.correctness_regression_rate from evaluation feedback.

Logs and traces

Each model request should include trace spans for prompt assembly, canonicalization, lookup, provider request, response validation, and write-back. Logs should include hashes and metadata, not raw prompt content by default.

14. Evaluation

Prompt caching changes system behavior indirectly by altering prompt construction and freshness. Evaluation must compare cached and uncached runs for both cost and answer quality.

Test | Purpose | Pass Criteria
Golden task replay | Compare cached vs uncached outputs for known agent tasks | No material quality regression
Freshness test | Modify source document or repo file and verify invalidation | Stale entry is not reused past policy
Tenant isolation test | Attempt cross-tenant key reuse | All reads denied
Prompt injection test | Cache malicious retrieved content | Trust labels preserved; no instruction promotion
Load test | Measure cache overhead under production-like traffic | Lookup p95 below target budget

15. Cost Model: Worked Example

The economics of prompt caching depend on how much of each request is reused, the price ratio between cached and uncached input tokens, and the agent's loop depth. The example below uses a realistic coding agent profile to illustrate where the savings actually come from.

Baseline workload

  • Agent makes 8 model calls per task on average (planner + tool-call loops).
  • Per call: 14,000 tokens of stable context (system + tools + repo summary + memory) and 2,000 dynamic tokens (current task and turn-specific messages).
  • Output tokens: 500 per call.
  • Provider charges $3.00 per 1M input tokens, $0.30 per 1M cached input tokens, $15.00 per 1M output tokens.
  • Tasks per day: 10,000.

Per-call cost

Component | Tokens | Uncached cost | Cached cost
Stable context (input) | 14,000 | $0.0420 | $0.0042
Dynamic input | 2,000 | $0.0060 | $0.0060
Output | 500 | $0.0075 | $0.0075
Total per call | 16,500 | $0.0555 | $0.0177

Savings at scale

68% lower cost per call
$0.30 saved per task
$3,024 saved per day
~5x TTFT improvement

At 10,000 tasks per day, the daily LLM bill drops from $4,440 to $1,416. That assumes a near-100% cache hit on the stable prefix; real deployments at this profile commonly reach 85-95% on the prefix block, scaling the savings proportionally.

Hit-rate threshold for break-even

Caching has a non-zero overhead: lookups, write-backs, storage, and bookkeeping. Let S be the per-call savings from a hit and O be the average per-call overhead added by the caching layer (read + occasional write). The minimum hit rate required for the layer to pay for itself is:

min_hit_rate = O / S

With S = $0.0378 (the difference above) and an O of about $0.0005 (CPU + Redis + storage amortized), the layer pays for itself once the hit rate clears roughly 1.3%. For latency, the equivalent break-even is even lower because L2 lookups are roughly 100x faster than the model TTFT they remove.
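
The arithmetic above is easy to re-derive for a different workload. The sketch below recomputes the worked example; the prices, token counts, and overhead figure are the assumptions stated in the baseline, not provider quotes.

// Worked example from the baseline profile; adjust the constants for your workload.
const PRICE_INPUT  = 3.00 / 1e6;      // $ per uncached input token
const PRICE_CACHED = 0.30 / 1e6;      // $ per cached input token
const PRICE_OUTPUT = 15.00 / 1e6;     // $ per output token

const stableTokens = 14_000, dynamicTokens = 2_000, outputTokens = 500;
const callsPerTask = 8, tasksPerDay = 10_000;

const uncachedCall = (stableTokens + dynamicTokens) * PRICE_INPUT + outputTokens * PRICE_OUTPUT;              // $0.0555
const cachedCall   = stableTokens * PRICE_CACHED + dynamicTokens * PRICE_INPUT + outputTokens * PRICE_OUTPUT; // $0.0177

const savingsPerCall   = uncachedCall - cachedCall;                    // ~$0.0378
const savedPerDay      = savingsPerCall * callsPerTask * tasksPerDay;  // ~$3,024
const overheadPerCall  = 0.0005;                                       // assumed caching overhead per call
const breakEvenHitRate = overheadPerCall / savingsPerCall;             // ~1.3%

console.log({
  uncachedCall: uncachedCall.toFixed(4),
  cachedCall: cachedCall.toFixed(4),
  savedPerDay: Math.round(savedPerDay),
  breakEvenHitRate: `${(breakEvenHitRate * 100).toFixed(1)}%`,
});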

Where the savings actually live: the prefix dwarfs the dynamic content for most agent loops. Optimizing retrieval reranking or tool descriptions usually moves the bill more than tuning the model parameters does.

What does not save money

  • Caching output tokens of free-form chat - you pay full price every time and risk reusing a stale answer.
  • Caching tiny prefixes (under ~1k tokens) - the metadata round-trip costs more than the avoided prefill.
  • Caching with high prefix variance (request IDs, timestamps in the system prompt) - hit rate collapses to zero.

16. Provider Integration

Most major LLM providers now expose some form of prompt prefix caching. The semantics are not identical, so the Provider Adapter must translate the platform's internal cache decisions into provider-specific hints and reconcile the telemetry that comes back. The table below summarizes the key differences as of early 2026.

Provider feature | Trigger | Min prefix | TTL | Discount on cached input
Anthropic prompt caching | Explicit cache_control markers on blocks | ~1k tokens (model dependent) | 5 min default; 1 hr extended | ~90% off input
OpenAI automatic prompt caching | Implicit on identical prefixes | ~1k tokens | ~5-10 min idle | ~50% off input
Google Gemini context caching | Explicit cached-content handle | Higher minimum (~32k tokens) | Configurable (1 hr default) | Per-token storage fee + reduced input price
Self-hosted vLLM / TGI | Automatic radix-tree prefix cache | Any (page-aligned) | Until evicted | No billing impact; latency only

Adapter responsibilities

  • Block ordering: place reusable blocks at the very start of the prompt; provider caches are prefix-anchored.
  • Marker injection: add explicit cache markers (Anthropic) or rely on identical-prefix detection (OpenAI) per provider.
  • Handle lifecycle: create, refresh, and dispose Gemini cached-content handles; track storage cost.
  • Telemetry reconciliation: read provider cache_read_input_tokens / cache_creation_input_tokens and write them into platform metrics.
  • Hint failure handling: if a provider rejects a cache hint, fall back to a full prefix and emit provider.hint_rejected for observability.

Example: Anthropic block markers

{
  "model": "claude-sonnet-4-6",
  "system": [
    {
      "type": "text",
      "text": "{system_prompt}",
      "cache_control": { "type": "ephemeral" }
    },
    {
      "type": "text",
      "text": "{tool_schemas}",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "{dynamic_task_input}" }
  ]
}
Provider TTL is not your TTL. Provider caches expire on idleness independent of your invalidation tags. Treat provider hits as a latency optimization; rely on your own L2/L3 for guaranteed correctness windows.

17. Rollout Plan

Roll out in phases that progressively widen the eligibility surface. Each phase should ship behind a flag, run for at least a week, and clear quality gates before the next phase is enabled. Treat phase 1 as mandatory: shadow telemetry is the only honest way to size the project before committing to serve from cache.

Phase 1: Instrumentation only

Add canonicalization, key calculation, and would-hit telemetry without serving from cache. Use this phase to identify high-value prompt blocks and estimate savings.
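
Phase 1 can be as small as a counter keyed by the computed cache key: record what would have hit, never serve from cache. A sketch with an assumed metric name (prompt_cache.would_hit follows the naming style in section 13 but is not defined there).

// Shadow-mode instrumentation: compute keys and record the would-hit rate, never serve.
const seenKeys = new Map<string, number>();      // cache_key -> times observed this window

function recordShadowLookup(
  cacheKey: string,
  emit: (name: string, labels: Record<string, string>) => void
): void {
  const seen = seenKeys.get(cacheKey) ?? 0;
  seenKeys.set(cacheKey, seen + 1);
  emit(seen > 0 ? "prompt_cache.would_hit" : "prompt_cache.would_miss",
       { cache_key_prefix: cacheKey.slice(0, 8) });
}

A stable would-hit rate on the prefix blocks across runs is the signal that phase 2 is worth enabling.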

Phase 2: Safe prompt-prefix caching

Enable static system prompt, developer prompt, tool schema, and structured output schema caching. Keep response caching disabled.

Phase 3: Context and retrieval caching

Add TTL-bound cache entries for repository summaries, document retrieval results, and memory summaries with strict invalidation tags.

Phase 4: Selective response caching

Enable final response caching only for deterministic subtasks such as classification, routing, extraction, and policy checks.

Phase 5: Optimization loop

Tune TTLs, admission policy, eviction policy, and provider-specific cache hints based on actual hit rates, cost avoided, stale-entry incidents, and quality evaluation results.

Recommended SLO: cache lookup p95 under 25 ms for metadata-only hits, zero known cross-tenant hits, and no measurable quality regression on golden task replay.

18. Anti-patterns and Pitfalls

Most prompt-cache failures are not infrastructure failures. They show up as silent quality regressions, runaway storage growth, or cache hit rates that look great in dashboards but never translate into latency or cost wins. The patterns below are the ones that cost teams the most time to diagnose.

Hashing volatile fields into the key
Including a request ID, session ID, or timestamp in the canonical block guarantees a unique key on every call. The cache fills, the hit rate is zero, and storage cost climbs without any savings. Fix: strip non-semantic fields in the canonicalizer before hashing. Add a "would-have-hit" counter to detect drift early.

Caching tool outputs that have side effects
Reusing a cached send_email or create_ticket response means the agent thinks the action succeeded without re-executing it. Side-effecting calls must be excluded by entry type, not by hand-maintained allowlists. Fix: declare side-effect class on every tool schema. Eligibility policy reads it directly; the cache router never sees the option.

Promoting cached retrieved content to instructions
Retrieved documents are untrusted input. Caching them next to system instructions and concatenating them in the same message can let an injected directive survive across many agent turns. Fix: keep retrieval blocks in their own role/section with explicit trust labels. Never merge them into the system block.

Cross-tenant key collisions
Hashing only the content, not the security scope, means two tenants with the same prompt prefix share an entry. That is a data exfiltration channel, not a cache hit. Fix: namespace every key by tenant + workspace + scope. Public artifacts use a separate, explicit "public" scope.

Long TTLs on volatile context
A repository summary cached for 24 hours will surface stale function names and removed files. The agent will confidently call APIs that no longer exist. Fix: tag entries with the source commit/version and emit invalidation events on change. Default to short TTLs and require an explicit decision to extend.

Caching streaming free-form responses
The temptation to cache a model's natural-language answer is strong. It almost never works: the next user phrases the question differently, or the conversation history shifts the meaning of "yes". Fix: only cache structured, schema-bound responses with a stable input contract (classification, extraction, routing).

Provider-cache-only thinking
Relying solely on the LLM provider's prompt cache means giving up control over scope, TTL, invalidation, and tenancy. Provider caches also evict aggressively under idleness; a bursty workload will see far lower hit rates than the dashboards suggest. Fix: own the canonicalization and key layer in-platform. Use the provider cache as an additional accelerator, not the primary store.

Measuring hit rate without measuring quality
A cache that returns slightly stale memory summaries will look fantastic in cost dashboards while quietly degrading the agent's task success rate. Fix: pair every hit-rate dashboard with a quality metric from the golden task replay set. Alert on quality drift even when cost metrics improve.
Rule of thumb: if you cannot articulate exactly which input fields define the cached entry's contract, do not cache it yet. Add the canonicalization step first, run in shadow mode, and only enable serving once the would-hit rate and key-stability metrics look healthy.