Three Distinct Caching Layers

When engineers talk about Claude caching strategies, they are often conflating three completely different mechanisms that operate at different layers of the stack, have different implementation requirements, and deliver different types of savings. Clarifying these upfront prevents architectural mistakes that are expensive to undo.

At a glance:

  • Prompt Cache (Server-Side) – ~90% savings. Anthropic's KV cache for prompt prefixes; saves 90% on cached input tokens and requires a stable prompt prefix structure.
  • Response Cache (Client-Side) – 100% savings on hits. Exact or semantic match caching of Claude responses; requires deterministic or near-deterministic queries.
  • Hybrid Architecture – 40–80% total savings. Combines prompt and response caching for maximum savings across diverse query types.

Layer 1: Anthropic Prompt Caching

Anthropic's prompt caching works at the server level, caching the KV (key-value) computation state of your prompt prefix. When subsequent requests share the same prefix up to a cache breakpoint, the model reuses the cached state rather than recomputing it. The result is a 90% reduction in input token cost for the cached portion: cache reads are billed at $0.30 per million tokens versus $3.00 for standard Sonnet input tokens (cache writes carry a one-time 25% premium, $3.75 per million), plus reduced latency on long contexts.

Claude prompt caching is enabled by adding cache_control breakpoints to your prompt structure. The key constraint: the cached prefix must be at least 1,024 tokens (2,048 for Haiku models). The cache TTL is 5 minutes by default, refreshed on each hit, with an extended 1-hour TTL available as an option. Requests that share the prefix within the TTL window benefit from the cache; requests arriving after TTL expiry pay the write cost once more.
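
The break-even arithmetic is worth seeing concretely. A sketch assuming Sonnet's published multipliers (base input $3.00/MTok, cache writes at 1.25x base, cache reads at 0.1x base; check current Anthropic pricing before relying on these numbers):

```python
# Cost sketch for prompt caching on Sonnet-class pricing (assumed rates).
BASE = 3.00 / 1_000_000   # $ per input token, uncached
WRITE = BASE * 1.25       # cache write: one-time 25% premium
READ = BASE * 0.10        # cache read: 90% discount

def cost_with_cache(prefix_tokens: int, requests: int) -> float:
    """One cache write, then (requests - 1) cache reads within the TTL."""
    return prefix_tokens * WRITE + prefix_tokens * READ * (requests - 1)

def cost_without_cache(prefix_tokens: int, requests: int) -> float:
    return prefix_tokens * BASE * requests

# A 20,000-token document queried 10 times within the TTL window:
print(f"cached:   ${cost_with_cache(20_000, 10):.3f}")    # $0.129
print(f"uncached: ${cost_without_cache(20_000, 10):.3f}") # $0.600
```

Caching pays for itself from the second request: the 25% write premium is recovered as soon as one read replaces a full-price pass, and savings approach 90% as the request count grows.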

import anthropic

client = anthropic.Anthropic()

# Example: Document Q&A with prompt caching
# The system + document content is cached; only the question changes

SYSTEM_PROMPT = """You are a financial document analysis assistant.
Extract specific information accurately and cite exact passages.
Always indicate confidence level and flag any ambiguities."""  # ~50 tokens

LARGE_DOCUMENT = """[10-K Filing - Company XYZ - FY 2025]
...[rest of 50-page document]..."""  # ~20,000 tokens; this gets cached

def ask_about_document(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}  # Cache this block
            }
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Document:\n{LARGE_DOCUMENT}",
                        "cache_control": {"type": "ephemeral"}  # Cache the doc
                    },
                    {
                        "type": "text",
                        "text": f"\nQuestion: {question}"  # This changes each request
                    }
                ]
            }
        ]
    )

    # Check cache performance
    usage = response.usage
    cache_read = getattr(usage, 'cache_read_input_tokens', 0)
    cache_write = getattr(usage, 'cache_creation_input_tokens', 0)
    regular_input = usage.input_tokens

    if cache_read > 0:
        savings_pct = round(cache_read / (cache_read + regular_input) * 100)
        print(f"Cache hit: {cache_read:,} tokens read from cache ({savings_pct}% savings)")
    elif cache_write > 0:
        print(f"Cache write: {cache_write:,} tokens written to cache (first request)")

    return response.content[0].text

This pattern is ideal for document analysis, where the same document is queried multiple times in a session; knowledge base assistants, where a large system prompt with extensive instructions is reused across thousands of requests; and few-shot prompts where you have 20+ examples that are stable across requests.

Optimising Prompt Cache Hit Rate

Cache hit rate is the key metric for prompt caching ROI. If your cache hit rate is below 30%, you are paying write costs without capturing proportional savings. The common causes of low hit rate are: prompts that change too frequently (if even one token before the breakpoint changes, it is a cache miss); cache TTL expiry (for Anthropic's default 5-minute TTL, requests spaced more than 5 minutes apart with no intervening hits will expire the cache); and insufficient prefix length (below 1,024 tokens, caching is not eligible).

To maximise cache hit rate: structure your prompt so the stable portion (system prompt, large document, few-shot examples) comes first, and the variable portion (user query) comes last. Ensure the stable prefix is at least 2,000–5,000 tokens; larger prefixes deliver proportionally larger savings. For high-traffic applications, warming the cache with a dummy request during startup avoids the cold-start penalty for the first real user.
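
Cache warming can be sketched as a single minimal request at startup, assuming an anthropic.Anthropic() client and the same system/document blocks (carrying cache_control breakpoints) that real traffic uses; the names here are illustrative:

```python
# Warm-up sketch: write the stable prefix into Anthropic's cache before
# the first real user arrives. client is an anthropic.Anthropic() instance.
def warm_prompt_cache(client, system_blocks, prefix_blocks):
    """Send one minimal request so the stable prefix gets cached."""
    client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1,  # we only want the cache write, not a real answer
        system=system_blocks,
        messages=[{
            "role": "user",
            "content": prefix_blocks + [{"type": "text", "text": "ping"}],
        }],
    )

# At application startup (fire-and-forget so boot is not blocked):
# import threading
# threading.Thread(target=warm_prompt_cache,
#                  args=(client, SYSTEM_BLOCKS, PREFIX_BLOCKS),
#                  daemon=True).start()
```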

Our detailed guide on Claude prompt caching implementation covers cache warming, multi-breakpoint strategies, and cache monitoring in depth.

Layer 2: Client-Side Response Caching

Response caching works entirely at the application layer: you store Claude's response for a given input and serve it from your cache on subsequent identical (or similar) requests, never hitting the Claude API at all. The saving is 100% of API cost for cache hits. The challenge is that this only works when responses are deterministic or when the same question genuinely warrants the same answer.

Exact Match Response Caching

The simplest response cache uses the full prompt as a cache key. Hash the system prompt, conversation history, and user message together, and store the response against that hash. On subsequent requests with an identical hash, return the cached response immediately. This works well for internal tools with templated queries, documentation Q&A where the same questions recur frequently, and classification or extraction tasks where the input is deterministic.

import hashlib
import json
import redis
from typing import Optional

class ClaudeResponseCache:
    def __init__(self, redis_client: redis.Redis, ttl_seconds: int = 3600):
        self.cache = redis_client
        self.ttl = ttl_seconds

    def _make_key(self, model: str, system: str, messages: list) -> str:
        """Create a deterministic cache key from request parameters."""
        payload = json.dumps({
            "model": model,
            "system": system,
            "messages": messages
        }, sort_keys=True)
        return f"claude:resp:{hashlib.sha256(payload.encode()).hexdigest()}"

    def get(self, model: str, system: str, messages: list) -> Optional[str]:
        key = self._make_key(model, system, messages)
        cached = self.cache.get(key)
        if cached:
            return cached.decode()
        return None

    def set(self, model: str, system: str, messages: list, response: str) -> None:
        key = self._make_key(model, system, messages)
        self.cache.setex(key, self.ttl, response)

# Usage wrapper
response_cache = ClaudeResponseCache(redis.Redis())

def cached_claude_request(system: str, user_message: str, model: str = "claude-sonnet-4-6"):
    messages = [{"role": "user", "content": user_message}]

    # Check cache first
    cached = response_cache.get(model, system, messages)
    if cached:
        return cached, True  # (response, from_cache)

    # Cache miss: call the API
    api_response = client.messages.create(
        model=model, max_tokens=1024, system=system, messages=messages
    )
    response_text = api_response.content[0].text

    # Store in cache
    response_cache.set(model, system, messages, response_text)
    return response_text, False

Semantic Response Caching

Exact match caching misses questions that are semantically identical but phrased differently. "What is the refund policy?" and "Can I get a refund?" warrant the same answer, but their hashes differ. Semantic caching embeds the user query, searches for similar cached queries above a similarity threshold, and returns the cached response if a match is found.

The implementation requires a vector store (Pinecone, Weaviate, Qdrant, or pgvector), an embedding model (Anthropic recommends Voyage embeddings; any dedicated embedding model works), and a similarity threshold tuned to your use case (typically 0.92–0.97 cosine similarity). Semantic cache hit rates of 20–40% are achievable in customer service, FAQ, and internal knowledge base applications, eliminating 20–40% of API calls entirely.

The trade-off: semantic caching introduces latency for the embedding lookup, and there is a staleness risk if the cached answer is semantically correct but outdated. For time-sensitive information, use short TTLs (1–4 hours). For policy documents and stable knowledge bases, longer TTLs (24–72 hours) are appropriate. See our Claude embeddings guide for embedding infrastructure details.
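
The mechanism can be sketched without committing to a particular vector store. Here embed_fn is an injected stand-in for a real embedding model, and the linear scan stands in for a vector index; in production you would swap both for the services above:

```python
import math
from typing import Callable, Optional

class SemanticResponseCache:
    """In-memory sketch of a semantic cache (illustrative, not production)."""

    def __init__(self, embed_fn: Callable[[str], list[float]], threshold: float = 0.95):
        self.embed = embed_fn          # maps a query string to a vector
        self.threshold = threshold     # minimum cosine similarity for a hit
        self.entries: list[tuple[list[float], str]] = []  # (embedding, response)

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query: str) -> Optional[str]:
        """Return the cached response of the most similar query, if any."""
        q = self.embed(query)
        best_score, best_resp = 0.0, None
        for emb, resp in self.entries:
            score = self._cosine(q, emb)
            if score > best_score:
                best_score, best_resp = score, resp
        return best_resp if best_score >= self.threshold else None

    def set(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))
```

The threshold is the tuning knob: too high and near-identical paraphrases miss; too low and distinct questions collide on the same answer.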

Need a caching architecture review for your Claude deployment?

Our Claude API integration team has designed caching systems that reduce API spend by up to 80% while maintaining response quality. Book a free 30-minute architecture review.

Book a Free Architecture Review →

Layer 3: Hybrid Caching Architecture

The highest-performing Claude caching strategies combine both caching layers into a unified hybrid architecture. The request flow works as follows: first check the semantic response cache (if an exact or near-exact match exists, return immediately at zero API cost); if no cache hit, build the API request with prompt caching enabled (the large stable prefix is cached at Anthropic's layer, reducing its input token cost by 85–90%); after receiving the response, store it in the response cache for future identical queries.

This architecture delivers compounding savings. A semantic cache hit rate of 30% eliminates 30% of API calls entirely. Of the remaining 70% that reach the API, prompt caching reduces input token cost by 85% on average. The net result: effective API cost per request is roughly 20% of the naive, uncached baseline, an 80% reduction.
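
The request flow can be sketched end to end. The response_cache is assumed to expose get/set (exact match or semantic), and system_blocks/doc_block are assumed to carry cache_control breakpoints as in the earlier example; names are illustrative:

```python
# Hybrid flow sketch: response cache first, then a prompt-cached API call.
def hybrid_request(client, response_cache, system_blocks, doc_block, question):
    # Layer 1: a response cache hit costs nothing and never touches the API.
    cached = response_cache.get(question)
    if cached is not None:
        return cached, True  # (response, from_cache)

    # Layer 2: cache miss, so call the API; the stable prefix is served
    # from Anthropic's prompt cache via its cache_control breakpoints.
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_blocks,
        messages=[{"role": "user", "content": [
            doc_block,
            {"type": "text", "text": f"\nQuestion: {question}"},
        ]}],
    )
    text = resp.content[0].text

    # Layer 3: store for future identical or similar queries.
    response_cache.set(question, text)
    return text, False
```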

Cache Invalidation Strategy

Cache invalidation is the hard part of any caching system, and Claude applications have specific invalidation requirements. Prompt caches invalidate automatically via TTL; Anthropic handles this. Response caches need explicit invalidation when the underlying knowledge changes: when your FAQ document is updated, when your product pricing changes, when a policy document is revised. Build cache invalidation into your content management pipeline, not as an afterthought.

Use versioned cache keys that include a content hash of the underlying knowledge base. When the knowledge base is updated, the content hash changes, all existing cache entries become unreachable, and new entries are built from the updated content. This approach avoids stale cache hits without requiring explicit invalidation logic for individual cache entries.
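A minimal sketch of the versioning scheme (the claude:resp: prefix mirrors the exact-match cache earlier; the 12-character hash truncation and query_hash parameter are arbitrary illustrative choices):

```python
import hashlib

def content_version(kb_text: str) -> str:
    """Short fingerprint of the knowledge-base content."""
    return hashlib.sha256(kb_text.encode()).hexdigest()[:12]

def versioned_key(kb_text: str, query_hash: str) -> str:
    """Fold the content version into every cache key."""
    return f"claude:resp:{content_version(kb_text)}:{query_hash}"

# Same query, updated knowledge base -> different key, so the stale
# entry is never read again and simply expires via its TTL.
v1 = versioned_key("Refunds within 30 days.", "abc123")
v2 = versioned_key("Refunds within 60 days.", "abc123")
assert v1 != v2
```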

When Not to Cache Claude Responses

Response caching is inappropriate for use cases where the response must reflect real-time data (current stock prices, live system status, recent news); where responses are personalised to individual users in ways that should not be shared; where the same query genuinely warrants different responses over time (advice that evolves with context); and where compliance requirements mandate fresh generation rather than cached responses.

Prompt caching alone (without response caching) is appropriate for essentially all production workloads with a stable prefix. There are no correctness risks, no staleness risks, and no compliance concerns โ€” you are still generating a fresh response, just with faster and cheaper prefix processing. If you are not using prompt caching on any application with a large stable system prompt, you are leaving money on the table.

Monitoring Your Claude Caching Strategy

Three metrics define caching health: prompt cache hit rate (from the API response usage data), response cache hit rate (from your application layer), and effective cost per request (accounting for both cache types). Track these over time: a declining prompt cache hit rate suggests your prompt structure changed and broke the cached prefix; a declining response cache hit rate suggests your query distribution shifted or your TTL is too short.
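
As a sketch, assuming each request is logged with its Anthropic usage fields, its dollar cost, and a flag for response-cache hits (the record field names are illustrative):

```python
# Compute the three caching-health metrics from per-request log records.
def caching_metrics(records: list[dict]) -> dict:
    resp_hits = sum(1 for r in records if r["from_response_cache"])
    api_calls = [r for r in records if not r["from_response_cache"]]
    prompt_hits = sum(
        1 for r in api_calls if r.get("cache_read_input_tokens", 0) > 0
    )
    total_cost = sum(r.get("cost_usd", 0.0) for r in records)
    return {
        # Share of all requests answered without touching the API.
        "response_cache_hit_rate": resp_hits / len(records) if records else 0.0,
        # Share of actual API calls that read from Anthropic's prompt cache.
        "prompt_cache_hit_rate": prompt_hits / len(api_calls) if api_calls else 0.0,
        # Blended cost including the free response-cache hits.
        "effective_cost_per_request": total_cost / len(records) if records else 0.0,
    }
```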

Set up dashboards in your monitoring system that show these metrics by feature and environment. For a baseline, expect: prompt cache hit rate of 40–80% for chat and RAG applications with large system prompts; response cache hit rate of 15–35% for FAQ and knowledge base applications; and near-zero cache hit rates for creative generation and personalised content, where caching is not appropriate. See our Claude monitoring guide for the complete metrics architecture.

If your caching performance is below these baselines, the most common causes are: prompt structure that changes too frequently before the breakpoint (preventing prompt cache hits); response cache TTL that is shorter than the query recurrence interval; semantic similarity threshold set too high (missing near-identical queries); and missing cache warming for the first request after a deployment or cache cold start.

Related reading: See our guides on Claude cost optimisation at scale for the complete picture of cost reduction strategies, Claude token management for context window optimisation, and Claude prompt caching for the detailed implementation reference.

Key Takeaways
  • There are three distinct caching layers: Anthropic's server-side prompt cache, client-side response cache, and hybrid architectures.
  • Prompt caching requires a stable prefix of 1,024+ tokens and delivers 90% savings on the cached portion.
  • Semantic response caching can eliminate 20–40% of API calls in FAQ and knowledge base applications.
  • Hybrid architectures combining both layers deliver 75–85% total cost reduction versus uncached baselines.
  • Monitor prompt cache hit rate, response cache hit rate, and effective cost per request as first-class metrics.

Claude Implementation Team

Claude Certified Architects with deep expertise in API architecture and cost optimisation.