Key Takeaways
- Prompt caching stores processed input tokens server-side for 5 minutes, reusable at 10% of the normal input cost
- A single `cache_control: {"type": "ephemeral"}` breakpoint activates caching; it's a 3-line code change
- Best ROI when the same large context block (system prompt, document corpus, conversation history) repeats across many requests
- Cache hits reduce time-to-first-token by 85%, which is critical for interactive user-facing applications
- Supported on Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5 with different minimum token thresholds
What Is Claude Prompt Caching and Why Does It Matter
When you call the Claude API, every input token you send gets processed from scratch: your system prompt, your document context, your conversation history. For a customer support agent handling 50,000 conversations per day, all sharing the same 2,000-token system prompt, you're paying to re-process those 2,000 tokens 50,000 times. That's 100 million tokens per day of redundant computation.
Prompt caching solves this. By adding a `cache_control` marker to your API request, you tell Anthropic's servers to store the processed representation of that context block. Subsequent requests that include the same prefix hit the cache: they're charged at 10% of the standard input token cost and served with 85% lower latency. The cache has a 5-minute TTL that resets on every hit, so an active session stays cached indefinitely.
The financial impact at enterprise scale is significant. An Opus application sending 10 million input tokens per day with a 70% cache hit rate goes from $150/day to roughly $56/day in input costs (3 MTok at $15 plus 7 MTok at $1.50), a reduction of about 63%, ignoring the one-off cache-write premium. Combined with the batch API for non-real-time workloads, prompt caching is the most impactful cost-engineering lever available in the Claude API today.
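The arithmetic generalises to any model and hit rate. The following is a back-of-envelope sketch (it ignores the 1.25x cache-write premium, so treat the figures as illustrative rather than billing-exact), shown here with Sonnet pricing:

```python
def daily_input_cost(mtok_per_day: float, hit_rate: float,
                     price_per_mtok: float, cache_read_mult: float = 0.10) -> float:
    """Approximate daily input cost: uncached tokens at full price,
    cached tokens at the cache-read multiplier (10% of standard)."""
    cached = mtok_per_day * hit_rate
    uncached = mtok_per_day - cached
    return uncached * price_per_mtok + cached * price_per_mtok * cache_read_mult

# Sonnet example: 10 MTok/day at $3.00/MTok
baseline = daily_input_cost(10, 0.0, 3.00)    # no caching
with_cache = daily_input_cost(10, 0.7, 3.00)  # 70% cache hit rate
```

Plugging in your own volume, hit rate, and per-MTok price gives a quick first estimate before you instrument real usage data.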
How Prompt Caching Works
Caching in the Claude API works on prefix matching. The cache stores the key-value (KV) computation for a contiguous block of tokens from the start of your prompt. For a cache hit to occur, a new request must begin with the exact same token sequence as the cached prefix, down to the character. Any change to the cached portion invalidates the cache.
This has an important architectural implication: stable content must come before variable content. Your system prompt (stable) should come before user-specific context (variable). Your document corpus (stable across a session) should come before the user's question (variable). Structure your prompts so the highest-volume static content is always at the top, and user-specific inputs always come last.
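As a toy illustration of the prefix rule (the real comparison happens on tokens server-side; plain strings stand in for token sequences here):

```python
def shares_cached_prefix(cached_prefix: str, new_prompt: str) -> bool:
    """A cache hit requires the new request to start with the exact cached prefix."""
    return new_prompt.startswith(cached_prefix)

stable = "SYSTEM PROMPT\n\nDOCUMENT CORPUS\n\n"

# Variable content appended after the stable block preserves the match ...
hit = shares_cached_prefix(stable, stable + "Q: what changed in Q3?")
# ... but any dynamic content ahead of it (a timestamp, a user name) breaks it.
miss = shares_cached_prefix(stable, "[10:42] " + stable + "Q: what changed in Q3?")
```

This is why a single timestamp or user name injected at the top of a system prompt silently disables caching for the entire request.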
Minimum Token Requirements
Not every request is worth caching. The Claude API enforces minimum token thresholds before caching takes effect, since smaller prompts wouldn't provide meaningful savings given the overhead. The current minimums are 1,024 tokens for Claude Haiku 4.5, and 2,048 tokens for Sonnet and Opus models. If your cacheable block is smaller than this threshold, you pay standard input pricing regardless of the `cache_control` marker.
| Model | Min Cache Tokens | Standard Input Cost | Cache Write Cost | Cache Read Cost |
|---|---|---|---|---|
| Claude Haiku 4.5 | 1,024 tokens | $0.80 / MTok | $1.00 / MTok | $0.08 / MTok |
| Claude Sonnet 4.6 | 2,048 tokens | $3.00 / MTok | $3.75 / MTok | $0.30 / MTok |
| Claude Opus 4.6 | 2,048 tokens | $15.00 / MTok | $18.75 / MTok | $1.50 / MTok |
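The thresholds in the table can be encoded as a small pre-flight guard. A sketch (the model identifier strings below are illustrative assumptions, not confirmed API values):

```python
# Minimum cacheable-block sizes (tokens) from the table above.
MIN_CACHE_TOKENS = {
    "claude-haiku-4-5": 1024,
    "claude-sonnet-4-6": 2048,
    "claude-opus-4-6": 2048,
}

def will_cache(model: str, block_tokens: int) -> bool:
    """Below the minimum, a cache_control marker is accepted but has no effect."""
    return block_tokens >= MIN_CACHE_TOKENS[model]
```

A 1,500-token system prompt would cache on Haiku but not on Sonnet, which is worth checking before you build projected savings into a budget.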
Implementation: The 3-Line Code Change
Implementing prompt caching requires a single additional field in your API request. Here's the before and after for a typical enterprise application, a document analysis assistant with a large system prompt:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="You are an expert financial analyst specialising in M&A due diligence. "
           "Your role is to analyse financial documents and identify material risks, "
           "revenue quality issues, working capital anomalies, and EBITDA adjustments. "
           # ... 3,000 more tokens of system prompt ...
           "Always cite specific line items and financial statement references.",
    messages=[
        {"role": "user", "content": f"Analyse this financial statement: {document_text}"}
    ]
)
# Every call re-processes the entire system prompt, which is expensive at high volume
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": "You are an expert financial analyst specialising in M&A due diligence. "
                    "Your role is to analyse financial documents and identify material risks, "
                    "revenue quality issues, working capital anomalies, and EBITDA adjustments. "
                    # ... 3,000 more tokens of system prompt ...
                    "Always cite specific line items and financial statement references.",
            "cache_control": {"type": "ephemeral"}  # ← This is the only change
        }
    ],
    messages=[
        {"role": "user", "content": f"Analyse this financial statement: {document_text}"}
    ]
)
# First call: cache write (1.25x standard cost)
# All subsequent calls within 5 minutes: cache read (0.10x standard cost)
The `cache_control: {"type": "ephemeral"}` field tells Claude to cache everything up to that breakpoint. On the first call, Anthropic writes the cache (charged at 1.25x standard input pricing). On every subsequent call with the same prefix, you pay 0.10x, a 90% reduction. The cache automatically refreshes its 5-minute TTL on every hit, so an active session stays cached as long as requests keep coming.
Advanced Caching Patterns
Pattern 1: Multi-Level Caching
The Claude API supports up to 4 cache breakpoints per request. This enables multi-level caching: a persistent system prompt, a session-level document corpus, and a conversation history, all independently cached.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": GLOBAL_SYSTEM_PROMPT,  # stable across all users: cache level 1
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        # Session document corpus: stable within a session, cache level 2
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Document corpus for this session:\n\n{session_documents}",
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        },
        {"role": "assistant", "content": "I've reviewed the documents. Ask me anything."},
        # Conversation history: grows over the session, cache level 3
        *format_conversation_history(previous_turns),
        # Current user question: always variable, never cached
        {
            "role": "user",
            "content": current_user_question
        }
    ]
)
# Result:
# - System prompt: cache hit after first call (saves ~$0.003 per call on Sonnet)
# - Document corpus: cache hit within session (saves ~$0.030-$0.150 per call)
# - Conversation history: cache hit for earlier turns
# - User question: always fresh; minimal tokens, no caching needed
Pattern 2: RAG with Cached Retrieved Context
For RAG applications where the same documents get retrieved repeatedly (common in HR chatbots, knowledge bases, and compliance tools), you can cache the retrieved document chunks. The key is to sort your retrieved chunks in a deterministic order, so the same query produces the same ordered chunk list. Then place a cache breakpoint after the document context, before the user question.
This pattern pairs especially well with the Claude HR Policy Q&A chatbot architecture: policy documents change infrequently, so your cache hit rate on the retrieved context should exceed 70% in steady-state operation.
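A minimal sketch of the deterministic-ordering step, assuming retrieved chunks arrive as hypothetical `(chunk_id, text)` pairs from your retriever:

```python
def build_cached_context(chunks: list[tuple[str, str]]) -> list[dict]:
    # A stable sort key means identical retrievals yield an identical cached
    # prefix, regardless of the order the retriever returned them in.
    ordered = sorted(chunks, key=lambda chunk: chunk[0])
    corpus = "\n\n".join(text for _, text in ordered)
    return [{
        "type": "text",
        "text": f"Retrieved policy context:\n\n{corpus}",
        "cache_control": {"type": "ephemeral"},  # breakpoint after docs, before the question
    }]
```

Without the sort, two retrievals of the same chunks in different orders would produce different prefixes and miss the cache every time.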
Pattern 3: Large Document Analysis Pipelines
If you're processing a large document (an M&A data room, a legal contract set, an annual report) with multiple analysis passes (first extracting financials, then identifying risks, then summarising), cache the document on the first pass. Every subsequent analysis call on the same document pays cache read pricing. For a 100-page contract (approximately 30,000 tokens), this reduces input costs for passes 2-N by 90%.
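One way to structure such a pipeline. This is a sketch, assuming `client` is an `anthropic.Anthropic()` instance and `contract_text` is the loaded document; the pass instructions are illustrative:

```python
PASSES = [
    "Extract all financial figures and their sources.",
    "Identify material risks and unusual clauses.",
    "Summarise the document for an executive audience.",
]

def run_passes(client, contract_text: str, model: str = "claude-sonnet-4-6") -> list[str]:
    # The document block carries the breakpoint: pass 1 writes the cache,
    # passes 2..N read it at 10% of standard input pricing.
    document_block = {
        "type": "text",
        "text": f"Document under analysis:\n\n{contract_text}",
        "cache_control": {"type": "ephemeral"},
    }
    results = []
    for instruction in PASSES:
        response = client.messages.create(
            model=model,
            max_tokens=2048,
            messages=[{
                "role": "user",
                "content": [document_block, {"type": "text", "text": instruction}],
            }],
        )
        results.append(response.content[0].text)
    return results
```

Because `document_block` is byte-identical across passes, every pass after the first should hit the cache, provided the passes run within the 5-minute TTL of each other.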
Getting 90% Cost Reduction in Production
Our Claude API integration team audits your existing implementation and identifies every cacheable block. We typically find 60-80% cost reduction opportunities in the first audit, and the work usually pays for itself in under a month.
Book a Cost Audit Call →
Measuring Cache Performance
The Claude API response includes cache performance metrics in the usage object. You should log these for every request to measure your actual cache hit rate and validate your caching architecture.
import anthropic
import logging

logger = logging.getLogger(__name__)
client = anthropic.Anthropic()

def call_claude_with_cache_monitoring(
    system_prompt: str,
    user_message: str,
    session_id: str
) -> anthropic.types.Message:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=[{"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": user_message}]
    )

    # Log cache performance
    usage = response.usage
    cache_write_tokens = getattr(usage, "cache_creation_input_tokens", 0) or 0
    cache_read_tokens = getattr(usage, "cache_read_input_tokens", 0) or 0
    uncached_tokens = usage.input_tokens
    was_cache_hit = cache_read_tokens > 0

    total_input = cache_read_tokens + uncached_tokens
    logger.info({
        "session_id": session_id,
        "cache_hit": was_cache_hit,
        "cache_write_tokens": cache_write_tokens,
        "cache_read_tokens": cache_read_tokens,
        "uncached_input_tokens": uncached_tokens,
        "output_tokens": usage.output_tokens,
        # Actual cost vs theoretical uncached cost (cache reads cost 10% of standard)
        "cost_savings_pct": (cache_read_tokens * 0.90) / total_input * 100 if total_input > 0 else 0,
    })
    return response

# Track aggregate metrics in your monitoring system.
# Alert if cache hit rate drops below 60% for applications expecting high reuse.
When Prompt Caching Won't Help
Prompt caching has real constraints. Understanding them prevents the common mistake of adding `cache_control` everywhere and wondering why your bill doesn't decrease.
Low-volume applications won't benefit from caching because the 5-minute TTL means the cache expires between requests. If your application sends one request per hour per user, the cache is never warm when the next request arrives. You're paying cache write pricing without ever paying cache read pricing, which is the opposite of the intended result.
Highly variable prompts won't cache because the prefix changes on every call. If your system prompt includes the current timestamp, the user's name, or any other dynamic content, caching never triggers. Move all dynamic content below the cache breakpoint and put only truly static content above it.
Short prompts below the minimum threshold are not cached regardless of your `cache_control` markers. If your system prompt is 800 tokens and you're using Sonnet (2,048-token minimum), add more static context to push it above the threshold, or switch to Haiku if 1,024 tokens is sufficient for your use case.
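The low-volume case has a simple break-even: a cache write adds a 25% premium on the block, and each hit saves 90%, so caching pays off once the expected number of hits per write exceeds 0.25 / 0.90 ≈ 0.28. A sketch of that check (a simplified model that ignores the latency benefit of hits):

```python
def caching_pays_off(expected_hits_per_write: float,
                     write_premium: float = 0.25,
                     read_saving: float = 0.90) -> bool:
    # Expected saving from cache hits must outweigh the one-off write premium.
    return expected_hits_per_write * read_saving > write_premium
```

In practice this means caching is worthwhile whenever a cached prefix is reused at least once every few writes within the 5-minute window.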
For a full picture of when to use streaming, batching, and caching in combination, see our Claude streaming vs. batching guide and our comprehensive Claude API pricing and cost optimisation guide.
Enterprise Considerations: Security and Isolation
Cache isolation is a common question from enterprise security teams. Each organisation's cache is isolated: a cache created by your API key cannot be read by any other API key or organisation. Within your organisation, cache entries are keyed to the exact prompt prefix, so two requests share a cache entry only when they use an identical prefix. For a shared system prompt, that sharing is by design and harmless, since every user receives the same system prompt anyway.
For applications where different users should not share cached context (for example, if you're caching customer-specific documents), ensure that customer-specific content is placed before the cache breakpoint so different customers have different cache prefixes. This is more expensive, since each customer's first request writes a separate cache entry, but it guarantees isolation.
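A sketch of that per-customer layout; `fetch_customer_documents` is a hypothetical lookup, not a real API:

```python
def build_customer_messages(customer_id: str, question: str,
                            fetch_customer_documents) -> list[dict]:
    return [{
        "role": "user",
        "content": [
            {
                # Customer-specific docs sit inside the cached prefix, so each
                # customer gets a distinct cache entry: no cross-customer sharing.
                "type": "text",
                "text": f"Context for customer {customer_id}:\n\n"
                        f"{fetch_customer_documents(customer_id)}",
                "cache_control": {"type": "ephemeral"},
            },
            # The question stays after the breakpoint: variable, never cached.
            {"type": "text", "text": question},
        ],
    }]
```

Two different customers produce two different prefixes, so neither can ever hit the other's cache entry.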
See our Claude Security & Governance service page for how we help enterprises design caching architectures that satisfy information security requirements, and our Claude AI governance framework guide for the broader policy context.
Real-World Numbers from Production Deployments
To make the business case concrete, here's what prompt caching actually delivers across three application types we've deployed.
| Application Type | Daily Requests | Cached Tokens / Req | Cache Hit Rate | Monthly Savings |
|---|---|---|---|---|
| HR Policy Chatbot (Haiku 4.5) | 2,400 | 1,800 tokens | 82% | ~$28/month |
| Legal Contract Analysis (Sonnet 4.6) | 180 | 45,000 tokens | 75% | ~$1,850/month |
| Financial Report Summarisation (Opus 4.6) | 60 | 90,000 tokens | 68% | ~$4,200/month |
The legal contract and financial analysis cases represent a specific pattern: loading a large document corpus once and running multiple analysis passes. Even with relatively modest daily volumes, the savings are substantial because the cached context is so large per request.
For our Claude API integration service, identifying and implementing prompt caching is always the first cost optimisation we apply. It typically reduces a client's monthly API bill by 40-65% within the first sprint.