📋 In This Article

How prompt caching works · Cache pricing mechanics · Cache breakpoints · Minimum token requirements · Use case patterns · Multi-turn conversation caching · Measuring cache hit rates · Combining with batch API · Common mistakes

The Hidden Cost Driver in Most Claude API Deployments

Here is a cost pattern we see in almost every Claude API integration we audit: the team has a well-structured system prompt – detailed role definition, output format instructions, domain-specific guidelines, perhaps some few-shot examples – that runs to 2,000–10,000 tokens. Every single API call processes that system prompt from scratch. At $3 per million input tokens for Claude Sonnet, a 5,000-token system prompt across 100,000 monthly requests costs $1,500/month in system prompt processing alone. None of that cost produces new value after the first request.

Prompt caching eliminates this waste. Rather than reprocessing the same content on every call, Anthropic stores the intermediate computation – the KV cache – for the static portion of your prompt. Subsequent requests that share that prefix pay roughly 10% of the normal input token rate. The 90% reduction on cached tokens is not a theoretical maximum; it is the standard rate for cache reads in the Claude API as of early 2026.

90% Cost reduction on cached token reads vs. standard input processing
5 min Cache TTL – how long a cached prompt prefix stays active
1,024 Minimum tokens required to create a cache breakpoint (Claude Sonnet/Haiku)

How Claude Prompt Caching Works

Prompt caching is based on a concept called prefix caching. When you process a prompt through the Transformer architecture that underlies Claude, each token's key-value vectors are computed based on all preceding tokens. For a prompt that begins with a 4,000-token system prompt followed by a 500-token user query, the computation for the 4,000-token prefix is the same on every call – the system prompt has not changed. Prompt caching stores that prefix computation and reuses it.

Building on Claude? Get Architecture Guidance.

From prompt caching to multi-agent orchestration – our engineers have built production Claude integrations across every major stack. Talk to us before you hit the hard problems.

Talk to a Claude Architect →

From your application's perspective, you designate cache breakpoints using the cache_control parameter in your API request. Content before the breakpoint is eligible for caching; content after it (typically the dynamic user query) is processed normally on every call. The first request that sets a breakpoint pays full price to process and store the prefix. Subsequent requests that share the same prefix pay the reduced cache read rate.

The cache has a 5-minute time-to-live (TTL). Any request that hits the cached prefix within 5 minutes of the last access resets the TTL. For applications with consistent traffic – which covers most production enterprise systems – the cache stays warm indefinitely. For batch workloads with infrequent requests, you may need to architect around the TTL, which we cover below.

Cache Pricing Mechanics

There are three distinct rates for cached content on Claude Sonnet 4.6:

Cache write: When content is first cached, you pay a 25% premium on top of the standard input rate. For Sonnet at $3/M input tokens, cache writes cost $3.75/M tokens. This premium is charged once, on the first request.

Cache read: Every subsequent request that reads from cache pays 10% of the standard input rate – $0.30/M tokens for Sonnet. This is the 90% reduction that makes prompt caching so powerful.

Non-cached input: Content not covered by a cache breakpoint – typically the dynamic user message – is billed at the standard $3/M input rate.

Output tokens: Output is unaffected by caching. Output tokens are always billed at the standard output rate regardless of cache status.
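As a back-of-envelope check of these rates, here is a short sketch. The function and its defaults are illustrative, and it assumes steady traffic keeps the cache warm for the whole month, so the write premium is paid exactly once:

```python
def monthly_prompt_cost(prompt_tokens, requests, cached=True,
                        input_rate=3.00, write_premium=0.25, read_fraction=0.10):
    """Estimated monthly cost (USD) of processing one static prompt prefix,
    at a $/M-token input rate, assuming the cache never goes cold."""
    millions = prompt_tokens / 1_000_000
    if not cached:
        return millions * input_rate * requests
    write = millions * input_rate * (1 + write_premium)             # first request
    reads = millions * input_rate * read_fraction * (requests - 1)  # cache hits
    return write + reads

# 5,000-token system prompt, 100,000 requests/month (Sonnet rates)
print(round(monthly_prompt_cost(5000, 100_000, cached=False), 2))  # 1500.0
print(round(monthly_prompt_cost(5000, 100_000), 2))                # 150.02
```

This reproduces the $1,500/month figure from the opening example and shows the cached equivalent landing at roughly a tenth of it.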

Implementing Cache Breakpoints

A cache breakpoint is created by adding "cache_control": {"type": "ephemeral"} to a content block in your API request. The prefix up to and including that block is eligible for caching. You can place up to four cache breakpoints per request, allowing you to cache different segments of a complex prompt independently.

Basic Breakpoint Implementation

import anthropic

client = anthropic.Anthropic()

# Single breakpoint - cache the system prompt
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2000,
    system=[
        {
            "type": "text",
            "text": """You are a senior legal analyst specialising in M&A contract review.
Your role is to identify liability exposure, ambiguous terms, unfavourable
indemnification clauses, and non-standard representations and warranties.

Output format: JSON with keys: issues (array), risk_level (low/medium/high),
recommended_actions (array), executive_summary (string).

[... full 3,000 token system prompt ...]""",
            "cache_control": {"type": "ephemeral"}  # Cache breakpoint here
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Review this contract clause: [dynamic clause text]"
        }
    ]
)

# Check cache performance in response usage
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")

The usage object in the response includes two additional fields alongside the standard input_tokens (non-cached input): cache_creation_input_tokens (tokens written to cache on this request) and cache_read_input_tokens (tokens read from cache). Monitoring these fields lets you measure your actual cache hit rate in production.

Multiple Breakpoints for Complex Prompts

# Multiple breakpoints - cache static layers independently
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4000,
    system=[
        {
            "type": "text",
            "text": "[Role definition - 800 tokens]",
            "cache_control": {"type": "ephemeral"}  # Breakpoint 1
        },
        {
            "type": "text",
            "text": "[Output format instructions - 600 tokens]",
            "cache_control": {"type": "ephemeral"}  # Breakpoint 2
        },
        {
            "type": "text",
            "text": "[Today's regulatory context - updates daily, not cached]"
            # No cache_control โ€” processed fresh each time
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "[Dynamic user query]"
        }
    ]
)

This pattern is useful when parts of your system prompt are truly static (role definition, output format) while others change on a schedule (daily regulatory updates, current market data, user-specific context). You cache only what is stable.

Are You Paying Full Price for Every Repeated System Prompt?

Most teams are. Our API integration team typically cuts 60–80% off a client's Claude API bill within the first month by implementing prompt caching, batch routing, and model selection optimisation.

Book a Free Cost Review →

Minimum Token Requirements and Cache Eligibility

Not every prompt qualifies for caching. Anthropic imposes minimum token requirements for cache breakpoints:

Claude Opus: 2,048 tokens minimum before a cache breakpoint.

Claude Sonnet and Haiku: 1,024 tokens minimum before a cache breakpoint.

If the content before your breakpoint is shorter than the minimum, the breakpoint is silently ignored and the content is processed at the standard rate. This catches teams off guard – they add cache breakpoints expecting a discount, but their system prompt is only 600 tokens and caching is never invoked.

The practical implication: for short system prompts, prompt caching is not the right tool. Focus instead on model selection (using Haiku for simple tasks where Sonnet is overkill) and batching. For prompts that approach or exceed 1,024 tokens – which describes most production enterprise prompts – caching is almost always worth implementing.
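A guard like the following can make the threshold explicit in your prompt-construction code. The helper is illustrative; the minimums mirror the figures quoted in this article, so re-verify them against current Anthropic documentation before relying on them:

```python
# Minimum prefix sizes for a cache breakpoint, per the thresholds quoted above
CACHE_MINIMUMS = {"opus": 2048, "sonnet": 1024, "haiku": 1024}

def breakpoint_will_register(model, prefix_tokens):
    """True if the content before the breakpoint is long enough to be cached."""
    for family, minimum in CACHE_MINIMUMS.items():
        if family in model:
            return prefix_tokens >= minimum
    raise ValueError(f"unknown model family: {model}")

# A 1,500-token prefix caches on Sonnet but would not on Opus
assert breakpoint_will_register("claude-sonnet-4-6", 1500)
assert not breakpoint_will_register("claude-opus-4", 1500)
```

Calling this at request-build time (and logging when it returns False) turns the silent-ignore failure mode into a visible one.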

Padding to Meet the Minimum: Worth It?

Some teams pad short system prompts with additional context or few-shot examples specifically to cross the 1,024-token threshold and enable caching. This is worth doing if the additional context improves response quality – in which case you gain both quality and cost benefits simultaneously. It is not worth doing purely for caching if the added content does not improve outputs; the richer context may actually slow the model or introduce irrelevant guidance.

High-Value Prompt Caching Patterns for Enterprise

Prompt caching delivers the most value when the static portion of your prompt is both large and reused frequently. Here are the highest-value patterns we implement for enterprise clients.

Pattern 1: Large System Prompt Caching

The most common pattern. Any enterprise application with a detailed system prompt – role definition, compliance instructions, output format, few-shot examples – should cache the entire system prompt. The cache write premium is paid only when the cache is cold; with steady traffic the cache stays warm, and every subsequent call reads from cache. For a 5,000-token system prompt at 1,000 requests per hour, monthly savings on the system prompt alone run well into four figures at Sonnet pricing.

Pattern 2: Document Context Caching

For document-centric workflows – contract review, research synthesis, regulatory analysis – you often attach a large document to the prompt and ask multiple questions about it. Without caching, each question reprocesses the entire document. With caching, set a breakpoint after the document content and each follow-up question pays only the cache read rate on the document itself.

messages=[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "[Full 8,000-token contract text]",
                "cache_control": {"type": "ephemeral"}  # Cache the document
            },
            {
                "type": "text",
                "text": "What are the key termination clauses?"
                # Dynamic question - not cached
            }
        ]
    }
]

Pattern 3: Multi-Turn Conversation Caching

In multi-turn conversations, the conversation history grows with each turn. Without caching, each new turn reprocesses the entire conversation history. With caching, place a breakpoint on the final block of the accumulated history, so only the newest user message is processed at the full rate. For long conversations – common in enterprise research assistants and complex advisory tools – this dramatically reduces per-turn cost as conversations mature.
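A minimal sketch of that turn-construction logic (the helper name and structure are illustrative; only the cache_control placement follows the API shape shown earlier):

```python
def build_turn(history, new_user_message):
    """Return a messages list with a cache breakpoint on the last history block,
    so the accumulated history is read from cache and only the new turn is fresh."""
    messages = [dict(m) for m in history]  # shallow copies; don't mutate caller state
    if messages:
        content = messages[-1]["content"]
        # Normalise string content to block form so cache_control can attach
        if isinstance(content, str):
            content = [{"type": "text", "text": content}]
        content = [dict(b) for b in content]
        content[-1] = {**content[-1], "cache_control": {"type": "ephemeral"}}
        messages[-1]["content"] = content
    messages.append({"role": "user", "content": new_user_message})
    return messages

history = [
    {"role": "user", "content": "Summarise the indemnification clauses."},
    {"role": "assistant", "content": "The contract contains three such clauses..."},
]
messages = build_turn(history, "Which of those carries the most liability exposure?")
```

Because the breakpoint moves to a new position each turn, the previous turn's cached prefix remains a valid prefix of the new one, so each turn's history read still hits the cache.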

Pattern 4: RAG Context Caching

In RAG (retrieval-augmented generation) systems, retrieved document chunks are injected into the prompt before the user query. For users who submit multiple queries against the same retrieved context – as often happens in document analysis workflows – caching the retrieved chunks eliminates repeated processing. Our RAG architecture guide covers this pattern in detail alongside vector store design and retrieval strategies.
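One way to sketch the construction (illustrative helper; note the retrieved chunks must be identical and identically ordered across calls for the prefix to match):

```python
def build_rag_prompt(chunks, question):
    """Place retrieved chunks first, with a breakpoint after the last chunk,
    so follow-up questions against the same retrieval set hit the cache."""
    blocks = [{"type": "text", "text": c} for c in chunks]
    if blocks:
        blocks[-1]["cache_control"] = {"type": "ephemeral"}  # cache all chunks
    blocks.append({"type": "text", "text": question})         # dynamic, not cached
    return [{"role": "user", "content": blocks}]

messages = build_rag_prompt(
    ["[retrieved chunk 1]", "[retrieved chunk 2]"],
    "What changed between the two drafts?",
)
```

If your retriever returns chunks in nondeterministic order, sort them by a stable key before building the prompt; otherwise the prefix differs per call and the cache never hits.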

Measuring and Optimising Cache Performance

Prompt caching only saves money when requests actually hit the cache. The hit rate depends on request frequency, cache TTL, and how consistently your prompts share the same prefix. Measuring your cache hit rate is essential – otherwise you are guessing at whether your investment in caching architecture is paying off.

Tracking Cache Metrics

from collections import defaultdict
import time

cache_metrics = defaultdict(lambda: {"writes": 0, "reads": 0, "uncached": 0})

def track_usage(response, prompt_type: str):
    """Log cache performance from API response usage."""
    usage = response.usage
    cache_metrics[prompt_type]["writes"] += getattr(
        usage, "cache_creation_input_tokens", 0
    )
    cache_metrics[prompt_type]["reads"] += getattr(
        usage, "cache_read_input_tokens", 0
    )
    cache_metrics[prompt_type]["uncached"] += usage.input_tokens

def cache_hit_rate(prompt_type: str) -> float:
    """Calculate cache hit rate as a percentage."""
    m = cache_metrics[prompt_type]
    total_potential = m["writes"] + m["reads"]
    if total_potential == 0:
        return 0.0
    return (m["reads"] / total_potential) * 100

# Example reporting
for ptype, metrics in cache_metrics.items():
    rate = cache_hit_rate(ptype)
    print(f"{ptype}: {rate:.1f}% cache hit rate | "
          f"Reads: {metrics['reads']:,} | Writes: {metrics['writes']:,}")

A healthy cache hit rate for a production system with consistent traffic is 80–95%. Rates below 60% typically indicate one of three problems: request frequency is too low to keep the cache warm (solution: consider pre-warming), the cacheable prefix is not consistent across requests (solution: review your prompt construction logic), or requests arrive in bursts that miss the 5-minute TTL window (solution: implement an active cache warm-up call at burst start).

Combining Prompt Caching with the Batch API

Prompt caching and the Batch API are complementary cost reduction strategies. The Batch API's 50% discount applies to all tokens, including cached tokens – meaning you pay 10% × 50% = 5% of the standard input rate for cached tokens in a batch request. For a pipeline with a 5,000-token cached system prompt, the effective cost per batch request on the cached portion is approximately $0.15 per million tokens – a 95% reduction from the standard input rate.

This combination is the foundation of the cost architecture we deploy for every large-scale document processing pipeline. The rough cost math for a high-volume batch workflow with a large static system prompt: base rate $3/M → 50% batch discount → $1.50/M → 90% cache discount on static portion → $0.15/M effective rate on cached tokens. For processing millions of documents, this difference compounds to hundreds of thousands of dollars annually.
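The stacked discounts reduce to a single multiplication (rates as quoted above for Sonnet):

```python
input_rate = 3.00           # $/M tokens - Sonnet standard input
batch_discount = 0.50       # Batch API: 50% off all tokens
cache_read_fraction = 0.10  # cache reads bill at 10% of the input rate

effective = input_rate * batch_discount * cache_read_fraction
print(f"${effective:.2f}/M on cached tokens inside a batch")  # $0.15/M
```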

If you are running or planning high-volume Claude workloads and have not implemented both patterns, book a call with our team. We regularly conduct one-session cost architecture reviews that identify and quantify savings before a single line of code is changed.

Common Prompt Caching Mistakes

Mistake 1: Not meeting the minimum token threshold. The most common failure. Always verify that the content before your breakpoint exceeds 1,024 tokens (Sonnet/Haiku) or 2,048 tokens (Opus). Monitor cache_creation_input_tokens in the response – if it is consistently zero, your breakpoints are not being registered.

Mistake 2: Varying the cacheable prefix. If you construct your system prompt dynamically – for example, by injecting the current date, the user's name, or the specific tenant context – you may accidentally invalidate the cache on every call. Move any dynamic content after the cache breakpoint, or into the user message. The prefix up to the breakpoint must be byte-for-byte identical across requests to hit the cache.
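A sketch of the fix: keep the dynamic value in a block after the breakpoint (block contents here are placeholders):

```python
from datetime import date

# Static prefix: byte-for-byte identical on every request, so it caches.
# Dynamic values live in a block AFTER the breakpoint (or in the user message).
SYSTEM_BLOCKS = [
    {
        "type": "text",
        "text": "[Role definition, compliance rules, output format - fully static]",
        "cache_control": {"type": "ephemeral"},  # breakpoint: the prefix above is cached
    },
    {
        "type": "text",
        "text": f"Today's date: {date.today().isoformat()}",  # changes daily, never cached
    },
]
```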

Mistake 3: Not warming the cache before batch jobs. If you run a nightly batch job that processes 10,000 documents, the first request in the batch writes the cache. If your job starts at midnight and the cache was last touched at 11:57 PM (within the 5-minute TTL), you are fine. If the last daytime request was at 6 PM, the cache is cold and the first batch request pays the write premium. For predictable batch workloads, send a single warm-up request immediately before starting the batch to ensure the cache is hot.
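A simple guard for deciding whether a warm-up call is needed (illustrative; last_access_ts would come from your own request logging):

```python
import time

CACHE_TTL_SECONDS = 5 * 60  # 5-minute TTL, refreshed on each cache hit

def cache_likely_warm(last_access_ts, now=None):
    """True if the last request sharing this prefix landed within the TTL window."""
    now = time.time() if now is None else now
    return (now - last_access_ts) <= CACHE_TTL_SECONDS
```

If this returns False before a batch run, send one minimal request that shares the batch's cached prefix (max_tokens=1 is enough, since only the prefix matters) and wait for it to complete before fanning out the batch.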

Mistake 4: Applying caching without measuring it. Teams implement prompt caching, see no cost reduction in their billing, and assume it is not working – when in reality their request pattern is just too sparse to keep the cache warm. Always instrument your application to track cache_read_input_tokens separately from input_tokens. Without measurement, you are flying blind on your cost architecture.

Mistake 5: Forgetting about Opus minimums. Teams that develop on Sonnet (1,024 token minimum) and then upgrade to Opus in production (2,048 token minimum) may find their cache breakpoints are no longer triggered if their system prompt is between 1,024 and 2,048 tokens. Always re-verify caching configuration after a model upgrade.

For a comprehensive review of your current API architecture – including prompt caching, batch routing, model selection, and extended thinking placement – our Claude API integration service conducts a full cost and architecture audit as the first phase of every engagement.

✅ Key Takeaways

Prompt caching reduces the cost of cached token reads to 10% of the standard input rate. Use cache_control: {type: "ephemeral"} to mark cache breakpoints in your system prompt or message content. The minimum for a valid breakpoint is 1,024 tokens (Sonnet/Haiku) and 2,048 tokens (Opus). The cache has a 5-minute TTL refreshed on each access. Combining prompt caching with the Batch API achieves effective rates as low as 5% of the standard input price on cached tokens. Always measure your cache hit rate via cache_read_input_tokens in the response usage.

ClaudeImplementation Team

Claude Certified Architects with production deployments across financial services, legal, and healthcare. Learn about our team →