Claude Context Windows in 2026
Claude token management begins with understanding the context window limits for each model and what those limits mean in practice. The context window defines the maximum combined size of input plus output in a single API request: system prompt, conversation history, retrieved documents, tool results, and generated response all count against it.
| Model | Context Window | Max Output Tokens | Typical Use Case |
|---|---|---|---|
| claude-opus-4-6 | 200,000 tokens | 32,000 tokens | Long document analysis, complex reasoning |
| claude-sonnet-4-6 | 200,000 tokens | 64,000 tokens | Production workloads, code generation |
| claude-haiku-4-5 | 200,000 tokens | 8,192 tokens | Classification, extraction, short generation |
200,000 tokens is approximately 150,000 words, enough for a full-length novel. For most enterprise applications, hitting the context window limit in a single request is not the primary concern. The primary concern is managing the cumulative token growth of multi-turn conversations, RAG pipelines that append retrieved content to every message, and agentic workflows that accumulate tool results across many steps.
For Claude API integration in production, the context window is most often a cost and latency concern rather than a hard limit concern. Every token in the context window is paid for, even if Claude's attention is not focused on that part of the window. Long contexts also increase latency: a 150,000-token context takes longer to process than a 10,000-token context. Token management is therefore primarily an economic and performance discipline.
Accurate Token Counting Before You Send
Anthropic's API includes a token counting endpoint that lets you calculate the exact token count of a request before sending it. This is essential for applications that need to validate context size, implement dynamic truncation, or calculate expected costs before committing to an API call.
```python
import anthropic

client = anthropic.Anthropic()

def count_tokens(messages: list, system: str | None = None,
                 model: str = "claude-sonnet-4-6") -> int:
    """Count tokens for a message list before sending the request."""
    params = {
        "model": model,
        "messages": messages,
    }
    if system:
        params["system"] = system
    # The count_tokens endpoint takes the same message shape as
    # messages.create but does not require max_tokens.
    response = client.messages.count_tokens(**params)
    return response.input_tokens

# Usage (filing_text: the document text, loaded elsewhere)
system_prompt = "You are a financial analysis assistant..."
messages = [
    {"role": "user", "content": "Analyse the following 10-K filing: " + filing_text}
]

token_count = count_tokens(messages, system=system_prompt)
print(f"Request will consume: {token_count:,} tokens")
print(f"Estimated cost: ${token_count * 3.0 / 1_000_000:.4f} (Sonnet input rate)")

# Check if we need to truncate
MAX_SAFE_TOKENS = 180_000  # Leave 20K for response
if token_count > MAX_SAFE_TOKENS:
    print(f"Warning: context is {token_count - MAX_SAFE_TOKENS:,} tokens over safe limit")
    # Apply truncation strategy
```
The token counting endpoint is charged at a minimal rate and is far less expensive than sending an oversized request that fails or produces a truncated response. Build token counting into any application that handles variable-length user inputs or dynamically assembled contexts.
How Claude Tokenises Text
Claude uses a BPE (Byte Pair Encoding) tokeniser that is similar to but not identical to the GPT-4 tokeniser. The practical implications for token counting estimates:
- English prose tokenises at roughly 1.3 tokens per word, or approximately 1 token per 4 characters (equivalently, about 0.75 words per token).
- Technical content (code, JSON, XML, URLs) tokenises less efficiently, often at 1.5–2 tokens per word equivalent.
- Non-English languages tokenise at 1.5–3x the rate of English for many European languages, and 3–5x for languages with non-Latin scripts.
- Whitespace, newlines, and formatting characters all consume tokens. A heavily formatted markdown document will tokenise significantly more expensively than the same content as plain prose.
This matters practically: if you are injecting JSON payloads, XML documents, or code into your prompts, your actual token consumption will significantly exceed a naive word-count estimate. Always use the counting endpoint rather than heuristic estimates for production token budget management.
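As a development-time sanity check, these heuristics can be folded into a rough estimator. The multipliers below are illustrative assumptions taken from the ranges above, not measured values; the counting endpoint remains the source of truth for production budgets.

```python
def rough_token_estimate(text: str, content_type: str = "prose") -> int:
    """Heuristic token estimate: ~1 token per 4 characters for English
    prose, scaled up for content that tokenises less efficiently.
    Illustrative only; use count_tokens for real budget decisions."""
    multipliers = {
        "prose": 1.0,      # English prose: ~4 chars per token
        "code": 1.7,       # code/JSON/XML: assumed midpoint of the 1.5-2x range
        "non_latin": 4.0,  # non-Latin scripts: assumed midpoint of the 3-5x range
    }
    base = len(text) / 4
    return int(base * multipliers.get(content_type, 1.0))

print(rough_token_estimate("The quick brown fox jumps over the lazy dog."))  # 11
```

Use this only to short-circuit obviously oversized requests before the counting call; it will drift badly on exactly the content types listed above.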
Multi-Turn Conversation Token Management
Multi-turn chat applications accumulate context with every message. If you are appending full conversation history to every request, a 30-turn conversation becomes an expensive context window that grows linearly with each turn. Without management, you eventually hit the context window limit and the application fails, often unexpectedly in production, during a user's longest conversation.
Strategy 1: Rolling Window
Keep only the N most recent messages in the context. The simplest approach: keep the system prompt plus the last 10 messages. The risk is losing important context from earlier in the conversation: the user's initial request, their stated preferences, or established facts they should not need to repeat.
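A minimal sketch of the rolling window, assuming the system prompt is passed via the API's separate `system` parameter (so it is never trimmed):

```python
def rolling_window(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    """Keep only the most recent messages. The system prompt lives in the
    API's `system` parameter, not the message list, so it always survives."""
    if len(messages) <= keep_recent:
        return messages
    trimmed = messages[-keep_recent:]
    # The Messages API expects history to start with a user turn;
    # drop a leading assistant message if truncation left one first.
    while trimmed and trimmed[0]["role"] == "assistant":
        trimmed = trimmed[1:]
    return trimmed
```

The leading-role check matters in practice: naive slicing frequently produces a history that starts mid-exchange with an assistant message.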
Strategy 2: Progressive Summarisation
After every 5โ10 turns, summarise the conversation so far using claude-haiku-4-5 and replace the message history with the summary. The summary takes one message position and costs a fraction of the tokens of the original history. This preserves semantic content at lower token cost. Implementation requires a background summarisation step that does not block the main conversation flow.
```python
# Assumes an async client: client = anthropic.AsyncAnthropic()
async def compress_conversation_history(
    messages: list[dict],
    summarise_after: int = 8,
    keep_recent: int = 4
) -> list[dict]:
    """Compress conversation history when it grows too long."""
    if len(messages) <= summarise_after + keep_recent:
        return messages  # No compression needed yet

    # Split: history to summarise + recent messages to keep verbatim
    to_summarise = messages[:len(messages) - keep_recent]
    recent = messages[len(messages) - keep_recent:]

    # Summarise older messages with Haiku (cheap)
    summary_prompt = (
        "Summarise this conversation concisely. "
        "Preserve: key decisions, user preferences, established facts, "
        "and any important context the assistant will need.\n\n"
        + "\n".join(f"{m['role']}: {m['content']}" for m in to_summarise)
    )
    summary_response = await client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=500,
        messages=[{"role": "user", "content": summary_prompt}]
    )
    summary_text = summary_response.content[0].text

    # Replace history with summary
    compressed = [
        {"role": "user", "content": f"[Conversation summary: {summary_text}]"},
        {"role": "assistant", "content": "Understood. Continuing from where we left off."}
    ] + recent
    return compressed
```
Strategy 3: Semantic Memory Injection
For long-running assistant applications, implement an explicit memory layer. Extract key facts and preferences from each conversation turn and store them in a structured format (a simple database table or a vector store). At the start of each request, retrieve and inject only the relevant memories. This decouples conversation length from context size: a user can have a 1,000-turn history, and their context window stays small.
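A minimal sketch of such a memory layer; `MemoryStore` and its keyword-overlap scoring are hypothetical placeholders, and a production system would use the vector store mentioned above.

```python
class MemoryStore:
    """Toy fact store keyed by user; stands in for a DB table or vector store."""

    def __init__(self):
        self._facts: dict[str, list[str]] = {}

    def remember(self, user_id: str, fact: str) -> None:
        """Store a fact extracted from a conversation turn."""
        self._facts.setdefault(user_id, []).append(fact)

    def relevant(self, user_id: str, query: str, limit: int = 5) -> list[str]:
        """Naive keyword-overlap relevance; a real system would embed and search."""
        words = set(query.lower().split())
        facts = self._facts.get(user_id, [])
        scored = sorted(facts, key=lambda f: -len(words & set(f.lower().split())))
        return scored[:limit]

def build_context(store: MemoryStore, user_id: str, query: str) -> str:
    """Inject only relevant memories, keeping context size independent of
    total conversation history length."""
    memory_block = "\n".join(f"- {m}" for m in store.relevant(user_id, query))
    return f"Relevant user context:\n{memory_block}\n\nUser query: {query}"
```

The key property is that `build_context` is bounded by `limit`, not by how many turns the user has accumulated.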
Our article on Claude memory and personalisation covers the implementation of persistent memory systems in detail.
Token Management for RAG Pipelines
RAG applications have two distinct token management problems: how many chunks to retrieve, and how large each chunk should be. Both have direct cost and quality implications.
The instinct to retrieve more chunks ("more context = better answers") is frequently wrong. Each retrieved chunk occupies context window space that costs money per request. Retrieving 10 chunks when 3 would suffice adds roughly 2,000–4,000 tokens per request with no quality benefit. Worse, irrelevant chunks degrade quality by diluting the signal in the context window: Claude's attention gets distributed across more content, reducing focus on the most relevant passages.
The right approach is aggressive reranking: retrieve 20 candidate chunks, rerank by relevance to the specific query (using a cross-encoder or a dedicated reranking model), and pass only the top 3โ5 to Claude. This produces better quality answers at lower token cost. Our RAG architecture guide covers reranking implementation in detail.
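The retrieve-then-rerank flow can be sketched as follows; `vector_search` and `cross_encoder_score` are hypothetical stand-ins for your retriever and reranking model.

```python
def select_chunks(query: str, vector_search, cross_encoder_score,
                  n_candidates: int = 20, n_final: int = 4) -> list[str]:
    """Retrieve broadly, rerank against the specific query, pass few.

    vector_search(query, top_k=...) -> list[str] of candidate chunks
    cross_encoder_score(query, chunk) -> float relevance score
    """
    candidates = vector_search(query, top_k=n_candidates)
    scored = [(cross_encoder_score(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most relevant first
    return [chunk for _, chunk in scored[:n_final]]
```

The cost asymmetry is the point: scoring 20 candidates with a small reranker is cheap, while sending 20 chunks to Claude on every request is not.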
Chunk size also matters. Most teams use a fixed chunk size of 512โ1,024 tokens. Smaller chunks improve retrieval precision but may miss context that spans chunk boundaries. Larger chunks improve contextual coherence but reduce retrieval signal-to-noise. The optimal chunk size is task-specific, and the only way to find it is evaluation on your specific dataset. Build token counting into your chunk selection to verify your actual context size before each request.
Building RAG or agentic pipelines with context management challenges?
Our Claude API integration service includes full context window management architecture. We have designed token management strategies for RAG systems processing millions of documents daily.
Book a Free Consultation →

Token Management in Agentic Workflows
Agentic Claude applications, where Claude makes multiple tool calls, receives tool results, and produces a final synthesis, accumulate context across every step of the workflow. A 10-step agentic task with 2,000-token tool results per step adds 20,000 tokens of tool output to the context before Claude writes its final response. Without management, complex agents hit context limits on long-running tasks.
The key design principle is result compression: when a tool result is verbose, extract only the relevant portion before inserting it into the context. A database query returning 100 rows when only 5 are relevant should be reduced to those 5 rows before the result is added to the conversation. A web page scraped for research should be summarised to the relevant paragraphs. This requires building a layer between tool execution and context injection: more engineering work upfront, but essential for long-running agent stability.
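A sketch of that compression layer for the database example; the row format and the id-based relevance test are illustrative assumptions.

```python
def compress_tool_result(rows: list[dict], relevant_ids: set[int],
                         max_rows: int = 5) -> str:
    """Reduce a verbose tool result to the rows the agent actually needs
    before it is appended to the conversation context."""
    kept = [r for r in rows if r["id"] in relevant_ids][:max_rows]
    omitted = len(rows) - len(kept)
    lines = [str(r) for r in kept]
    if omitted > 0:
        # Tell the model content was dropped, so it does not assume
        # the result set was complete.
        lines.append(f"[{omitted} irrelevant rows omitted to save context]")
    return "\n".join(lines)
```

Noting the omission explicitly in the compressed result is a deliberate choice: it keeps the agent from over-generalising from a filtered view of the data.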
For complex multi-agent architectures, context management is a first-class architectural concern. Each sub-agent should operate on a focused context slice relevant to its task, and only the relevant output should be passed to the orchestrating agent. Our multi-agent systems guide covers this architecture in depth.
Monitoring Token Usage in Production
Every Claude API response includes usage metadata in the response object. Capture it on every request and store it tagged with feature, model, and user/tenant identifiers. This is the data that tells you where your token spend is going and whether your context management strategies are working.
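A minimal capture sketch, assuming an in-memory list as the sink (swap in your metrics pipeline); the `usage.input_tokens` and `usage.output_tokens` fields are what the Messages API returns on every response.

```python
import time

usage_log: list[dict] = []  # stand-in for a metrics sink / warehouse table

def record_usage(response, feature: str, model: str, tenant_id: str) -> None:
    """Store per-request token usage tagged for later aggregation
    by feature, model, and tenant."""
    usage_log.append({
        "ts": time.time(),
        "feature": feature,
        "model": model,
        "tenant": tenant_id,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    })

# After any call:
# response = client.messages.create(...)
# record_usage(response, feature="chat", model="claude-sonnet-4-6", tenant_id="t-42")
```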
Key metrics to track: average input tokens per request by feature, output/input token ratio by feature (a high ratio suggests outputs are longer than necessary), cache hit rate if prompt caching is enabled, and p99 input token count (to identify requests approaching context limits before they fail). See our Claude monitoring and observability guide for the full metrics architecture.
Set alerts on average input token growth over time for chat applications. If average input tokens per conversation turn are growing week-over-week without a corresponding growth in request volume, your conversation history management is not working. This is the early warning signal before you start hitting context window limits or unexpected cost spikes.
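The week-over-week check can be sketched as a simple threshold alert; the 15% threshold and the weekly-averages input shape are illustrative assumptions to tune against your own traffic.

```python
def input_token_growth_alert(weekly_avgs: list[float],
                             threshold: float = 0.15) -> bool:
    """Return True if average input tokens per turn grew more than
    `threshold` (default 15%) versus the previous week."""
    if len(weekly_avgs) < 2:
        return False  # not enough history to compare
    prev, curr = weekly_avgs[-2], weekly_avgs[-1]
    return prev > 0 and (curr - prev) / prev > threshold

print(input_token_growth_alert([4200, 5100]))  # True: ~21% growth
```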
- All Claude models support 200,000-token context windows in 2026; the constraint is cost and latency, not hard limits.
- Use the token counting endpoint for variable-content requests; heuristic estimates diverge significantly for code, JSON, and non-English text.
- Multi-turn conversations need active context management (summarisation or rolling window) to prevent unbounded growth.
- RAG pipelines should rerank before passing chunks to Claude; more chunks does not mean better quality.
- Track average input tokens per request in production; unexpected growth is an early warning sign of context management failure.