What extended thinking is · How thinking tokens work · The budget_tokens parameter · When to use it vs. standard mode · Enterprise use cases · Cost management · Code examples · Common mistakes
What Is Claude Extended Thinking?
Claude extended thinking is a mode, available through the Claude API, that allocates a block of computation (measured in tokens) for the model to reason through a problem before producing its final answer. You can think of it as giving Claude a private scratchpad: it works through possibilities, checks its logic, and identifies errors before committing to a response.
Standard Claude API calls follow a single pass: receive prompt, generate tokens, return completion. Extended thinking adds a dedicated thinking phase before the answer. You control how large that phase can be by setting a budget_tokens parameter. Claude uses up to that budget to think through the problem, then produces a final answer. You can optionally expose the thinking content to users, or suppress it and return only the final response.
The capability first shipped with Claude 3.7 Sonnet in early 2025 and represents Anthropic's implementation of chain-of-thought reasoning at the model level rather than the prompt level. You no longer need to engineer "think step by step" workarounds; the architecture handles it natively. Enterprises using extended thinking for legal contract review, financial modelling, and complex code debugging have reported accuracy improvements of 20–35% on their most difficult tasks.
How It Differs from Prompt-Level Chain-of-Thought
Prompt-level chain-of-thought (CoT), instructing the model to "think step by step" in your system prompt, works, but it exposes reasoning in the output token stream and adds latency proportional to however verbose the model decides to be. Extended thinking runs in a separate block with its own token budget, does not interfere with your output format, and is architecturally more reliable because the model has been trained specifically to use this reasoning channel. It also allows the model to backtrack internally, reconsidering earlier conclusions, in ways that text-in-output reasoning cannot.
How Extended Thinking Tokens Work
When you enable extended thinking, the API response includes a thinking content block alongside the text content block. The thinking block contains the model's reasoning trace. Here is the basic API structure:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000
    },
    messages=[{
        "role": "user",
        "content": "Analyse this contract clause for ambiguity "
                   "and potential liability exposure: [clause text]"
    }]
)

# Access thinking content
for block in response.content:
    if block.type == "thinking":
        print("THINKING:", block.thinking)
    elif block.type == "text":
        print("RESPONSE:", block.text)
```
The budget_tokens value sets the upper bound on how many tokens the model can use for reasoning. It is not a guaranteed allocation: if the model resolves the problem in 2,000 tokens of thinking when you set 10,000, it will stop early. The model is trained to use the budget efficiently rather than pad it.
Token Counting and Cost Implications
This is the part most teams get wrong. Thinking tokens count against your token usage. If you set budget_tokens to 10,000 and Claude uses 8,500 of them, those 8,500 tokens are billed as output tokens on the model you are using. For Claude Sonnet, output tokens cost roughly $15 per million as of early 2026. A call with 2,000 input tokens, 8,500 thinking tokens, and 1,500 output tokens is therefore billed for 10,000 output tokens, not 1,500, on top of the input tokens.
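As a sketch, the billing arithmetic can be wrapped in a helper that takes your current per-million-token rates as inputs (rates change over time; check Anthropic's pricing page rather than hard-coding them):

```python
def estimate_call_cost(
    input_tokens: int,
    thinking_tokens: int,
    output_tokens: int,
    input_rate_per_mtok: float,
    output_rate_per_mtok: float,
) -> float:
    """Estimate the USD cost of one extended thinking call.

    Thinking tokens are billed at the output rate here; confirm the
    billing rule against Anthropic's current pricing documentation.
    """
    billed_output = thinking_tokens + output_tokens
    return (
        input_tokens * input_rate_per_mtok
        + billed_output * output_rate_per_mtok
    ) / 1_000_000
```

Running this across a day of logged usage per task type gives you a concrete cost baseline before you tune budgets.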
This makes extended thinking a deliberate architectural choice, not a default. Our Claude API integration service routinely identifies where teams are applying extended thinking indiscriminately and costing themselves 5–10× more than necessary. For a deep dive on controlling API costs, see our article on Claude prompt caching, which can reduce base costs by up to 90%.
Configuring budget_tokens: Practical Guidelines
Selecting the right budget_tokens value requires understanding your task complexity. Setting it too low means the model cuts off its reasoning mid-analysis; setting it too high wastes money on simple questions. Here are the ranges we recommend after deploying extended thinking across financial services, legal, and engineering organisations:
| Task Type | Recommended Budget | Example Task |
|---|---|---|
| Straightforward analysis | 2,000–5,000 | Summarise a document with specific formatting requirements |
| Multi-step reasoning | 5,000–10,000 | Contract clause risk analysis, debugging complex code logic |
| Deep problem-solving | 10,000–20,000 | Regulatory compliance gap analysis, architectural decision review |
| Research synthesis | 20,000–32,000 | Cross-document synthesis, complex financial modelling |
The minimum value for budget_tokens when extended thinking is enabled is 1,024 tokens. The maximum is model-dependent but is currently 32,000 on Claude Sonnet 4.6. Note that max_tokens must be set high enough to accommodate both the thinking block and the output; for most extended thinking calls, we recommend setting max_tokens to at least budget_tokens + 4,000.
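The table above and the max_tokens rule can be folded into a small helper. The task-type keys and the 4,000-token output headroom are this article's recommendations, not API constants; a minimal sketch:

```python
MIN_THINKING_BUDGET = 1_024   # API minimum when thinking is enabled
MAX_THINKING_BUDGET = 32_000  # model-dependent ceiling; verify per model

# Illustrative defaults based on the task-type table above
BUDGET_BY_TASK = {
    "straightforward": 5_000,
    "multi_step": 10_000,
    "deep": 20_000,
    "research": 32_000,
}

def thinking_config(task_type: str, output_headroom: int = 4_000) -> dict:
    """Return thinking and max_tokens kwargs for messages.create()."""
    budget = BUDGET_BY_TASK.get(task_type, MIN_THINKING_BUDGET)
    budget = max(MIN_THINKING_BUDGET, min(budget, MAX_THINKING_BUDGET))
    return {
        "thinking": {"type": "enabled", "budget_tokens": budget},
        "max_tokens": budget + output_headroom,
    }
```

The returned dict can be splatted directly into the API call, e.g. `client.messages.create(model=..., messages=..., **thinking_config("multi_step"))`.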
Streaming with Extended Thinking
Extended thinking is fully compatible with streaming API patterns. When streaming is enabled, thinking block deltas arrive before text block deltas, allowing you to display a real-time "reasoning" indicator in your UI. Many enterprise legal and financial tools expose the thinking stream to analysts as a transparency feature โ showing them exactly how the model evaluated the problem increases trust and adoption.
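A minimal sketch of the event handling, assuming the Python SDK's `messages.stream` helper and the `thinking_delta` / `text_delta` event shapes current at the time of writing (verify against your SDK version); the dispatcher is kept as a pure function so it can be unit-tested without an API key:

```python
from types import SimpleNamespace

def handle_event(event, on_thinking, on_text):
    """Dispatch one streaming event to the appropriate UI callback."""
    if getattr(event, "type", None) != "content_block_delta":
        return
    if event.delta.type == "thinking_delta":
        on_thinking(event.delta.thinking)   # reasoning trace chunk
    elif event.delta.type == "text_delta":
        on_text(event.delta.text)           # final-answer chunk

# In production the loop would read from the SDK, roughly:
#
#   with client.messages.stream(
#       model="claude-sonnet-4-6",
#       max_tokens=16000,
#       thinking={"type": "enabled", "budget_tokens": 10000},
#       messages=[{"role": "user", "content": prompt}],
#   ) as stream:
#       for event in stream:
#           handle_event(event, show_reasoning_indicator, append_answer)
```

Because thinking deltas arrive first, `on_thinking` is the natural hook for a "reasoning…" indicator that you replace once the first `text_delta` lands.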
Extended Thinking Enterprise Use Cases
Not every task justifies extended thinking. The cases where it consistently outperforms standard mode involve multiple conflicting constraints, ambiguous evidence, or decisions that require ruling out alternatives before committing to an answer.
Legal: Contract Risk Analysis
A standard Claude call can identify obvious problematic clauses in a contract. Extended thinking allows the model to trace through the implications of each clause against the others: identifying, for example, that an indemnification clause in section 8 contradicts a limitation-of-liability clause in section 14, creating a legal grey area that a quick read misses. Law firms deploying extended thinking for contract review report catching an additional 18–22% of clause conflicts compared to standard mode. Our enterprise implementation service has deployed these pipelines at three Am Law 100 firms.
Finance: Financial Model Validation
When you ask Claude to validate a discounted cash flow model, extended thinking allows it to check discount rate assumptions against the risk-free rate, test sensitivity assumptions for internal consistency, flag inconsistencies between growth rate assumptions and capex projections, and consider alternative valuation methodologies before concluding. These are exactly the steps a trained financial analyst takes, and extended thinking gives Claude the computation space to follow that same disciplined process.
Engineering: Complex Debugging
Production bugs in distributed systems often have no single obvious cause. Extended thinking allows Claude to reason through multiple failure hypotheses, eliminate each based on evidence in your logs and code, and arrive at the most likely cause rather than returning the first plausible explanation it encounters. Teams using extended thinking for incident postmortem analysis report identifying root causes significantly faster than without it.
Compliance: Regulatory Gap Analysis
Mapping your data practices against GDPR, CCPA, and sector-specific regulations simultaneously requires holding multiple regulatory frameworks in mind and reasoning through how each applies to each practice. Extended thinking is purpose-built for this kind of multi-constraint reasoning. For enterprises in regulated sectors, this is one of the highest-value applications of the Claude API, and one we cover in depth in our Claude API enterprise guide.
Extended Thinking vs. Standard Mode: When to Use Which
The most common mistake teams make is applying extended thinking everywhere after their initial experiments show quality improvements. This is expensive and counterproductive: on simple tasks, extended thinking adds latency without meaningful accuracy gains.
Use standard mode for: content generation, document summarisation, classification tasks, translation, simple Q&A, search augmentation, and any task where the correct answer is largely determined by pattern-matching on the input rather than multi-step reasoning. For these workloads, standard mode is faster, cheaper, and produces equivalent quality.
Use extended thinking for: decisions with multiple conflicting constraints, analysis of ambiguous or incomplete evidence, validation tasks where you need the model to rule out alternatives, complex planning with interdependent steps, and any scenario where surface-level pattern-matching is insufficient. For AI agent development, extended thinking is particularly valuable at the planning step of multi-step agentic workflows: allocate a higher reasoning budget at the "decide what to do next" stage rather than at every tool call.
Route tasks to extended thinking based on a complexity classifier. A simple binary classifier trained on your task types (cheap and fast to run) can decide whether to invoke extended thinking and at what budget. This approach typically reduces extended thinking costs by 60–80% compared to applying it indiscriminately, while preserving quality on the tasks that need it.
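A minimal heuristic router along these lines; the marker list and length threshold are illustrative placeholders, and a classifier trained on your own task types will do better:

```python
def should_use_extended_thinking(prompt: str) -> bool:
    """Cheap heuristic router: escalate long or analysis-heavy prompts.

    Replace with a trained classifier on your own task types
    before relying on this in production.
    """
    complexity_markers = (
        "analyse", "reconcile", "trade-off", "compare",
        "root cause", "compliance", "contradict",
    )
    long_input = len(prompt.split()) > 400
    has_marker = any(m in prompt.lower() for m in complexity_markers)
    return long_input or has_marker
```

The router's output then decides whether the request goes to a standard call or an extended thinking call with a task-appropriate budget.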
Cost Management for Extended Thinking at Scale
Once you move extended thinking from a prototype to production, cost management becomes critical. Here are the four mechanisms we implement for every enterprise deployment.
1. Dynamic Budget Routing
Classify incoming requests by complexity before sending to the API. A simple prompt-length heuristic or a lightweight classifier routes straightforward requests to standard mode and only escalates to extended thinking for genuinely complex inputs. Pair this with a prompt caching strategy for any static content in your system prompt; the combination of dynamic routing and caching is the single most effective cost reduction strategy.
2. Budget Caps by User Tier
If you are building a multi-tenant product, enforce budget caps per user tier. Enterprise-tier users might receive budget_tokens: 20000 by default; standard users might be capped at budget_tokens: 5000. This prevents a single power user from consuming a disproportionate share of your API budget on a single call.
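A sketch of tier-based capping; the tier names and cap values below are illustrative, not prescriptive:

```python
# Illustrative per-tier caps on budget_tokens
TIER_BUDGET_CAPS = {"enterprise": 20_000, "standard": 5_000}

def capped_budget(requested: int, tier: str) -> int:
    """Clamp a requested thinking budget to the user's tier cap."""
    cap = TIER_BUDGET_CAPS.get(tier, 5_000)  # unknown tiers get the lowest cap
    return max(1_024, min(requested, cap))   # 1,024 is the API minimum
```

Apply the clamp server-side, before the API call, so client-supplied values can never blow past the tier's ceiling.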
3. Monitoring and Alerting
Track average thinking token usage per task type in your application metrics. A spike in average thinking token usage often signals that a task type that used to be straightforward has drifted, perhaps because users are now submitting longer inputs or more ambiguous queries. Early detection prevents budget overruns.
4. Batch Processing for Non-Urgent Tasks
Extended thinking calls that do not require real-time responses (overnight compliance scans, weekly contract reviews, batch document analysis) should use the Batch API, which offers a 50% cost reduction. The combination of batch processing and extended thinking delivers the highest reasoning quality at the lowest per-call cost.
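A hedged sketch of building batch entries, assuming the Message Batches API's requests list of `{custom_id, params}` objects; verify the exact shape against the current SDK documentation before relying on it:

```python
def build_batch_request(custom_id: str, prompt: str, budget: int) -> dict:
    """One entry for client.messages.batches.create(requests=[...])."""
    return {
        "custom_id": custom_id,
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": budget + 4_000,  # headroom for the final answer
            "thinking": {"type": "enabled", "budget_tokens": budget},
            "messages": [{"role": "user", "content": prompt}],
        },
    }

# Submission would look roughly like (requires credentials):
#
#   client = anthropic.Anthropic()
#   batch = client.messages.batches.create(
#       requests=[build_batch_request(f"doc-{i}", text, 10_000)
#                 for i, text in enumerate(documents)]
#   )
```

The `custom_id` is what lets you match results back to source documents when the batch completes.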
Working with Thinking Output in Your Application
The thinking block in the API response is a string containing the model's internal reasoning. You have three options for how to handle it in your product.
Suppress it entirely. If your users do not need to see the reasoning (they just want the answer), simply ignore the thinking block in your response parsing. It still contributes to accuracy, but does not appear in your UI. Most chat applications take this approach.
Expose it as a "Show Reasoning" toggle. High-trust professional applications (legal review tools, financial analysis platforms, clinical decision support) often display the thinking trace on demand. This builds user trust and allows domain experts to catch reasoning errors before acting on the output.
Use it in downstream processing. In complex agentic pipelines, you can parse the thinking block to extract intermediate conclusions and feed them into the next step of your workflow. For example, in a multi-step contract analysis pipeline, the thinking output from a clause-level analysis call might be condensed and included in the context for a high-level risk summary call.
```python
# Example: Extracting and using thinking content downstream
thinking_content = ""
final_response = ""

for block in response.content:
    if block.type == "thinking":
        thinking_content = block.thinking
    elif block.type == "text":
        final_response = block.text

# Use thinking as context for a follow-up synthesis call
follow_up = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4000,
    system="You are synthesising a legal risk report.",
    messages=[
        {
            "role": "user",
            "content": f"Based on this analysis:\n\n{thinking_content}\n\n"
                       f"And this conclusion:\n\n{final_response}\n\n"
                       "Write an executive summary for the general counsel."
        }
    ]
)
```
Common Extended Thinking Mistakes (and How to Avoid Them)
After deploying extended thinking across a dozen enterprise systems, we have seen the same mistakes repeatedly. Here is the short list and the fix for each.
Mistake 1: Setting budget_tokens too low. If you set budget_tokens to 1,024 on a problem that needs 8,000 tokens of reasoning, the model truncates its thinking mid-analysis and produces an overconfident partial answer. Test with a generous budget first, measure actual thinking token usage, then set your production budget at the 90th percentile of what your task type requires.
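One way to implement the 90th-percentile rule, given thinking-token counts collected from your development logs (how you gather the counts is up to your instrumentation):

```python
import math

def p90_budget(observed_thinking_tokens: list[int],
               floor: int = 1_024, ceiling: int = 32_000) -> int:
    """Set a production budget at the 90th percentile of observed usage.

    floor is the API minimum; ceiling is model-dependent, so verify it
    for the model you deploy.
    """
    ordered = sorted(observed_thinking_tokens)
    idx = math.ceil(0.9 * len(ordered)) - 1  # nearest-rank percentile
    return max(floor, min(ordered[idx], ceiling))
```

Re-run this periodically: as the Monitoring section notes, task types drift, and a budget set once can quietly become too small.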
Mistake 2: Not adjusting max_tokens. If max_tokens is set lower than the sum of thinking tokens and output tokens, the API will return an error or a truncated response. Always set max_tokens generously when extended thinking is enabled; a good default is budget_tokens + 8,000.
Mistake 3: Applying it to classification tasks. Binary classification, sentiment analysis, and simple extraction tasks see negligible quality improvement from extended thinking at significant cost. These tasks are constrained by the information in the input, not by reasoning depth. Reserve extended thinking for genuinely open-ended reasoning.
Mistake 4: Ignoring the thinking content in debugging. When extended thinking produces a wrong answer, the thinking trace almost always shows exactly where the reasoning went wrong. Reading the thinking output is the fastest way to diagnose prompt issues and improve your system. If you are not logging thinking content during development, you are debugging blind.
If you are designing an extended thinking architecture and want a review before going to production, book a free call with our Claude Certified Architects. We have seen the full range of production patterns and can shortcut your evaluation cycle considerably.
Extended thinking allocates a dedicated reasoning budget before the final response, improving accuracy on complex, multi-constraint tasks by 20–35%. Set budget_tokens based on task complexity, not uniformly. Thinking tokens are billed as output tokens, so implement routing logic and prompt caching to control costs. Expose the thinking trace for high-trust professional applications; suppress it for consumer-facing chat. Use the Batch API for non-real-time extended thinking workloads to capture a 50% cost reduction.