Every rate limit Anthropic enforces, by model and tier — with real throughput numbers, what happens when you hit limits, and the architectural patterns that let production applications scale past them.
Anthropic sets rate limits by API usage tier, and within each tier the exact figures vary by model. Limits are measured in three dimensions: requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM). Moving up a tier raises all three.
| API Tier | Qualification | RPM (claude-sonnet) | Input TPM | Output TPM | Daily Token Limit |
|---|---|---|---|---|---|
| Free Tier (new API accounts, no spend) | New account creation | 5 RPM | 25K ITPM | 5K OTPM | 300K / day |
| Build Tier 1 (after first $5 spend) | $5 total API spend | 50 RPM | 50K ITPM | 10K OTPM | 1M / day |
| Build Tier 2 (30 days + $50 spend) | 30-day account + $50 spend | 1,000 RPM | 80K ITPM | 16K OTPM | 10M / day |
| Build Tier 3 (90 days + $500 spend) | 90-day account + $500 spend | 2,000 RPM | 160K ITPM | 32K OTPM | 30M / day |
| Build Tier 4 (after $5,000 spend) | $5,000 cumulative API spend | 4,000 RPM | 400K ITPM | 80K OTPM | 300M / day |
| Scale / Enterprise (negotiated) | Custom commercial agreement | Custom (10K+ RPM) | Custom | Custom | Effectively unlimited |
Note: RPM figures are approximate, and Anthropic adjusts limits based on account standing, model selection, and capacity. Always check your actual limits in the Anthropic Console or in the API response headers (for example, anthropic-ratelimit-requests-limit).
Different models have different per-model rate limits within the same API tier. Haiku has higher RPM allowances than Sonnet, which is higher than Opus. This reflects compute cost and capacity allocation.
| Model | RPM (Tier 2) | Input TPM | Output TPM | Max Context / Call | Max Output / Call |
|---|---|---|---|---|---|
| claude-haiku-4-5 | 1,000 RPM | 800K ITPM | 80K OTPM | 200,000 tokens | 8,192 tokens |
| claude-sonnet-4-6 | 1,000 RPM | 80K ITPM | 16K OTPM | 200,000 tokens | 8,192 tokens |
| claude-opus-4-6 | 500 RPM | 40K ITPM | 8K OTPM | 200,000 tokens | 8,192 tokens |
| claude-opus-4-6 (Extended Thinking) | 200 RPM | 40K ITPM | 8K OTPM | 200,000 tokens | 64,000 tokens (incl. thinking) |
| Batch API (any model) | 100 batches / day | Unconstrained by RPM | N/A | 200,000 tokens / request | 8,192 tokens |
Rate limits on claude.ai are measured in "messages" or "usage credits" rather than tokens. The limits reset every 8 hours on Pro and Max plans. Enterprise has no usage limits.
| Plan | Daily Messages (approx.) | Reset Period | Opus 4.6 Access | Extended Thinking |
|---|---|---|---|---|
| Claude Free | ~20-30 / day (varies by model & length) | Daily reset | Limited | No |
| Claude Pro ($20/mo) | ~5× Free (priority queue access) | 8 hours | Yes | Yes (limited) |
| Claude Max ($100/mo) | ~20× Free (full Opus priority) | 8 hours | Yes (priority) | Yes |
| Claude Team ($30/user) | ~5× Free per seat | 8 hours | Yes | Yes |
| Claude Enterprise | Unlimited | N/A | Yes | Yes |
Message counts are approximate. Anthropic does not publish exact message limits publicly — they depend on message length and model complexity. Longer messages with large file attachments consume more "credits" than short text queries.
Hitting rate limits in production is an architecture problem, not just a quota problem. These patterns eliminate bottlenecks without needing to upgrade tiers.
Never send requests directly to the Claude API from your frontend or synchronous handlers. Use a queue (Redis, SQS, RabbitMQ) with a worker pool that respects RPM limits. When the queue is full, apply backpressure upstream. This absorbs traffic spikes without hitting rate limit errors.
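The queue-and-worker pattern above can be sketched with the Python standard library alone. This is a minimal in-process illustration, not a production design: `SlidingWindowLimiter`, `start_workers`, and `submit` are hypothetical names, and a real deployment would more likely put Redis, SQS, or RabbitMQ where `queue.Queue` sits here.

```python
import queue
import threading
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` acquisitions per `period` seconds."""

    def __init__(self, max_calls: int, period: float = 60.0):
        self.max_calls = max_calls
        self.period = period
        self._stamps = deque()
        self._lock = threading.Lock()

    def acquire(self) -> None:
        """Block until a request slot is free, then claim it."""
        while True:
            with self._lock:
                now = time.monotonic()
                # Drop timestamps that have aged out of the window.
                while self._stamps and now - self._stamps[0] >= self.period:
                    self._stamps.popleft()
                if len(self._stamps) < self.max_calls:
                    self._stamps.append(now)
                    return
                wait = self.period - (now - self._stamps[0])
            time.sleep(max(wait, 0.001))


def start_workers(jobs: queue.Queue, limiter, handler, n_workers: int = 4):
    """Start daemon workers that pull jobs and call `handler` under the limiter."""

    def worker():
        while True:
            job = jobs.get()
            try:
                limiter.acquire()
                handler(job)
            except Exception as exc:
                # Keep the worker alive; real code would log and retry.
                print(f"job failed: {exc!r}")
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    return threads


def submit(jobs: queue.Queue, job) -> bool:
    """Enqueue without blocking; False tells the caller to apply backpressure."""
    try:
        jobs.put_nowait(job)
        return True
    except queue.Full:
        return False
```

A web handler would call `submit(jobs, request)` and return a "busy, try again" response when it gets `False`, while `handler` is where the actual Claude API call would go. Setting `max_calls` slightly below the account's RPM leaves headroom for other clients sharing the same quota.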
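A 429 (Too Many Requests) response means you've hit a rate limit. Don't retry immediately; use exponential backoff with jitter. Start at 1 second, double the delay on each retry, add 0-1 seconds of random jitter, and cap the delay at 60 seconds. If the response carries a Retry-After header, wait at least that long. The jitter spreads retries out so that many clients throttled at the same moment do not all retry in lockstep (the thundering herd problem).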
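The backoff schedule described above might look like the following sketch. `RateLimited` is a hypothetical stand-in for whatever exception your HTTP client or SDK raises on a 429, and `call_with_retries` is an illustrative helper rather than part of any library. The 60-second cap is applied to the exponential term, so the total delay can reach 61 seconds once jitter is added.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter: float = 1.0) -> float:
    """Delay before retry `attempt` (0-based): capped exponential plus random jitter."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0.0, jitter)


class RateLimited(Exception):
    """Hypothetical stand-in for a 429 error raised by your API client."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, if the server supplied it


def call_with_retries(fn, max_retries: int = 5, sleep=time.sleep):
    """Call `fn`, retrying on RateLimited with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited as exc:
            if attempt == max_retries:
                raise
            delay = backoff_delay(attempt)
            # Never retry sooner than the server asked us to.
            if exc.retry_after is not None:
                delay = max(delay, exc.retry_after)
            sleep(delay)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. In production the default `time.sleep` applies.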
Route high-priority, real-time requests to Sonnet. Route bulk, non-urgent tasks to Haiku (higher RPM budget). Route complex reasoning tasks to Opus but queue them aggressively. Each model has separate rate limit buckets — model routing is effectively tier expansion without tier upgrade. See our model selection guide.
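As a rough illustration of this routing rule, the sketch below maps a task category to a model identifier and a flag saying whether the request should go through a queue first. The `TaskKind` categories and the `route` function are assumptions made for the example; the model names are the ones listed in the table above.

```python
from enum import Enum


class TaskKind(Enum):
    REALTIME = "realtime"  # user is waiting on the answer
    BULK = "bulk"          # non-urgent, high volume
    COMPLEX = "complex"    # needs the strongest reasoning


# Model identifiers as listed in the per-model table above.
MODEL_FOR_TASK = {
    TaskKind.REALTIME: "claude-sonnet-4-6",
    TaskKind.BULK: "claude-haiku-4-5",
    TaskKind.COMPLEX: "claude-opus-4-6",
}


def route(kind: TaskKind) -> tuple[str, bool]:
    """Return (model, must_queue) for a task category."""
    # Opus has the smallest RPM budget, so its requests always go via the queue.
    must_queue = kind is TaskKind.COMPLEX
    return MODEL_FOR_TASK[kind], must_queue
```

Because each model draws on its own bucket, a burst of `BULK` traffic sent to Haiku does not consume the Sonnet quota that `REALTIME` requests depend on.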
The Batch API is not constrained by your per-minute request limit: it runs asynchronously, with a turnaround of up to 24 hours. Any workload that doesn't need an immediate response (nightly reports, document processing, data enrichment) should use batch. This moves load off your real-time RPM quota entirely.
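A batch submission is a list of ordinary message requests, each tagged with a caller-chosen ID. The helper below only builds that payload, so it runs without network access. `build_batch_requests` is a hypothetical name, and the commented-out submission call reflects our understanding of the Anthropic Python SDK; verify it against the SDK version you use.

```python
def build_batch_requests(prompts: dict[str, str],
                         model: str = "claude-haiku-4-5",
                         max_tokens: int = 1024) -> list[dict]:
    """Turn {custom_id: prompt} into the request list a batch submission expects."""
    return [
        {
            "custom_id": custom_id,  # used to match results back to inputs
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for custom_id, text in prompts.items()
    ]


# Submitting requires the anthropic package and an API key (not run here):
#   import anthropic
#   client = anthropic.Anthropic()
#   batch = client.messages.batches.create(requests=build_batch_requests(prompts))
#   print(batch.id, batch.processing_status)
```

Results come back keyed by `custom_id`, so choose IDs that map directly onto your own records, such as a document or row identifier.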
Identical prompts can be answered from a cache instead of a new API call. Cache responses using a hash of the input prompt as the cache key, normalising whitespace and case first if near-identical prompts should share an entry. Set a TTL of 1-24 hours depending on how time-sensitive the content is. For FAQ-type applications, 80%+ of queries may be servable from cache, which dramatically reduces live API calls.
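One way to sketch this cache is an in-memory dictionary keyed by a SHA-256 hash of the model name plus the normalised prompt. `ResponseCache` and `cached_call` are illustrative names, and the normalisation rule (collapse whitespace, lowercase) is an assumption that only suits applications where such variants really should share an answer.

```python
import hashlib
import time


class ResponseCache:
    """In-memory prompt cache with per-entry expiry (a sketch, not production code)."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, response_text)

    @staticmethod
    def make_key(model: str, prompt: str) -> str:
        # Collapse whitespace and case so trivially different prompts share a key.
        normalised = " ".join(prompt.split()).lower()
        return hashlib.sha256(f"{model}\n{normalised}".encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str):
        key = self.make_key(model, prompt)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, text = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return text

    def put(self, model: str, prompt: str, text: str) -> None:
        self._store[self.make_key(model, prompt)] = (self.clock() + self.ttl, text)


def cached_call(cache: ResponseCache, model: str, prompt: str, call_api) -> str:
    """Return a cached response if present, otherwise call the API and store it."""
    hit = cache.get(model, prompt)
    if hit is not None:
        return hit
    text = call_api(model, prompt)
    cache.put(model, prompt, text)
    return text
```

Including the model name in the key prevents a Haiku answer from being served for a request that was meant for Opus. A shared store such as Redis would replace the dictionary once more than one worker process is involved.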
ITPM (input tokens per minute) is often the first limit hit, not RPM. Enforce a per-request token budget: cap system prompts at a maximum size, limit context retrieved from RAG, truncate conversation history beyond N turns. This lets you serve more requests within the same ITPM allocation. See our token management guide.
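A per-request token budget for conversation history could be enforced along these lines. The token estimate is a deliberately crude assumption (about four characters per token), not a real tokenizer, and `trim_history` is a hypothetical helper.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic, assumed here: about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], max_turns: int, max_tokens: int) -> list[dict]:
    """Keep the most recent messages that fit both a turn cap and a token budget."""
    if max_turns <= 0:
        return []
    kept = []
    used = 0
    # Walk backwards from the newest message so the oldest are dropped first.
    for message in reversed(messages[-max_turns:]):
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))
```

After trimming, check that the first remaining message still has the `user` role, since the cut can land in the middle of an exchange. The same budget idea applies to system prompts and retrieved RAG context.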
The Claude API returns rate limit status in response headers on every call. Read these to implement proactive throttling before you hit a 429 error.
```python
# Python example: read rate limit headers from a Claude API response
import time
from datetime import datetime, timezone

import anthropic

client = anthropic.Anthropic()


def call_with_ratelimit_awareness(prompt: str) -> str:
    # with_raw_response exposes the HTTP headers alongside the parsed message
    raw = client.messages.with_raw_response.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    response = raw.parse()

    # Extract rate limit info from response headers
    headers = raw.headers
    requests_remaining = int(headers.get("anthropic-ratelimit-requests-remaining", 999))
    tokens_remaining = int(headers.get("anthropic-ratelimit-tokens-remaining", 999999))
    # The reset header is an RFC 3339 timestamp, not a duration
    reset_at = headers.get("anthropic-ratelimit-requests-reset")

    # Proactive throttling: if fewer than 5 requests remain, sleep until reset
    if requests_remaining < 5 and reset_at:
        reset_time = datetime.fromisoformat(reset_at.replace("Z", "+00:00"))
        sleep_seconds = max(0.0, (reset_time - datetime.now(timezone.utc)).total_seconds())
        print(f"Low quota ({requests_remaining} req remaining). Sleeping {sleep_seconds:.1f}s")
        time.sleep(sleep_seconds)

    return response.content[0].text
```
When you exceed a rate limit, the Claude API returns an HTTP 429 Too Many Requests error with a JSON body indicating which limit was exceeded. The Retry-After header specifies how many seconds to wait before retrying.
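To make the error shape concrete, the sketch below parses a 429 response into the fields a retry loop needs. The body layout (`type: "error"` wrapping an `error` object of type `rate_limit_error`) follows our understanding of the API's error format. The message text is illustrative only, and `parse_rate_limit_error` is a hypothetical helper.

```python
import json


def parse_rate_limit_error(status: int, headers: dict, body: str):
    """Return error details for a 429 response, or None for any other status."""
    if status != 429:
        return None
    # Header names are case-insensitive, so normalise before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    error = json.loads(body).get("error", {})
    retry_after = lowered.get("retry-after")
    return {
        "type": error.get("type"),
        "message": error.get("message"),
        "retry_after_seconds": float(retry_after) if retry_after is not None else None,
    }


# Illustrative response only; the real message text will differ.
example_body = json.dumps({
    "type": "error",
    "error": {"type": "rate_limit_error", "message": "Rate limit exceeded."},
})
details = parse_rate_limit_error(429, {"Retry-After": "12"}, example_body)
```

The `message` field is where the API says which limit was exceeded, so log it alongside `retry_after_seconds` rather than discarding it.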
Hitting rate limits in production is a reliability issue — users see errors or long delays. The correct response is not to immediately upgrade tiers but to first determine which limit you're hitting (RPM, ITPM, or OTPM), then apply architectural patterns to reduce pressure on that specific constraint before upgrading.
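A quick calculation shows which limit binds for a given workload. The sketch below converts each limit into a ceiling on requests per minute and picks the lowest. `binding_limit` is a hypothetical helper, and the average token counts are inputs you would measure from your own traffic.

```python
def binding_limit(rpm: float, itpm: float, otpm: float,
                  avg_input_tokens: float, avg_output_tokens: float):
    """Return (limit_name, max_requests_per_minute) for the tightest constraint."""
    ceilings = {
        "RPM": float(rpm),
        "ITPM": itpm / avg_input_tokens,   # requests/min the input budget allows
        "OTPM": otpm / avg_output_tokens,  # requests/min the output budget allows
    }
    name = min(ceilings, key=ceilings.get)
    return name, ceilings[name]


# Tier 2 Sonnet figures from the table above, with 2,000-token prompts
# and 300-token replies: the input budget allows 80,000 / 2,000 = 40 requests/min.
print(binding_limit(1000, 80_000, 16_000, 2000, 300))  # ('ITPM', 40.0)
```

In this example the 1,000 RPM allowance is irrelevant: throughput stops at 40 requests per minute, so trimming input tokens helps while request-level queuing alone does not.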
For high-scale deployments, our scaling and rate limiting guide covers production patterns in detail, including multi-account strategies (with Anthropic's permission), regional routing, and enterprise-tier negotiation for unlimited throughput.
Enterprise note: Anthropic's Scale and Enterprise tiers remove almost all practical rate limit constraints. If your organisation's use case requires consistent 500+ RPM on Sonnet, the commercial case for Enterprise is straightforward — the per-seat cost is offset by the operational overhead of managing rate limit architecture at scale. Contact our team for a Claude Enterprise cost modelling session.