⚡ Reference Guide

Claude Rate Limits by Plan: API Quotas, Messages & Token Limits

Every rate limit Anthropic enforces, by model and tier — with real throughput numbers, what happens when you hit limits, and the architectural patterns that let production applications scale past them.

📅 Updated February 2026 ⏱ 10 min read 🏷 Reference · API · Architecture

Claude API Rate Limits by Tier — March 2026

Anthropic assigns rate limits by API usage tier, and within a tier the limits vary by model. Limits are measured in three dimensions: requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM). Upgrading tiers increases all three.

| API Tier | Qualification | RPM (claude-sonnet) | Input TPM | Output TPM | Daily Token Limit |
|---|---|---|---|---|---|
| Free Tier | New API account, no spend | 5 RPM | 25K ITPM | 5K OTPM | 300K / day |
| Build Tier 1 | $5 total API spend | 50 RPM | 50K ITPM | 10K OTPM | 1M / day |
| Build Tier 2 | 30-day account + $50 spend | 1,000 RPM | 80K ITPM | 16K OTPM | 10M / day |
| Build Tier 3 | 90-day account + $500 spend | 2,000 RPM | 160K ITPM | 32K OTPM | 30M / day |
| Build Tier 4 | $5,000 cumulative API spend | 4,000 RPM | 400K ITPM | 80K OTPM | 300M / day |
| Scale / Enterprise | Custom commercial agreement (negotiated) | Custom (10K+ RPM) | Custom | Custom | Effectively unlimited |

Note: RPM figures are approximate; Anthropic adjusts limits based on account standing, model selection, and capacity. Always check your actual limits in the Anthropic Console or in the API response headers (for example, anthropic-ratelimit-requests-limit).
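Because the three limits apply at the same time, the lowest one sets your real throughput. The sketch below works through that arithmetic using the Tier 2 Sonnet figures from the table above; the per-request token counts (2,000 in, 500 out) are illustrative assumptions, not measurements.

```python
def effective_rpm(rpm: int, itpm: int, otpm: int,
                  input_tokens: int, output_tokens: int) -> int:
    """Requests per minute you can actually sustain, given average request size.

    Each limit caps throughput independently; the smallest cap wins.
    """
    by_input = itpm // input_tokens      # requests the input-token budget allows
    by_output = otpm // output_tokens    # requests the output-token budget allows
    return min(rpm, by_input, by_output)


# Tier 2 Sonnet figures from the table; request sizes are assumed for illustration.
sustained = effective_rpm(rpm=1_000, itpm=80_000, otpm=16_000,
                          input_tokens=2_000, output_tokens=500)
print(sustained)
```

With these assumed request sizes the token budgets, not the 1,000 RPM figure, are the binding constraint.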

Rate Limits by Model (Build Tier 2 Example)

Different models have different rate limits within the same API tier. In the Tier 2 example below, Haiku gets the largest token-per-minute allowance, followed by Sonnet and then Opus; Haiku and Sonnet share the same RPM, while Opus gets a lower one. This reflects compute cost and capacity allocation.

| Model | RPM (Tier 2) | Input TPM | Output TPM | Max Context / Call | Max Output / Call |
|---|---|---|---|---|---|
| claude-haiku-4-5 | 1,000 RPM | 800K ITPM | 80K OTPM | 200,000 tokens | 8,192 tokens |
| claude-sonnet-4-6 | 1,000 RPM | 80K ITPM | 16K OTPM | 200,000 tokens | 8,192 tokens |
| claude-opus-4-6 | 500 RPM | 40K ITPM | 8K OTPM | 200,000 tokens | 8,192 tokens |
| claude-opus-4-6 (Extended Thinking) | 200 RPM | 40K ITPM | 8K OTPM | 200,000 tokens | 64,000 tokens (incl. thinking) |
| Batch API (any model) | 100 batches / day | Unconstrained by RPM | N/A | 200,000 tokens / request | 8,192 tokens |

claude.ai Subscription Message Limits

Rate limits on claude.ai are measured in "messages" or "usage credits" rather than tokens. The limits reset every 8 hours on Pro and Max plans. Enterprise has no usage limits.

| Plan | Daily Messages (approx.) | Reset Period | Opus 4.6 Access | Extended Thinking |
|---|---|---|---|---|
| Claude Free | ~20-30 / day (varies by model & length) | Daily reset | Limited | No |
| Claude Pro ($20/mo) | ~5× Free (priority queue access) | 8 hours | Yes | Yes (limited) |
| Claude Max ($100/mo) | ~20× Free (full Opus priority) | 8 hours | Yes (priority) | Yes |
| Claude Team ($30/user) | ~5× Free per seat | 8 hours | Yes | Yes |
| Claude Enterprise | Unlimited | N/A | Yes | Yes |

Message counts are approximate. Anthropic does not publish exact message limits publicly — they depend on message length and model complexity. Longer messages with large file attachments consume more "credits" than short text queries.

6 Ways to Architect Around Rate Limits

Hitting rate limits in production is an architecture problem, not just a quota problem. These patterns relieve the most common bottlenecks, often without needing to upgrade tiers.

01

Request Queuing with Backpressure

Never send requests directly to the Claude API from your frontend or synchronous handlers. Use a queue (Redis, SQS, RabbitMQ) with a worker pool that respects RPM limits. When the queue is full, apply backpressure upstream. This absorbs traffic spikes without hitting rate limit errors.
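A minimal in-process sketch of this pattern follows, using only the standard library. It stands in for a real broker such as Redis or SQS; the class name `RateLimitedWorkerPool`, its parameters, and the `handler` callable (which would wrap your actual API call) are all hypothetical names chosen for illustration.

```python
import queue
import threading
import time
from typing import Callable


class RateLimitedWorkerPool:
    """Bounded queue drained by workers that together stay under an RPM budget."""

    def __init__(self, handler: Callable[[str], None], rpm: int,
                 max_queue: int = 100, workers: int = 2) -> None:
        self._handler = handler                 # hypothetical: wraps the real API call
        self._interval = 60.0 / rpm             # minimum spacing between dispatches
        self._jobs: "queue.Queue[str]" = queue.Queue(maxsize=max_queue)
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def submit(self, prompt: str) -> bool:
        """Return False instead of blocking when the queue is full (backpressure)."""
        try:
            self._jobs.put_nowait(prompt)
            return True
        except queue.Full:
            return False

    def _wait_for_slot(self) -> None:
        # Reserve the next dispatch slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._interval
        time.sleep(max(0.0, slot - now))

    def _run(self) -> None:
        while True:
            prompt = self._jobs.get()
            self._wait_for_slot()
            try:
                self._handler(prompt)
            except Exception:
                pass  # a real worker would log the failure or requeue the job
            finally:
                self._jobs.task_done()

    def drain(self) -> None:
        """Block until every queued job has been handled."""
        self._jobs.join()
```

When `submit` returns False, the caller decides what backpressure means for it: reject the request with a "try again later" response, shed low-priority work, or slow the upstream producer.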

02

Exponential Backoff on 429s

A 429 (Too Many Requests) response means you've hit a rate limit. Don't retry immediately; use exponential backoff with jitter. Start at 1 second, double the delay on each retry, add 0-1 second of random jitter, and cap the delay at 60 seconds. The jitter keeps many clients from retrying in lockstep and re-triggering the limit together (the thundering herd problem).
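The schedule described above can be sketched as follows. `RateLimited` is a hypothetical stand-in for whatever your client raises on a 429 (with the official Python SDK, that would be `anthropic.RateLimitError`), and applying the 60-second cap after adding jitter is one reasonable reading of the rule, not the only one.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for the exception your client raises on HTTP 429."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry `attempt` (0-based): base * 2^attempt plus 0-1 s jitter, capped."""
    jitter = random.uniform(0.0, 1.0)
    return min(cap, base * (2 ** attempt) + jitter)


def call_with_backoff(fn: Callable[[], T], max_retries: int = 6,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, retrying on RateLimited with exponential backoff and jitter."""
    attempt = 0
    while True:
        try:
            return fn()
        except RateLimited:
            if attempt >= max_retries:
                raise                      # give up and surface the error to the caller
            sleep(backoff_delay(attempt))
            attempt += 1
```

The `sleep` parameter is injectable so the retry logic can be tested without real waiting.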

03

Model Routing by Priority

Route high-priority, real-time requests to Sonnet. Route bulk, non-urgent tasks to Haiku (larger token-per-minute budget). Route complex reasoning tasks to Opus but queue them aggressively. Each model has separate rate limit buckets, so model routing is effectively tier expansion without a tier upgrade. See our model selection guide.
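As a rough sketch, the routing rule above fits in one small function. The `Task`, `Priority`, and `Route` types are hypothetical, and the model identifiers are the ones used in this guide's tables.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    REALTIME = "realtime"
    BULK = "bulk"


@dataclass(frozen=True)
class Task:
    prompt: str
    priority: Priority
    complex_reasoning: bool = False


@dataclass(frozen=True)
class Route:
    model: str
    queued: bool  # True: send through the background queue rather than inline


def route(task: Task) -> Route:
    """Pick a model (and therefore a rate limit bucket) for a task."""
    if task.complex_reasoning:
        # Opus has the smallest budget, so it is always queued.
        return Route("claude-opus-4-6", queued=True)
    if task.priority is Priority.REALTIME:
        return Route("claude-sonnet-4-6", queued=False)
    # Bulk, non-urgent work goes to the bucket with the most headroom.
    return Route("claude-haiku-4-5", queued=True)
```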

04

Batch API for Non-Real-Time Workloads

The Batch API is not bound by your RPM limit: it runs asynchronously, with results returned within 24 hours. Any workload that doesn't need an immediate response (nightly reports, document processing, data enrichment) should use batch. This moves load off your real-time RPM quota entirely.
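A hedged sketch of moving a document-summarization job to batch is shown below. `build_batch_requests` and `submit_batch` are illustrative helper names, the summarization prompt is an assumption, and `submit_batch` assumes the official `anthropic` Python SDK is installed with an API key in the environment.

```python
from typing import Any, Dict, List


def build_batch_requests(documents: List[str],
                         model: str = "claude-haiku-4-5",
                         max_tokens: int = 1024) -> List[Dict[str, Any]]:
    """Turn a list of documents into Batch API request entries.

    Each entry needs a unique custom_id so results can be matched back later.
    """
    return [
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [
                    {"role": "user", "content": f"Summarize this document:\n\n{doc}"}
                ],
            },
        }
        for i, doc in enumerate(documents)
    ]


def submit_batch(requests: List[Dict[str, Any]]) -> str:
    """Submit the batch and return its id for later polling."""
    import anthropic  # imported lazily so the builder above has no dependencies

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    batch = client.messages.batches.create(requests=requests)
    return batch.id
```

Store the returned batch id alongside your job record; results are fetched later once the batch has finished processing.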

05

Response Caching for Repeated Queries

Identical prompts, and near-identical ones once normalized (whitespace, casing), can usually be served the same answer. Cache responses with a hash of the normalized input prompt as the cache key. Use a TTL of 1-24 hours, depending on how time-sensitive the content is. For FAQ-type applications, 80%+ of queries may be servable from cache, dramatically reducing live API calls.
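One way to sketch this is an in-memory cache keyed by a SHA-256 hash of the normalized prompt. The `TTLCache` class and `cache_key` helper are hypothetical, the normalization (lowercasing and collapsing whitespace) is a deliberately simple assumption, and a production system would more likely use Redis or a similar shared store.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


def cache_key(model: str, prompt: str) -> str:
    """Hash of the model name plus the prompt, lowercased with whitespace collapsed."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()


class TTLCache:
    """Tiny in-memory response cache with a fixed time-to-live."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, response)

    def get_or_call(self, model: str, prompt: str,
                    call: Callable[[str], str]) -> str:
        """Return a cached response if still fresh, else call the model and cache it."""
        key = cache_key(model, prompt)
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        response = call(prompt)          # the only path that spends API quota
        self._store[key] = (now + self._ttl, response)
        return response
```

Including the model name in the key keeps answers from one model from being served for another.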

06

Token Budget Enforcement

ITPM (input tokens per minute) is often the first limit hit, not RPM. Enforce a per-request token budget: cap system prompts at a maximum size, limit context retrieved from RAG, truncate conversation history beyond N turns. This lets you serve more requests within the same ITPM allocation. See our token management guide.
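The budget rules above might look like this in practice. This is a sketch under loud assumptions: `estimate_tokens` uses a crude 4-characters-per-token heuristic rather than a real tokenizer, a "turn" is treated as a single message, and retrieved chunks are assumed to be sorted most-relevant first.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]  # {"role": ..., "content": ...}


def estimate_tokens(text: str) -> int:
    """Rough estimate only: assumes about 4 characters per token."""
    return (len(text) + 3) // 4


def fit_to_budget(system: str, history: List[Message], chunks: List[str],
                  budget: int, max_messages: int = 6) -> Tuple[List[Message], List[str]]:
    """Trim history and retrieved context so the request stays within `budget` tokens."""
    # Keep only the most recent messages.
    recent = history[-max_messages:] if max_messages > 0 else []
    used = estimate_tokens(system) + sum(estimate_tokens(m["content"]) for m in recent)

    # Add retrieved chunks in relevance order until the budget would be exceeded.
    kept: List[str] = []
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return recent, kept
```

For exact counts, swap the heuristic for a real token count from your SDK or tokenizer before relying on the budget.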

Reading Rate Limit Headers in Python

The Claude API returns rate limit status in response headers on every call. Read these to implement proactive throttling before you hit a 429 error.

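# Python example: read rate limit headers from a Claude API response
import time
from datetime import datetime, timezone

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def call_with_ratelimit_awareness(prompt: str) -> str:
    # with_raw_response exposes the HTTP headers; .parse() returns the usual Message
    raw = client.messages.with_raw_response.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    message = raw.parse()

    # Extract rate limit info from response headers
    headers = raw.headers
    requests_limit = int(headers.get("anthropic-ratelimit-requests-limit", 0))
    requests_remaining = int(headers.get("anthropic-ratelimit-requests-remaining", requests_limit))
    tokens_remaining = headers.get("anthropic-ratelimit-tokens-remaining", "unknown")
    reset_at = headers.get("anthropic-ratelimit-requests-reset")  # RFC 3339 timestamp

    # Proactive throttling: if <10% of the request quota remains, sleep until reset
    if requests_limit and reset_at and requests_remaining < 0.1 * requests_limit:
        reset_time = datetime.fromisoformat(reset_at.replace("Z", "+00:00"))
        sleep_seconds = max(0.0, (reset_time - datetime.now(timezone.utc)).total_seconds())
        print(f"Low quota ({requests_remaining} req, {tokens_remaining} tokens remaining). "
              f"Sleeping {sleep_seconds:.1f}s")
        time.sleep(sleep_seconds)

    return message.content[0].text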

What Happens When You Hit a Rate Limit?

When you exceed a rate limit, the Claude API returns an HTTP 429 Too Many Requests error with a JSON body indicating which limit was exceeded. The Retry-After header specifies how many seconds to wait before retrying.
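A small helper for honoring that header might look like the sketch below. The function name, the 1-second default, and the 60-second cap are assumptions; the helper takes any string-keyed mapping, so with a plain dict the key must be lowercase as shown.

```python
from typing import Mapping


def retry_after_seconds(headers: Mapping[str, str],
                        default: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retrying, based on the retry-after header.

    Falls back to `default` when the header is missing or not a number,
    and never returns more than `cap`.
    """
    raw = headers.get("retry-after")
    try:
        value = float(raw) if raw is not None else default
    except ValueError:
        value = default
    return min(cap, max(0.0, value))
```

A retry loop can use this value as its first delay and fall back to exponential backoff if later attempts also fail.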

Hitting rate limits in production is a reliability issue — users see errors or long delays. The correct response is not to immediately upgrade tiers but to first determine which limit you're hitting (RPM, ITPM, or OTPM), then apply architectural patterns to reduce pressure on that specific constraint before upgrading.
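To find out which limit you are closest to, you can compare the remaining and limit values reported in the response headers. The sketch below assumes the `anthropic-ratelimit-*` header family and takes the headers as a plain mapping; `tightest_limit` is a hypothetical helper name.

```python
from typing import Dict, Mapping, Optional

# Label -> header name fragment for each rate limit dimension.
DIMENSIONS = {
    "RPM": "requests",
    "ITPM": "input-tokens",
    "OTPM": "output-tokens",
}


def tightest_limit(headers: Mapping[str, str]) -> Optional[str]:
    """Return the dimension with the smallest fraction of quota remaining."""
    ratios: Dict[str, float] = {}
    for label, name in DIMENSIONS.items():
        limit = headers.get(f"anthropic-ratelimit-{name}-limit")
        remaining = headers.get(f"anthropic-ratelimit-{name}-remaining")
        if limit is None or remaining is None or int(limit) == 0:
            continue  # this dimension was not reported on the response
        ratios[label] = int(remaining) / int(limit)
    if not ratios:
        return None
    return min(ratios, key=lambda label: ratios[label])
```

Logging this value per request over a day of traffic shows which constraint to target first: queuing for RPM, token budgets and caching for ITPM, or lower `max_tokens` for OTPM.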

For high-scale deployments, our scaling and rate limiting guide covers production patterns in detail, including multi-account strategies (with Anthropic's permission), regional routing, and enterprise-tier negotiation for unlimited throughput.

Enterprise note: Anthropic's Scale and Enterprise tiers remove almost all practical rate limit constraints. If your organisation's use case requires consistent 500+ RPM on Sonnet, the commercial case for Enterprise is straightforward: the added cost is offset by no longer carrying the operational overhead of managing rate limit architecture at scale. Contact our team for a Claude Enterprise cost modelling session.

