
Domain 1: Claude API & Application Architecture

~20% of the CCA exam (approximately 12 questions)

Domain 1 of the Claude Certified Architect (CCA) exam tests your ability to reason about Claude as a production API, not just a chat interface. Questions in this domain are architectural in nature: given a set of requirements, which API features, patterns, and configurations produce the correct result? You will not be asked to recall syntax. You will be asked to think like an architect.

This study guide covers every testable concept in Domain 1, including the model tier decision framework, token economics, streaming architecture, prompt caching mechanics, structured output patterns, error handling, and the principles of production-grade API design. Pair this with the 50 CCA practice questions (10 of which target this domain) and the complete CCA exam guide for full preparation.

Model Selection: The Tiered Architecture Decision

The CCA exam treats model selection as an architecture decision, not a preference. The three Claude model families (Opus, Sonnet, and Haiku) are not interchangeable. They represent different points on the cost-quality-speed tradeoff curve, and choosing correctly requires understanding both the workload characteristics and the business constraints.

Claude Opus 4 is the flagship reasoning model. It excels at tasks requiring multi-step reasoning, nuanced judgment, strategic synthesis, and complex document analysis. It is the correct choice when output quality directly affects business outcomes and when errors are costly. It carries the highest token cost and highest latency of the three tiers.

Claude Sonnet 4.5 occupies the middle tier: strong reasoning with better throughput and lower cost than Opus. It is appropriate for customer-facing applications, content generation, code review, and moderate-complexity analysis. In most enterprise deployments, Sonnet handles the majority of workload.

Claude Haiku 4.5 is optimised for speed and cost at high volume. Classification tasks, routing, intent detection, simple summarisation, and extraction tasks with well-defined output formats are Haiku's wheelhouse. The performance delta between Haiku and Sonnet on well-specified simple tasks is minimal; the cost delta is significant at scale.

Exam Pattern: Tiered Orchestration

  • Questions will give you a system with multiple task types and ask you to assign the right model to each
  • The correct answer always matches model capability to task complexity and volume
  • Watch for "Opus for everything" distractors โ€” this is always wrong for production economics
  • Watch for "Haiku for everything" distractors โ€” this is wrong when tasks require genuine reasoning
Model | Best For | Avoid When | Cost Relative
Opus 4 | Complex reasoning, strategy, nuanced judgment | High-volume, simple tasks | Highest
Sonnet 4.5 | Most production use cases; customer-facing agents | Ultra-high-volume classification | Mid
Haiku 4.5 | Classification, routing, extraction, high volume | Complex reasoning required | Lowest
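The tiered-orchestration pattern above can be sketched as a simple task-to-model router. The task categories and the Opus/Haiku model ID strings here are illustrative assumptions, not exam-mandated values; only the Sonnet ID appears in this guide's examples.

```python
# Illustrative tier routing: map task types to model tiers.
# Task names and the Opus/Haiku model IDs are assumptions for this sketch.
TIER_BY_TASK = {
    "classification": "claude-haiku-4-5",
    "routing": "claude-haiku-4-5",
    "extraction": "claude-haiku-4-5",
    "customer_chat": "claude-sonnet-4-5",
    "code_review": "claude-sonnet-4-5",
    "strategic_analysis": "claude-opus-4",
}

def select_model(task_type: str) -> str:
    """Match model tier to task complexity; default to the mid tier."""
    return TIER_BY_TASK.get(task_type, "claude-sonnet-4-5")
```

Defaulting unknown task types to the mid tier reflects the guide's observation that Sonnet handles the majority of enterprise workload.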

Token Economics and Cost Architecture

A CCA candidate must understand how token costs accumulate in a production system, not just what tokens are. The exam will present scenarios where you must identify the most cost-effective architecture, and the answer is rarely "use a cheaper model."

Prompt Caching

Prompt caching is one of the highest-leverage cost reduction tools in the Claude API. When a request prefix (system prompt + shared context) is identical across multiple requests, that prefix can be cached server-side. Subsequent requests that hit the cache avoid the input processing cost of the cached portion and pay a lower cache-read token rate instead.

The conditions for effective prompt caching: the cacheable prefix must be at least 1,024 tokens (Haiku models require a larger minimum), it must be byte-for-byte identical across requests, and the cache_control parameter must mark the end of the cacheable prefix. The prefix is built in request order (tools, then system, then messages), so shared content must come first. The cache TTL is 5 minutes with standard caching, refreshed each time the cache is read, which is sufficient for high-throughput workloads. For the exam, know that prompt caching is most effective when many requests share a large, identical prefix: legal contract templates, product catalogues, knowledge bases.

Prompt Caching: API Structure
{
  "model": "claude-sonnet-4-5",
  "system": [
    {
      "type": "text",
      "text": "[Your 5,000-token system prompt here]",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "[Variable user input]"}
  ]
}
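To see why caching is high-leverage, compare input-token cost with and without it. The multipliers below (cache writes at 1.25x and cache reads at 0.1x the base input rate) are assumptions based on published pricing at the time of writing; verify current rates before relying on them.

```python
def cached_vs_uncached(prefix_tokens: int, variable_tokens: int,
                       requests: int, input_rate: float) -> tuple:
    """Compare input-token cost with and without prompt caching.

    Assumes cache writes cost 1.25x and cache reads 0.1x the base
    input rate -- illustrative multipliers, not guaranteed pricing.
    """
    uncached = (prefix_tokens + variable_tokens) * requests * input_rate
    cached = (prefix_tokens * 1.25 * input_rate                    # first request writes the cache
              + prefix_tokens * 0.1 * input_rate * (requests - 1)  # later requests read it
              + variable_tokens * requests * input_rate)           # variable suffix is always full price
    return uncached, cached
```

With a 5,000-token shared prefix across 100 requests, the cached cost is a small fraction of the uncached cost, which is why large shared prefixes are the headline use case.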

The Batch API

The Batch API processes requests asynchronously with up to 50% cost reduction compared to real-time API calls. It is designed for workloads that are not latency-sensitive: nightly report generation, bulk document processing, training data annotation. Batch requests are queued and completed within 24 hours. For the exam, understand that the Batch API is a cost optimisation for offline/async workloads, not a substitute for real-time streaming in user-facing applications.
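The decision rule and the cost benefit can be captured in two lines. This is a sketch of the exam logic, not an API call; the 50% figure is the "up to" discount cited in this guide.

```python
def should_use_batch(latency_sensitive: bool, tolerates_24h_turnaround: bool) -> bool:
    """Batch is appropriate only for async workloads that tolerate the queue."""
    return (not latency_sensitive) and tolerates_24h_turnaround

def batch_cost(realtime_cost: float) -> float:
    """Cost under the maximum 50% batch discount cited in this guide."""
    return realtime_cost * 0.5
```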

Output Token Optimisation

Output tokens are priced higher than input tokens. For workloads where output length is predictable and bounded, setting an appropriate max_tokens ceiling prevents unexpected cost spikes from verbose outputs. For structured extraction tasks, specifying a strict output format (via tool use or clear schema instructions) reduces token waste from explanatory prose that the application discards.
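A minimal sketch of the max_tokens ceiling idea: size the cap from the expected output length plus headroom, so a verbose response cannot blow past the cost budget. The 1.2 headroom factor is an illustrative assumption.

```python
def request_params(expected_output_tokens: int, headroom: float = 1.2) -> dict:
    """Build request parameters with a bounded max_tokens ceiling.

    The headroom factor is an assumption; tune it per workload.
    """
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": int(expected_output_tokens * headroom),
    }
```

If responses routinely hit the ceiling (finish_reason "max_tokens"), the cap is too tight or the task needs restructuring; a truncated response is a correctness problem, not just a cost one.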

Streaming Architecture

The CCA exam tests streaming at the architecture level: when to use it, how to handle it reliably, and what can go wrong. Streaming allows the application to receive and display Claude's response incrementally rather than waiting for the complete response. It significantly improves perceived responsiveness for long responses in user-facing applications.

The Claude streaming API uses Server-Sent Events (SSE). The event sequence for a successful stream is: message_start, one or more content_block_start / content_block_delta / content_block_stop groups, a message_delta carrying the final stop reason and usage, and finally message_stop (ping events may be interleaved and can be ignored). A stream that terminates without message_stop was interrupted and the response is incomplete.

Key Exam Concept: Stream Interruption Handling

  • Detect incompleteness by checking for the absence of message_stop
  • Implement retry with exponential backoff, not a fixed delay
  • Buffer streamed content server-side before presenting to downstream systems that require complete responses
  • Never retry a stream by resuming from where it left off; start a new request
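The completeness check above reduces to inspecting the final event. A minimal sketch, assuming the events have already been parsed into their type names:

```python
def stream_complete(event_types: list) -> bool:
    """A stream is complete only if its final event is message_stop.

    Anything else means the stream was interrupted; the buffered
    content is incomplete and a fresh request must be issued
    (never a resume of the old stream).
    """
    return bool(event_types) and event_types[-1] == "message_stop"
```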

Error Handling in Production

Production Claude API integrations must handle errors gracefully. The CCA exam tests error handling at the architecture level: what the correct response is to each error type, not how to write the code.

429 (Rate Limit): The application has exceeded its rate limit. Implement exponential backoff with jitter. Consider a request queue with configurable concurrency limits. For critical applications, request rate limit increases from Anthropic ahead of anticipated load spikes.

529 (Overloaded): Anthropic's infrastructure is under load. Treat identically to 429: exponential backoff with jitter. This is a temporary condition.

401 (Unauthorized): Invalid or missing API key. This requires human intervention; do not retry. Alert operations and surface a meaningful error to the application.

400 (Bad Request): Invalid request structure. This is a developer error; check the request construction logic. Do not retry without fixing the root cause.

500/503 (Server Errors): Retry with exponential backoff. These are transient server-side issues.

Exam Rule: Error Response Patterns

  • Transient errors (429, 529, 500, 503) → retry with exponential backoff + jitter
  • Client errors (400, 401) → do not retry without code or configuration fix
  • Exponential backoff: start at 1s, double each retry, cap at 60s, add random jitter
  • "Retry with fixed delay" answers are always wrong on the CCA

Structured Output and Tool Use for JSON Enforcement

Domain 1 regularly tests the difference between requesting JSON in a system prompt versus enforcing it through the tool use mechanism. This is a question of reliability, not preference.

When you ask Claude to "respond in JSON format" via the system prompt, Claude will usually comply, but not always. Edge cases, unusual inputs, and certain types of refusals can produce non-JSON output that breaks downstream parsing. For enterprise production systems, this is unacceptable.

The correct approach is to define a tool with a JSON schema that captures the desired output structure, and set tool_choice to {"type": "tool", "name": "your_tool_name"} to force Claude to use that tool. The API enforces schema compliance on the tool_input field: Claude must produce valid JSON matching the schema. This is the only way to guarantee structured output at the API level.

Forcing Structured Output via Tool Use
{
  "model": "claude-sonnet-4-5",
  "tools": [{
    "name": "extract_contract_data",
    "description": "Extract structured data from contract text",
    "input_schema": {
      "type": "object",
      "properties": {
        "party_names": {"type": "array", "items": {"type": "string"}},
        "effective_date": {"type": "string"},
        "contract_value": {"type": "number"},
        "jurisdiction": {"type": "string"}
      },
      "required": ["party_names", "effective_date"]
    }
  }],
  "tool_choice": {"type": "tool", "name": "extract_contract_data"},
  "messages": [{"role": "user", "content": "[Contract text]"}]
}
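On the consuming side, the application reads the schema-validated arguments out of the tool_use content block. A minimal sketch over the response as a plain dict, assuming the Messages API shape (content is a list of blocks, and a tool_use block carries its JSON in "input"):

```python
def extract_tool_input(response: dict, tool_name: str):
    """Return the tool_use block's input for the named tool, or None.

    The response shape assumed here mirrors the Messages API:
    {"content": [{"type": "tool_use", "name": ..., "input": {...}}, ...]}
    """
    for block in response.get("content", []):
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    return None
```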

Context Window and Conversation Management

Context window management is a frequently tested Domain 1 topic. Claude's context window is finite. As conversations grow longer, input token costs increase linearly with conversation length. At some point, the conversation history approaches or exceeds the context window limit, causing API errors.

The exam tests three context management patterns:

  • Sliding window: retain the N most recent messages, discarding the oldest. Simple to implement, but loses early conversation context.
  • Summarisation: periodically summarise older messages, replacing them with a compact summary. Preserves semantic continuity at lower token cost.
  • External memory: store full conversation history externally and retrieve relevant portions via semantic search before each request. Most sophisticated; best for very long sessions or multi-session continuity.

For most enterprise applications, a sliding window combined with periodic summarisation is the recommended production pattern. If you're building a system where conversations may span multiple sessions, read our RAG architecture guide for approaches to external memory retrieval.
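A minimal sketch of the sliding window half of that pattern; in production the dropped messages would feed a summarisation step rather than being discarded, and the budget would be token-based rather than a fixed message count.

```python
def sliding_window(messages: list, max_messages: int = 20) -> list:
    """Keep only the N most recent messages.

    A fixed message count stands in for a real token budget here;
    pair this with summarisation of the dropped prefix in production.
    """
    return messages[-max_messages:]
```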

System Prompt Architecture

The system prompt is the operator's primary mechanism for configuring Claude's behaviour. The CCA exam tests how to structure system prompts effectively for enterprise applications.

A well-architected enterprise system prompt defines: the agent's role and identity, the scope of permitted and prohibited topics, the output format requirements, relevant context and knowledge (within the token budget), and the escalation path for edge cases. The order matters: role and scope first, then constraints, then format requirements, then context.

The exam distinguishes between using the system prompt as guidance (instructions that Claude follows but may deviate from in edge cases) versus using API-level constraints (like tool_choice, max_tokens, or structured output schemas) as enforcement. When behaviour must be guaranteed, use API-level enforcement. When behaviour should be shaped but flexible, use system prompt instructions.

Our advanced prompt engineering guide covers system prompt patterns in depth for enterprise applications.

Need CCA Exam Prep Support?

Our CCA Certification Prep service provides structured study plans, domain-by-domain mock exams, and direct access to certified architects. We have a 90%+ first-attempt pass rate.

Production Architecture Patterns

Domain 1 includes questions about production deployment patterns: how to build Claude-powered applications that are reliable, cost-efficient, and maintainable at enterprise scale.

Idempotency and Retry Safety

Retry logic must be safe. For read operations (summarisation, extraction, analysis), retrying is safe: the same input produces equivalent output. For write operations (sending emails, updating records, posting notifications), retrying without idempotency keys can produce duplicate actions. Design Claude-powered workflows with idempotency in mind: store request IDs, check for prior completion before retrying, and use database transaction semantics for state changes.
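The check-before-retry pattern can be sketched in a few lines. The in-memory dict is a stand-in for a durable store with transactional semantics; request IDs and the action callable are illustrative.

```python
_completed = {}  # stands in for a durable, transactional store

def run_write_action(request_id: str, action):
    """Execute a side-effecting action at most once per request_id.

    A retry with the same id returns the stored result instead of
    repeating the write, so duplicate emails/records are impossible
    (given a durable store in place of this dict).
    """
    if request_id in _completed:
        return _completed[request_id]
    result = action()
    _completed[request_id] = result
    return result
```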

Logging and Observability

Production Claude applications require comprehensive logging: request ID, model used, input token count, output token count, latency, response finish_reason, and the full request/response payload (subject to data retention policies). This enables cost attribution, quality monitoring, debugging, and compliance auditing. The exam will test that you know what to log, not how to write the logging code.
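The required fields can be assembled into one record per request. A minimal sketch; the field names are taken from the list above, and payload logging is omitted since it depends on retention policy.

```python
import time

def build_log_record(request_id: str, model: str, usage: dict,
                     finish_reason: str, latency_ms: float) -> dict:
    """Assemble the per-request log fields listed in this guide.

    The request/response payloads are deliberately excluded here;
    whether to log them is a data-retention-policy decision.
    """
    return {
        "request_id": request_id,
        "model": model,
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "finish_reason": finish_reason,
        "latency_ms": latency_ms,
        "timestamp": time.time(),
    }
```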

Compliance Logging for Regulated Industries

Financial services, healthcare, and legal applications must log Claude interactions for regulatory purposes. This typically means: immutable audit logs, timestamped request/response pairs, operator attribution, and retention periods specified by regulation. Setting temperature=0 makes outputs as reproducible as the API allows: if regulators ask "what would Claude have said to this input," you can replay the request with the logged parameters and obtain the same or near-identical output.

Domain 1: Top 5 Exam Focus Areas

  • Model selection: Match model to task complexity and volume. Opus for complex reasoning, Haiku for high-volume simple tasks
  • Prompt caching: When it applies (large, shared prefix), how it's configured, and the cost benefit
  • Structured output: Tool use with tool_choice is the only reliable JSON enforcement mechanism
  • Error handling: Transient errors → exponential backoff with jitter; client errors → fix the code
  • Context management: Sliding window + summarisation for production multi-turn applications

Test Yourself

After studying this guide, you should be able to answer: Why would prompt caching be ineffective for a customer service chatbot where every conversation is unique? (Answer: there is no shared prefix to cache; each conversation starts fresh.) What does finish_reason: "max_tokens" indicate, and what are the architectural implications? (Answer: the response was truncated; the application must handle incomplete outputs and potentially increase max_tokens or restructure the task.) When is the Batch API inappropriate? (Answer: any latency-sensitive, user-facing application.)

If any of these questions are difficult, re-read the relevant sections above. For practice questions specifically targeting Domain 1, see the 50 CCA practice questions (Q1โ€“Q10). For Domain 2, continue to the MCP study guide.

Claude Implementation Team

Claude Certified Architects with production deployments across financial services, healthcare, and enterprise SaaS. Learn more about our team.