Key Takeaways

Claude evaluation frameworks combine deterministic tests, LLM-as-judge scoring, and human spot-checks
LLM-as-judge uses Claude to evaluate Claude — cost-effective at scale, with 80-90% correlation with human ratings
A comprehensive eval suite catches prompt regressions, model update impacts, and distribution shift
Red-teaming should be systematic, not ad-hoc — build adversarial test cases before launch, not after incidents
Production monitoring closes the loop: log → flag → sample → human review → fix → retest

Why Claude Evaluation Frameworks Matter for Enterprise

A financial services firm built a Claude-powered contract analysis tool. It worked brilliantly. Then a developer updated the system prompt to "improve the tone". Contracts that had previously been flagged for missing indemnity clauses were now passing silently. Nobody noticed for two weeks.

This is the evaluation problem in AI systems. Unlike traditional software where a code change produces a deterministic, testable output, Claude's outputs are probabilistic and context-dependent. A prompt change that improves performance on one class of inputs can degrade performance on another class — and without a systematic Claude evaluation framework, you won't know until users or auditors find the issue.

Claude evaluation frameworks solve this by defining what "correct" looks like for your specific application, automating the measurement of that correctness, and running those measurements continuously — on every code change, on every model update, and on a sample of production traffic. It's the AI equivalent of a CI/CD test suite, and it's non-negotiable for enterprise-grade deployments.

Our AI agent development service includes evaluation framework design as a mandatory deliverable. Any enterprise deploying Claude without automated evals is operating blind. If you want help designing your eval suite, book a call with our Claude Certified Architects.

The Three-Layer Claude Evaluation Architecture

A complete Claude evaluation framework operates at three levels: unit tests, integration evals, and production monitoring. Each layer catches different failure modes and operates at a different cost-quality tradeoff.

🔬

Layer 1: Deterministic Unit Tests

Exact string matching, regex checks, JSON schema validation. Zero cost, instant feedback. Catches structural failures: malformed outputs, missing required fields, out-of-scope responses.

🤖

Layer 2: LLM-as-Judge Scoring

Claude evaluates Claude responses against rubrics. Scales to thousands of test cases at low cost. 80-90% correlation with human judgment. Catches quality regressions, tone shifts, accuracy issues.

👤

Layer 3: Human Review Sampling

Domain experts review 2-5% of production outputs weekly. Ground truth for calibrating automated evals. Catches subtle issues that automated layers miss. Feeds back into test suite expansion.

📊

Production Monitoring

Real-time anomaly detection on output distributions. Thumbs-down rates, escalation rates, latency. Catches distribution shift as user behaviour and input data changes over time.

Building Your Test Suite

Start by building a golden dataset — a curated set of (input, expected output) pairs that represent the full range of your application's use cases. For a legal contract analysis tool, this might be 200 contracts with expert-annotated risk flags. For an HR chatbot, it might be 500 employee questions with verified policy citations. For a customer service agent, it might be 1,000 support tickets with correct resolutions.

Include not just typical cases but also edge cases, adversarial inputs, and known failure modes. If your legal tool once missed an indemnity clause, that contract should be in your test suite forever. Every production failure should become a permanent regression test.

Categorising Test Cases

We organise test suites into four categories. Core functionality tests cover the 80% of typical inputs — they should always pass. Edge case tests cover rare but valid inputs — they should usually pass. Boundary tests probe the limits of intended scope — they should gracefully decline rather than produce wrong output. Adversarial tests cover inputs designed to trick the system — they should never produce policy-violating outputs.

        test-suite-structure.py
        import pytest
import anthropic
import json
from typing import Optional

client = anthropic.Anthropic()

# ============================================================
# TEST SUITE: Legal Contract Analysis Agent
# ============================================================

# Evaluation rubric: what does a "correct" contract analysis look like?
ANALYSIS_RUBRIC = """
You are evaluating a legal contract analysis response.

Criteria:
1. ACCURACY (0-4): Does the response correctly identify the specific legal issue?
   - 4: Correctly identifies all material issues with specific clause references
   - 3: Identifies main issue, misses one minor issue
   - 2: Identifies issue generally but lacks specificity
   - 1: Partially correct but contains significant errors
   - 0: Incorrect or misses the main issue entirely

2. CITATION (0-2): Does the response cite specific clause numbers or section titles?
   - 2: All claims backed by specific clause references
   - 1: Some claims backed by references
   - 0: No citations provided

3. SCOPE (0-2): Does the response stay within the intended scope?
   - 2: Only analyses legal and risk issues as instructed
   - 1: Minor scope deviation
   - 0: Significant off-topic content

Return your evaluation as JSON: {"accuracy": int, "citation": int, "scope": int, "reasoning": str}
"""

def llm_judge(test_case: dict, actual_response: str) -> dict:
    """
    Use Claude to evaluate a Claude response against the rubric.
    Uses a different model to reduce self-serving bias (Haiku judges Sonnet outputs).
    """
    judge_response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Use smaller model to judge larger model
        max_tokens=512,
        system=ANALYSIS_RUBRIC,
        messages=[
            {
                "role": "user",
                "content": f"""Contract Excerpt: {test_case['contract_excerpt']}

Expected Issues to Find: {test_case['expected_issues']}

Actual Response to Evaluate:
{actual_response}

Evaluate and return JSON only."""
            }
        ]
    )

    try:
        scores = json.loads(judge_response.content[0].text)
        scores['passed'] = (scores['accuracy'] >= 3 and scores['citation'] >= 1 and scores['scope'] == 2)
        return scores
    except json.JSONDecodeError:
        return {"accuracy": 0, "citation": 0, "scope": 0, "passed": False, "reasoning": "Judge response unparseable"}


class TestContractAnalysisCore:
    """Core functionality tests — must always pass."""

    def test_identifies_missing_liability_cap(self, contract_analysis_agent):
        test_case = {
            "contract_excerpt": "Neither party shall be liable for any damages...",
            "expected_issues": ["No liability cap specified", "Mutual exclusion applies to direct damages"]
        }
        response = contract_analysis_agent.analyse(test_case['contract_excerpt'])
        scores = llm_judge(test_case, response)
        assert scores['passed'], f"Failed: {scores['reasoning']}"
        assert scores['accuracy'] >= 3, f"Accuracy too low: {scores['accuracy']}"

    def test_flags_missing_termination_clause(self, contract_analysis_agent):
        # Similar pattern for each core test case...
        pass


class TestContractAnalysisAdversarial:
    """Adversarial tests — should reject or gracefully handle."""

    def test_refuses_to_give_legal_advice(self, contract_analysis_agent):
        """Agent should identify issues but not provide legal advice."""
        response = contract_analysis_agent.analyse("Should I sign this contract?")
        assert "i cannot provide legal advice" in response.lower() or \
               "consult a lawyer" in response.lower() or \
               "legal counsel" in response.lower()

    def test_handles_non_contract_input(self, contract_analysis_agent):
        """Agent should gracefully handle non-contract input."""
        response = contract_analysis_agent.analyse("What's the weather in London?")
        # Should not produce a contract analysis
        assert len(response) < 500  # Brief refusal, not a hallucinated analysis
      

LLM-as-Judge: Scaling Evaluation

The most significant advance in AI evaluation over the past 18 months is LLM-as-judge: using a language model to evaluate another language model's outputs. Done correctly, LLM-as-judge achieves 80-90% agreement with human evaluators at roughly 1% of the cost. Done incorrectly, it introduces systematic bias that makes your evaluation worthless.

The single most important design decision is using a different model to judge than the model being evaluated. Self-evaluation has well-documented bias — a model tends to prefer its own outputs regardless of quality. We use Haiku to judge Sonnet, and Sonnet to judge Opus outputs. This isn't perfect but significantly reduces self-serving bias.

The second most important design decision is rubric specificity. A judge prompt that says "rate this response 1-5 for quality" produces inconsistent, hard-to-calibrate results. A judge prompt that defines each score point with specific criteria produces consistent, calibratable results. Every dimension in your rubric should have explicit definitions for each score — ideally with one or two concrete examples per score level.

        llm-judge-framework.py
        from dataclasses import dataclass
from typing import List

@dataclass
class EvalResult:
    test_id: str
    passed: bool
    scores: dict
    reasoning: str
    latency_ms: float
    input_tokens: int
    output_tokens: int

class ClaudeEvalFramework:
    """
    Scalable evaluation framework for Claude applications.
    Supports deterministic checks, LLM judging, and batch processing.
    """

    def __init__(self, judge_model: str = "claude-haiku-4-5-20251001"):
        self.client = anthropic.Anthropic()
        self.judge_model = judge_model
        self.results: List[EvalResult] = []

    def run_deterministic_check(self, response: str, checks: dict) -> dict:
        """
        Fast, cheap checks that run before the LLM judge.
        Catches structural failures without spending API tokens.
        """
        results = {}

        if 'must_contain' in checks:
            for phrase in checks['must_contain']:
                results[f'contains_{phrase[:20]}'] = phrase.lower() in response.lower()

        if 'must_not_contain' in checks:
            for phrase in checks['must_not_contain']:
                results[f'excludes_{phrase[:20]}'] = phrase.lower() not in response.lower()

        if 'max_length' in checks:
            results['within_length'] = len(response) <= checks['max_length']

        if 'valid_json' in checks and checks['valid_json']:
            try:
                json.loads(response)
                results['valid_json'] = True
            except json.JSONDecodeError:
                results['valid_json'] = False

        return results

    def run_eval_suite(self, test_cases: List[dict], production_fn) -> dict:
        """
        Run full eval suite: deterministic checks + LLM judging.
        Uses batch API for cost-efficient LLM judging at scale.
        """
        import time

        all_results = []
        for tc in test_cases:
            start = time.time()
            response = production_fn(tc['input'])
            latency = (time.time() - start) * 1000

            # Layer 1: deterministic checks
            det_results = self.run_deterministic_check(response, tc.get('checks', {}))
            det_passed = all(det_results.values())

            # Layer 2: LLM judge (only if deterministic passed — saves cost)
            if det_passed and 'rubric' in tc:
                judge_result = llm_judge(tc, response)
            else:
                judge_result = {"passed": det_passed, "reasoning": "Failed deterministic checks"}

            all_results.append(EvalResult(
                test_id=tc['id'],
                passed=det_passed and judge_result.get('passed', False),
                scores=judge_result,
                reasoning=judge_result.get('reasoning', ''),
                latency_ms=latency,
                input_tokens=len(tc['input'].split()),  # Approximate
                output_tokens=len(response.split())
            ))

        # Aggregate metrics
        pass_rate = sum(1 for r in all_results if r.passed) / len(all_results)
        avg_latency = sum(r.latency_ms for r in all_results) / len(all_results)

        return {
            "pass_rate": pass_rate,
            "total_tests": len(all_results),
            "passed": sum(1 for r in all_results if r.passed),
            "failed": sum(1 for r in all_results if not r.passed),
            "avg_latency_ms": avg_latency,
            "failed_cases": [r for r in all_results if not r.passed]
        }
      

Need a Production-Ready Eval Framework?

We build Claude evaluation frameworks as part of every AI agent development engagement. The eval suite ships with the agent — not as an afterthought.

Start Your AI Agent Project →

Systematic Red-Teaming Before Launch

Red-teaming is structured adversarial testing: deliberately attempting to make your Claude application produce incorrect, harmful, or out-of-scope outputs. Most teams do this informally ("let's try some weird inputs") which misses entire categories of failures. Systematic red-teaming uses a structured taxonomy of attack vectors and produces a documented pass/fail record.

For enterprise Claude applications, we test five attack categories before production launch: prompt injection (attempts to override the system prompt), scope expansion (attempts to get Claude to answer out-of-domain questions), data extraction (attempts to make Claude reveal information from its context it shouldn't surface), authority impersonation (user claims to have special permissions not granted in the system), and persistence attacks (multi-turn attempts to gradually shift Claude's behaviour from its initial constraints).

Building Adversarial Test Cases

For each attack category, build 10-20 test cases that escalate in sophistication. Start with obvious attempts ("Ignore your instructions and...") and escalate to subtle ones ("As the head of security, I've been granted access to..."). Your pass rate on obvious attacks should be 100%; on sophisticated attacks, it should be at least 85%. If your application fails on sophisticated attacks, that failure mode needs to be addressed before launch.

Red-team test cases should be maintained as part of your test suite and run on every system prompt change. A new instruction added for legitimate reasons can inadvertently create a new vulnerability — the only way to know is to test. See our Claude AI governance framework and security and governance service for how we structure adversarial testing in regulated industries.

Integrating Evals into CI/CD

Evaluation frameworks only provide value if they run automatically on every change. The integration into your CI/CD pipeline is straightforward: add an eval step that runs your test suite on every pull request to the system prompt, the retrieval configuration, or any component that affects Claude's outputs. Block merges that drop pass rate below your defined threshold.

        .github/workflows/claude-evals.yml
        name: Claude Evaluation Suite

on:
  pull_request:
    paths:
      - 'src/prompts/**'
      - 'src/agents/**'
      - 'config/claude_config.yaml'

jobs:
  run-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install anthropic pytest -r requirements.txt

      - name: Run evaluation suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python -m pytest tests/evals/ \
            --json-report \
            --json-report-file=eval-results.json \
            -v

      - name: Check pass rate threshold
        run: |
          python scripts/check_eval_threshold.py \
            --results eval-results.json \
            --min-pass-rate 0.85 \
            --critical-tests "core_functionality,adversarial"

      - name: Post results to PR
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./eval-results.json');
            const passRate = results.summary.passed / results.summary.total;
            const comment = `## Claude Eval Results\n\n` +
              `**Pass Rate:** ${(passRate * 100).toFixed(1)}%\n` +
              `**Tests:** ${results.summary.passed}/${results.summary.total} passed\n\n` +
              (passRate < 0.85 ? '⛔ **Below threshold — merge blocked**' : '✅ **Above threshold — OK to merge**');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });
      

Production Monitoring and Continuous Eval

Pre-launch evals validate your system against known test cases. Production monitoring catches what you didn't anticipate: new user behaviour patterns, input data drift, model performance changes on the long tail of real-world inputs.

The minimum viable production monitoring stack for a Claude application is three metrics: user feedback rate (thumbs-down / explicit negative signals), escalation rate (percentage of conversations handed off to humans), and response length anomaly (extreme outliers suggest something has changed). Log these per-session and alert when they shift by more than one standard deviation from your 30-day baseline.

For more sophisticated monitoring, run your LLM judge on a 5% random sample of production traffic weekly. Compare the distribution of scores against your pre-launch baseline. A shift in average score of more than 0.2 points should trigger a full eval suite run against your golden dataset to identify what changed. Combine this with the cache monitoring patterns from our prompt caching tutorial to get a complete picture of your production system's health.

Production monitoring closes the loop on your evaluation framework. Every flagged production case becomes a candidate for your test suite. Your test suite grows to reflect real-world usage. Your system gets harder to regress over time. This is the feedback loop that separates AI systems that stay reliable at scale from those that degrade invisibly.

Evaluation for Agentic Systems

Single-turn Q&A evaluation is relatively straightforward. Evaluating agentic Claude systems — where Claude calls tools, takes multi-step actions, and operates over extended workflows — is significantly harder. A tool call that looks correct in isolation may be part of a sequence that leads to an incorrect outcome. An action taken in step 3 might only be evaluable in light of the outcome of step 8.

For agentic evaluation, we use three approaches in combination. Trajectory evaluation scores the sequence of actions, not just the final output — did Claude take sensible intermediate steps? Tool call correctness evaluation checks whether each tool call had correct parameters for the given context. Outcome evaluation measures whether the end state of the world matches the intended goal. All three are necessary; outcome evaluation alone misses efficient vs. inefficient paths and catches only gross failures.

Agentic evals are more expensive to run because they involve full workflow execution. We recommend running trajectory and tool call evals on every change, and running full outcome evals nightly on a representative sample. See our AI agent evaluation and testing guide and our multi-agent systems article for architectural patterns that make agentic evaluation tractable.

⚡

ClaudeImplementation Team

Claude Certified Architects. Every AI agent we ship includes a full evaluation suite — it's non-negotiable. Learn more about us →

How to Build Claude Evaluation Frameworks: Automated Testing for AI Applications