Most enterprise AI deployments start with text. Then someone asks: "Can Claude look at this invoice scan?" or "Can it read the tables in this PDF?" or "Can it analyse this product photo and flag defects?" That's when the Claude Vision API becomes essential infrastructure.
Claude Sonnet 4.6 and Opus 4.6 are both fully multimodal: they accept images, PDFs, and other documents as part of the API request, alongside text prompts. This guide covers how the Vision API works, the formats it supports, how to implement it correctly, and the enterprise use cases where it delivers real business value. For a broader look at the Claude API for enterprise, see our product guide.
What the Claude Vision API Can Process
Claude's vision capabilities are not OCR with an LLM bolted on. The model genuinely understands visual content: it can interpret diagrams, read handwriting, extract structured data from tables, understand spatial relationships in engineering drawings, and reason about what it sees in the context of your prompt.
Supported input formats include JPEG, PNG, GIF, and WebP images, as well as PDF documents. The model can receive multiple images in a single request (up to 20 images, at approximately 5MB per image), enabling workflows like comparing documents side-by-side, processing multi-page scans, or combining visual and textual evidence in a single reasoning chain.
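As a minimal sketch of the multi-image case, the helper below turns a list of files into image content blocks and rejects inputs outside the formats and limits quoted above. The function name `build_image_blocks`, the extension map, and the limit constants are illustrative assumptions, not part of the SDK; confirm the limits against the current API documentation.

```python
import base64
from pathlib import Path

# Limits as quoted in this guide; treat them as assumptions to verify.
MAX_IMAGES_PER_REQUEST = 20
MAX_IMAGE_BYTES = 5 * 1024 * 1024

# Explicit map so the result does not depend on the platform's MIME database.
EXTENSION_TO_MEDIA_TYPE = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def build_image_blocks(images):
    """Turn (filename, raw_bytes) pairs into image content blocks."""
    if len(images) > MAX_IMAGES_PER_REQUEST:
        raise ValueError(f"too many images: {len(images)}")
    blocks = []
    for filename, raw in images:
        media_type = EXTENSION_TO_MEDIA_TYPE.get(Path(filename).suffix.lower())
        if media_type is None:
            raise ValueError(f"unsupported image format: {filename}")
        if len(raw) > MAX_IMAGE_BYTES:
            raise ValueError(f"{filename} exceeds the per-image size limit")
        blocks.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64.standard_b64encode(raw).decode("utf-8"),
            },
        })
    return blocks
```

The returned list can be followed by a single text block (for example, "Compare these two documents and list the differences") to form the `content` of one user message.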
Image Token Costs
Vision inputs are billed in tokens, like text inputs. Token count depends on image dimensions rather than image content: Anthropic's published estimate is roughly (width × height) / 750, so a 1000×1000 pixel image consumes approximately 1,300 tokens. High-resolution images are automatically downscaled by the API to a maximum of 1,568 pixels on the longest edge. For simple images where fine detail isn't needed, you can downscale them further yourself before sending to reduce token consumption. For a financial client processing 50,000 invoice images per month, optimising image resolution reduced their vision API costs by 38%.
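For budgeting, a rough estimator can be sketched from Anthropic's published rule of thumb of about (width × height) / 750 tokens per image. The function names are hypothetical, and the sketch models only the long-edge downscaling mentioned above, so treat its output as a planning figure; the `usage` data returned with each API response is the authoritative count.

```python
MAX_LONG_EDGE = 1568  # downscale threshold quoted in this guide


def estimate_image_tokens(width, height):
    """Rough estimate: apply the long-edge cap, then width * height / 750."""
    long_edge = max(width, height)
    if long_edge > MAX_LONG_EDGE:
        scale = MAX_LONG_EDGE / long_edge
        width, height = round(width * scale), round(height * scale)
    return round(width * height / 750)


def monthly_image_tokens(width, height, images_per_month):
    """Estimated image input tokens for a month of same-sized images."""
    return estimate_image_tokens(width, height) * images_per_month


if __name__ == "__main__":
    print(estimate_image_tokens(1000, 1000))
    print(monthly_image_tokens(1000, 1000, 50_000))
```

Comparing the estimate at the original scan resolution against a reduced resolution gives a quick sense of how much a preprocessing step might save before you commit to building it.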
API Implementation: How to Send Images
The Claude Vision API uses the standard Messages API structure. Images are passed as content blocks of type image within the user message, either as base64-encoded data or as a URL (for publicly accessible images).
import anthropic
import base64

# Read the invoice image and base64-encode it for the request body
with open("invoice.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

# The client reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": image_data,
                },
            },
            {
                "type": "text",
                "text": "Extract the following fields as JSON: invoice_number, vendor_name, invoice_date, line_items (array), total_amount, currency"
            }
        ],
    }]
)

# The first content block of the response holds the generated text
print(message.content[0].text)
# Alternative: reference a publicly accessible image by URL instead of base64
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/product-image.png"
                },
            },
            {
                "type": "text",
                "text": "Identify any visible defects or quality issues in this product photo. Respond in JSON."
            }
        ],
    }]
)
Security Note: Image URLs
When using URL-based image sources, the image is fetched from the URL at request time, so the URL must be reachable from outside your network. For sensitive enterprise images, use base64 encoding instead: this keeps the image data inside your API call and avoids hosting the image at an externally accessible URL. Our API integration service includes a security review of image data handling in your architecture.
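One way to enforce this rule in code is a small guard that builds the `source` dict and refuses URL sources for anything marked sensitive. This is a sketch under assumptions: `make_image_source` and its `sensitive` flag are hypothetical names, and the https-only check is an extra policy choice rather than an API requirement.

```python
import base64
from urllib.parse import urlparse


def make_image_source(*, data=None, url=None, media_type="image/jpeg", sensitive=True):
    """Return an image `source` dict, refusing URL sources for sensitive images."""
    if data is not None:
        # Inline the bytes so the image travels only inside the API request.
        return {
            "type": "base64",
            "media_type": media_type,
            "data": base64.standard_b64encode(data).decode("utf-8"),
        }
    if url is None:
        raise ValueError("provide either data or url")
    if sensitive:
        raise ValueError("sensitive images must be sent as base64, not by URL")
    if urlparse(url).scheme != "https":
        raise ValueError("only https URLs are accepted in this sketch")
    return {"type": "url", "url": url}
```

Defaulting `sensitive` to `True` means a caller has to opt in explicitly before an image can be sent by URL.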
Enterprise Use Cases for Claude Vision
The Claude Vision API isn't a novelty feature. In production enterprise deployments, it replaces expensive custom OCR pipelines, manual data entry teams, and brittle template-matching systems that break on every new document format.
Invoice & Receipt Processing
Extract line items, totals, vendor details, and tax fields from invoices in any format: scanned PDFs, photos, handwritten receipts. No template configuration required. Accuracy consistently exceeds 97% on structured invoices in our production deployments for accounts payable automation.
Contract & Document Review
Send scanned contract pages or PDF attachments for clause extraction, obligation identification, and risk flagging. Claude reads tables, footnotes, handwritten annotations, and redlined text, significantly outperforming standard PDF text extraction on complex legal documents.
Chart & Data Visualisation Analysis
Convert images of charts, graphs, and dashboards into structured data. Claude can read bar charts, line graphs, pie charts, and complex multi-axis visualisations and return the underlying data as JSON or CSV, eliminating manual data re-entry from reports and analyst decks.
Manufacturing Quality Control
Analyse product photos for defects, surface irregularities, assembly errors, or packaging damage. Claude can describe what it observes with precision and classify images against a quality standard defined in your prompt. Deployed in electronics and consumer goods manufacturing.
Engineering Drawing Interpretation
Process CAD drawings, blueprints, floor plans, and technical schematics. Claude can extract dimensions, identify components, describe assembly sequences, and answer questions about drawings, enabling AI-assisted design review and specification extraction.
Healthcare Document Processing
Process medical forms, lab result scans, referral letters, and insurance documents. With appropriate governance in place, the Vision API accelerates administrative workflows that previously required manual data entry by clinical staff. Claude never stores image data after processing.
Building a document processing pipeline?
Our Claude API integration service designs and deploys production document processing pipelines, including image preprocessing, error handling, validation, and cost-optimised batching. Most document workflows are live within 4-6 weeks.
Handling PDF Documents
Claude accepts PDF files natively via the API, so you don't need to pre-convert PDFs to images or extract text separately. PDFs are passed as a document content block (not image) with the application/pdf media type. The model processes the full document including embedded images, tables, headers, footers, and multi-column layouts: it treats the PDF as it visually appears, not just as extracted text.
import anthropic
import base64

client = anthropic.Anthropic()

# Read the PDF and base64-encode it for the request body
with open("contract.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {
                # PDFs use a "document" block, not an "image" block
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_data,
                },
            },
            {
                "type": "text",
                "text": "Review this contract and identify: (1) payment terms, (2) termination clauses, (3) liability caps, (4) governing law. Return as structured JSON."
            }
        ],
    }]
)

print(message.content[0].text)
PDF Size Limits and Batching
The API accepts PDF documents up to 32MB and up to 100 pages per request. For longer documents, split them into logical sections (by chapter, section, or exhibit) and process them in parallel requests. Use the Batch API for high-volume document processing workloads to cut costs by 50% while processing documents asynchronously.
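The splitting and batching steps can be sketched as follows. This is an illustration under assumptions: `page_ranges` and `batch_requests` are hypothetical helpers, the split is by fixed page count rather than by logical section, and the actual PDF slicing (which needs a PDF library) is left out.

```python
MAX_PAGES_PER_REQUEST = 100  # page limit quoted in this guide


def page_ranges(total_pages, max_pages=MAX_PAGES_PER_REQUEST):
    """Split 1-based page numbers into inclusive (start, end) chunks."""
    if total_pages < 1:
        return []
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


def batch_requests(doc_id, pdf_chunks_b64, prompt, model="claude-sonnet-4-6"):
    """Build one batch request entry per base64-encoded PDF chunk."""
    return [
        {
            # custom_id lets you match each result back to its chunk.
            "custom_id": f"{doc_id}-part-{index}",
            "params": {
                "model": model,
                "max_tokens": 4096,
                "messages": [{
                    "role": "user",
                    "content": [
                        {
                            "type": "document",
                            "source": {
                                "type": "base64",
                                "media_type": "application/pdf",
                                "data": chunk,
                            },
                        },
                        {"type": "text", "text": prompt},
                    ],
                }],
            },
        }
        for index, chunk in enumerate(pdf_chunks_b64, start=1)
    ]
```

With the Python SDK, the resulting list would be submitted with `client.messages.batches.create(requests=...)`; results arrive asynchronously and are matched to their chunks by `custom_id`.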
Prompting Strategies for Vision Tasks
The quality of visual extraction depends as much on prompt design as on model capability. The following strategies consistently improve accuracy in production document processing systems.
Be Explicit About Output Format
Always specify the exact output structure you need. "Extract the invoice data" is worse than "Extract the following fields as valid JSON: {invoice_number: string, vendor_name: string, line_items: [{description: string, quantity: number, unit_price: number, total: number}], subtotal: number, tax: number, total: number}". The more specific your schema, the more reliably Claude produces structured data you can parse programmatically.
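A schema in the prompt is most useful when the reply is also checked against it. The sketch below parses a reply and verifies the invoice fields named above; `parse_invoice` and `REQUIRED_FIELDS` are hypothetical names, and the handling of a Markdown code fence around the JSON is an assumption about a common reply shape, not documented behaviour.

```python
import json

# Top-level fields from the example schema above, with their expected types.
REQUIRED_FIELDS = {
    "invoice_number": str,
    "vendor_name": str,
    "line_items": list,
    "subtotal": (int, float),
    "tax": (int, float),
    "total": (int, float),
}

FENCE = "`" * 3  # Markdown code fence marker


def parse_invoice(response_text):
    """Parse the model's reply as JSON and check required fields and types."""
    text = response_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and anything from the closing fence on.
        text = text.partition("\n")[2].rsplit(FENCE, 1)[0]
    data = json.loads(text)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return data
```

A failed check is a natural trigger for a retry with a corrective prompt or for routing the document to manual review.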
Chain Vision with Reasoning
For complex analysis tasks, use Claude's extended thinking or multi-turn architecture. First request: extract all visible text and data. Second request: reason about the extracted content. This separation reduces errors on documents where layout is ambiguous.
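The two-request pattern can be sketched as a small function. Here `call_model` is a hypothetical callable that takes a list of content blocks and returns the reply text; the prompt wording is illustrative only.

```python
def extract_then_reason(call_model, image_block, question):
    """Two-pass chain: transcribe the document first, then reason over the text."""
    # Pass 1: the model sees the image and only transcribes it.
    extraction = call_model([
        image_block,
        {"type": "text",
         "text": "Transcribe all visible text and tabular data exactly as shown."},
    ])
    # Pass 2: the model reasons over the transcription, without the image.
    answer = call_model([
        {"type": "text",
         "text": f"Extracted document content:\n{extraction}\n\nTask: {question}"},
    ])
    return extraction, answer
```

In a real pipeline, `call_model` would wrap `client.messages.create(...)` and return `message.content[0].text`. Keeping the intermediate transcription also gives you an audit trail for each decision.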
Provide Domain Context
Tell Claude what kind of document it's processing. "This is a UK standard form NEC4 construction contract" produces better clause extraction than a generic prompt. Domain context activates relevant model knowledge and reduces misinterpretation of industry-specific terminology.
Enterprise Architecture Considerations
Deploying the Vision API at scale requires more than just passing images to the API. Production document processing pipelines need error handling, validation, retry logic, quality thresholds, and cost governance.
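As one example of the retry logic mentioned above, the sketch below wraps any callable in exponential backoff with jitter. The name `call_with_retries` and its defaults are assumptions for illustration; the Anthropic Python SDK also has its own configurable retry behaviour, so a wrapper like this is mainly useful around larger pipeline steps.

```python
import random
import time


def call_with_retries(func, *, attempts=4, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as 1x, 2x, 4x ... of base_delay, plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
```

In production you would narrow `retry_on` to transient errors (rate limits, timeouts, connection failures) so that malformed requests fail fast instead of being retried.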
For high-volume processing (100K+ documents per month), the architecture pattern we use combines the Batch API for bulk processing, prompt caching for repeated system prompts, and a validation layer that checks extracted JSON against your schema before downstream ingestion. We also recommend a confidence scoring step (returning extraction confidence with each field) so exceptions can be routed to human review automatically.
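The confidence-routing step might look like the sketch below. It assumes the prompt asks the model to return each field as an object with `value` and `confidence` keys; that response shape, the `route_extraction` name, and the 0.9 threshold are all illustrative assumptions. Model-reported confidence is a self-assessment, so the threshold should be tuned against labelled samples.

```python
def route_extraction(fields, threshold=0.9):
    """Split extracted fields into accepted values and names needing review.

    `fields` maps a field name to {"value": ..., "confidence": float}.
    """
    accepted, needs_review = {}, []
    for name, field in fields.items():
        confidence = field.get("confidence")
        if isinstance(confidence, (int, float)) and confidence >= threshold:
            accepted[name] = field["value"]
        else:
            # Missing, malformed, or low confidence all go to a human.
            needs_review.append(name)
    return accepted, needs_review
```

Documents with an empty review list can flow straight to downstream ingestion, while the rest are queued with the flagged field names attached.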
If you're building an agentic document workflow, where Claude not only extracts data but then takes actions based on it, our AI agent development service designs the full pipeline, from document ingestion to downstream system integration via MCP servers.
Key Takeaways
- Claude Vision API supports JPEG, PNG, GIF, WebP images and PDF documents natively
- Use base64 encoding for sensitive enterprise images; URL source for public/low-sensitivity assets
- Specify exact JSON output schemas for structured extraction; vague prompts produce unreliable results
- Use Batch API for high-volume document processing to cut costs by 50%
- For PDFs, Claude processes the full visual layout, not just extracted text, producing better accuracy on complex documents