A Claude RAG system with Pinecone solves the single biggest limitation of large language models: they don't know what happened after their training cutoff, and they don't know anything about your private data. Retrieval-augmented generation (RAG) fixes both problems by retrieving relevant documents from your knowledge base at query time and feeding them into the context window before Claude generates a response. The result is a Claude that can answer questions about your internal documentation, product knowledge, historical records, or any corpus that changes over time.

This tutorial builds a complete, production-ready Claude RAG pipeline with Pinecone in Python. By the end, you'll have a system that ingests documents, creates vector embeddings, stores them in Pinecone, retrieves semantically relevant chunks at query time, and feeds them to Claude Sonnet to generate grounded, citation-backed responses. If you want us to build this for your enterprise knowledge base, our Claude API integration service includes full RAG architecture and deployment.

What Is RAG and Why Claude Needs It

Claude's training data has a knowledge cutoff. It knows nothing about your company's internal documents, your product specifications, your support knowledge base, or any events that happened after its training ended. Even with Claude's large 200k token context window, you can't paste an entire knowledge base into every API call. RAG is the architecture that solves this: retrieve only the relevant sections at query time, inject them into the context window, then generate a response grounded in that retrieved content.

Pinecone is a managed vector database optimised for this retrieval step. Unlike a traditional full-text search index (Elasticsearch, OpenSearch), Pinecone performs semantic search: it finds documents that are conceptually similar to the query, even if they don't share the same keywords. A query about "contract termination notice period" will retrieve documents that discuss "agreement end clauses" or "termination provisions" even if the exact phrase doesn't appear. This is what makes RAG useful for real knowledge management problems.

Claude is particularly well-suited to RAG architectures because of its extended thinking capability, its ability to synthesise across multiple retrieved documents, and its tendency to stay grounded in provided context rather than hallucinating. The combination of Pinecone's retrieval accuracy and Claude's synthesis quality produces results that feel like a genuinely knowledgeable assistant, not a search engine. For the full architectural context, see our guide on building RAG systems with the Claude API.

System Architecture

The complete system has two phases: ingestion (offline) and querying (real-time). Understanding both phases before writing code will save you significant debugging time.

Ingestion Pipeline (Offline)

Raw Documents (PDF, DOCX, MD, TXT) → Chunking (split into ~500-token segments) → Embedding (text-embedding-3-small) → Pinecone Index (1536-dim vectors)

Query Pipeline (Real-time)

User Query → Embed Query (same model) → Pinecone (top-k semantic search) → Claude API (generate with context) → Response + Citations

Two decisions define your RAG system's quality before you write a line of code: chunk size and embedding model. Chunk size controls how much text is in each retrievable unit. Too large (1,500+ tokens) and you retrieve irrelevant context. Too small (under 100 tokens) and you lose the surrounding context that makes the chunk meaningful. For most document types, 400–600 tokens with 50-token overlap is the right starting point.
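Those numbers can be sanity-checked with a quick sketch. Assuming token counts are already known, the spans each chunk covers follow directly from chunk size and overlap (a 1,200-token document, for instance, yields three overlapping chunks):

```python
def chunk_spans(n_tokens: int, chunk_size: int = 500, overlap: int = 50) -> list:
    """Compute (start, end) token spans for overlapping chunks."""
    # Each new chunk starts chunk_size - overlap tokens after the previous one,
    # so consecutive chunks share `overlap` tokens of context.
    stride = chunk_size - overlap
    spans = []
    start = 0
    while start < n_tokens:
        spans.append((start, min(start + chunk_size, n_tokens)))
        start += stride
    return spans

# A 1,200-token document yields three overlapping chunks
print(chunk_spans(1200))  # → [(0, 500), (450, 950), (900, 1200)]
```

Larger overlap means more redundant storage and embedding cost; smaller overlap risks splitting an answer across two chunks where neither retrieves well on its own.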

For the embedding model, this tutorial uses OpenAI's text-embedding-3-small (1536 dimensions). You can also use Cohere's embedding models or open-source alternatives like all-MiniLM-L6-v2 via sentence-transformers. The critical constraint is that you must use the same embedding model for both ingestion and querying; never mix models in the same index.

Step 1: Set Up Dependencies

Install required packages

You'll need the Anthropic SDK, Pinecone client, OpenAI SDK (for embeddings), and document parsing libraries.

# Install core dependencies ('pinecone' is the current name of the package formerly published as 'pinecone-client')
pip install anthropic pinecone openai python-dotenv

# Document parsing (install what you need)
pip install pypdf2 python-docx markdown beautifulsoup4

# Text splitting
pip install langchain-text-splitters  # or tiktoken for token-aware splitting
Create a .env file in the project root with your API keys:

# .env
ANTHROPIC_API_KEY=sk-ant-your-key-here
OPENAI_API_KEY=sk-your-openai-key   # For embeddings only
PINECONE_API_KEY=your-pinecone-key
PINECONE_INDEX_NAME=claude-rag-index
Create your Pinecone index

Create the vector index with the correct dimensions for your embedding model before ingesting documents.

# setup_pinecone.py
from pinecone import Pinecone, ServerlessSpec
import os
from dotenv import load_dotenv

load_dotenv()

pc = Pinecone(api_key=os.environ['PINECONE_API_KEY'])

index_name = os.environ['PINECONE_INDEX_NAME']

# Create index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,          # text-embedding-3-small dimensions
        metric='cosine',         # cosine similarity for text search
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )
    print(f"Created Pinecone index: {index_name}")
else:
    print(f"Index {index_name} already exists")

index = pc.Index(index_name)
print(f"Index stats: {index.describe_index_stats()}")

Step 2: Embed and Index Your Documents

Document ingestion has three stages: parse the raw document into text, split the text into chunks, and embed each chunk into a vector. The metadata you store alongside each vector is as important as the vector itself: it's what lets you filter searches and construct citations in the response.

# ingest.py - Document ingestion pipeline
import os
import hashlib
import json
from pathlib import Path
from typing import List, Dict
from dotenv import load_dotenv
from openai import OpenAI
from pinecone import Pinecone

load_dotenv()

openai_client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
pc = Pinecone(api_key=os.environ['PINECONE_API_KEY'])
index = pc.Index(os.environ['PINECONE_INDEX_NAME'])

def load_document(file_path: str) -> str:
    """Load text from various file types."""
    path = Path(file_path)
    suffix = path.suffix.lower()

    if suffix == '.txt' or suffix == '.md':
        return path.read_text(encoding='utf-8')

    elif suffix == '.pdf':
        import PyPDF2
        with open(file_path, 'rb') as f:
            reader = PyPDF2.PdfReader(f)
            return '\n'.join(page.extract_text() for page in reader.pages)

    elif suffix == '.docx':
        from docx import Document
        doc = Document(file_path)
        return '\n'.join(para.text for para in doc.paragraphs)

    else:
        raise ValueError(f"Unsupported file type: {suffix}")

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks by word count.

    Word count is a rough proxy for token count; for exact token-aware
    splitting, encode with tiktoken and slice the token list instead.
    """
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = ' '.join(words[start:end])
        chunks.append(chunk)
        start += chunk_size - overlap
    return chunks

def embed_texts(texts: List[str]) -> List[List[float]]:
    """Embed a list of texts using OpenAI's embedding model."""
    # Batch in groups of 100 to avoid rate limits
    all_embeddings = []
    batch_size = 100
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(
            input=batch,
            model='text-embedding-3-small'
        )
        all_embeddings.extend([item.embedding for item in response.data])
    return all_embeddings

def ingest_document(file_path: str, source_name: str, metadata: Dict = None):
    """Full ingestion pipeline for a single document."""
    print(f"Processing: {file_path}")

    # Load and chunk
    text = load_document(file_path)
    chunks = chunk_text(text, chunk_size=500, overlap=50)
    print(f"  Created {len(chunks)} chunks")

    # Embed all chunks
    embeddings = embed_texts(chunks)
    print(f"  Embedded {len(embeddings)} chunks")

    # Prepare vectors for Pinecone
    vectors = []
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
        # Create deterministic ID based on content
        chunk_id = hashlib.md5(f"{source_name}_{i}_{chunk[:50]}".encode()).hexdigest()

        vectors.append({
            'id': chunk_id,
            'values': embedding,
            'metadata': {
                'text': chunk,
                'source': source_name,
                'file_path': file_path,
                'chunk_index': i,
                'total_chunks': len(chunks),
                **(metadata or {})
            }
        })

    # Upsert to Pinecone in batches of 100
    batch_size = 100
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i:i + batch_size]
        index.upsert(vectors=batch)

    print(f"  Indexed {len(vectors)} vectors to Pinecone")
    return len(vectors)


# Example: ingest a folder of documents
if __name__ == '__main__':
    docs_folder = './documents'
    for file_path in Path(docs_folder).rglob('*'):
        if file_path.suffix in ['.txt', '.md', '.pdf', '.docx']:
            ingest_document(
                str(file_path),
                source_name=file_path.stem,
                metadata={'category': 'knowledge-base', 'ingested_at': '2026-03-25'}
            )

Step 3: Query and Retrieval

The retrieval step embeds the incoming query using the same model used during ingestion, then searches Pinecone for the most semantically similar document chunks. The top_k parameter controls how many chunks you retrieve; typically 5–10 strikes a balance between context richness and token efficiency.

# retrieval.py
import os
from openai import OpenAI
from pinecone import Pinecone
from dotenv import load_dotenv

load_dotenv()

openai_client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
pc = Pinecone(api_key=os.environ['PINECONE_API_KEY'])
index = pc.Index(os.environ['PINECONE_INDEX_NAME'])

def embed_query(query: str) -> list:
    """Embed a single query string."""
    response = openai_client.embeddings.create(
        input=query,
        model='text-embedding-3-small'
    )
    return response.data[0].embedding

def retrieve(query: str, top_k: int = 7, filter: dict = None) -> list:
    """
    Retrieve the most relevant document chunks for a query.
    Optional filter: e.g., {'category': 'policy'} to search only policy docs.
    """
    query_embedding = embed_query(query)

    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True,
        filter=filter  # Metadata filter for scoped search
    )

    # Return chunks with their relevance scores
    chunks = []
    for match in results.matches:
        chunks.append({
            'text': match.metadata['text'],
            'source': match.metadata.get('source', 'Unknown'),
            'score': match.score,
            'chunk_index': match.metadata.get('chunk_index'),
        })

    return chunks


# Test retrieval
if __name__ == '__main__':
    results = retrieve("What is the notice period for contract termination?")
    for i, chunk in enumerate(results):
        print(f"\n--- Chunk {i+1} (score: {chunk['score']:.3f}) ---")
        print(f"Source: {chunk['source']}")
        print(f"Text: {chunk['text'][:200]}...")


Step 4: Augmented Generation with Claude

Once you have the retrieved chunks, you construct a prompt that presents them to Claude as context alongside the user's query. The quality of this prompt (how you frame the retrieved context, what you instruct Claude to do with it, how you ask it to handle knowledge gaps) determines the quality of the final response.

# generation.py
import os
import anthropic
from dotenv import load_dotenv
from retrieval import retrieve

load_dotenv()

client = anthropic.Anthropic(api_key=os.environ['ANTHROPIC_API_KEY'])

def format_context(chunks: list) -> str:
    """Format retrieved chunks into a context block for Claude."""
    if not chunks:
        return "No relevant documents found."

    context_parts = []
    for i, chunk in enumerate(chunks, 1):
        context_parts.append(
            f"[Source {i}: {chunk['source']} (relevance: {chunk['score']:.2f})]\n"
            f"{chunk['text']}"
        )
    return "\n\n---\n\n".join(context_parts)

def generate_response(query: str, context_chunks: list, model: str = 'claude-sonnet-4-6') -> dict:
    """Generate a grounded response using Claude with retrieved context."""

    context = format_context(context_chunks)
    sources = list(set(chunk['source'] for chunk in context_chunks))

    system_prompt = """You are a knowledge base assistant. Answer questions based strictly on the provided documents.

RULES:
1. Only use information from the provided documents to answer the question
2. If the answer is not in the documents, say "I couldn't find this in the available documents" - do not guess
3. Always cite which source(s) you used at the end of your response
4. Be concise and specific - don't pad responses with unnecessary context
5. If documents contradict each other, note the discrepancy"""

    user_message = f"""DOCUMENTS:
{context}

QUESTION: {query}

Please answer based on the documents above."""

    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=system_prompt,
        messages=[{'role': 'user', 'content': user_message}]
    )

    return {
        'answer': response.content[0].text,
        'sources': sources,
        'tokens_used': response.usage.input_tokens + response.usage.output_tokens,
        'chunks_used': len(context_chunks)
    }


# Test end-to-end
if __name__ == '__main__':
    query = "What are our policies on remote work and home office equipment?"
    chunks = retrieve(query, top_k=7)
    result = generate_response(query, chunks)
    print("\nQUESTION:", query)
    print("\nANSWER:", result['answer'])
    print("\nSOURCES:", result['sources'])
    print(f"Tokens used: {result['tokens_used']}")

Step 5: Full Pipeline Integration

The full pipeline brings ingestion and querying together into a single application. Here's a minimal Flask API that exposes the RAG system as an HTTP endpoint โ€” deployable behind your chatbot frontend, an internal tool, or a Slack integration.

# app.py โ€” Flask RAG API
from flask import Flask, request, jsonify
from dotenv import load_dotenv
from retrieval import retrieve
from generation import generate_response

load_dotenv()

app = Flask(__name__)

@app.route('/query', methods=['POST'])
def query():
    data = request.get_json(silent=True) or {}  # tolerate missing/invalid JSON bodies
    user_query = (data.get('query') or '').strip()

    if not user_query or len(user_query) > 2000:
        return jsonify({'error': 'Invalid query'}), 400

    # Optional: filter by document category
    filter_metadata = data.get('filter')

    # Retrieve relevant chunks
    chunks = retrieve(user_query, top_k=7, filter=filter_metadata)

    # Filter by minimum relevance score (tune this for your data)
    MIN_SCORE = 0.70
    relevant_chunks = [c for c in chunks if c['score'] >= MIN_SCORE]

    if not relevant_chunks:
        return jsonify({
            'answer': "I couldn't find relevant information in the knowledge base for this question.",
            'sources': [],
            'chunks_found': 0
        })

    # Generate response
    result = generate_response(user_query, relevant_chunks)

    return jsonify({
        'answer': result['answer'],
        'sources': result['sources'],
        'chunks_used': result['chunks_used'],
        'tokens_used': result['tokens_used']
    })

@app.route('/health', methods=['GET'])
def health():
    return jsonify({'status': 'ok'})

if __name__ == '__main__':
    app.run(port=5000, debug=False)

Advanced Techniques

The pipeline above handles 80% of RAG use cases. These techniques address the remaining 20%: the cases where basic vector search falls short.

Hybrid search: vector + keyword

Pure vector search struggles with precise lookups: product codes, person names, exact phrases. Hybrid search combines vector similarity with BM25 keyword search, weighting the results together. Pinecone supports hybrid search natively with sparse-dense vectors. Add a sparse vector to each upsert; at query time, scale the dense and sparse query vectors by an alpha weight client-side to control the keyword-to-semantic ratio.

# Hybrid search with Pinecone (sparse + dense)
from pinecone_text.sparse import BM25Encoder

bm25 = BM25Encoder()
bm25.fit(corpus_texts)  # Train BM25 statistics on your corpus

# During upsert, add a sparse vector alongside the dense one
sparse_values = bm25.encode_documents([chunk_text])[0]
index.upsert(vectors=[{
    'id': chunk_id,
    'values': dense_embedding,       # Dense (semantic) vector
    'sparse_values': sparse_values,  # Sparse (keyword) vector
    'metadata': {...}
}])

# At query time, weight the two vectors client-side before querying.
# alpha = 0 is pure keyword, alpha = 1 is pure semantic.
def hybrid_scale(dense, sparse, alpha):
    scaled_sparse = {
        'indices': sparse['indices'],
        'values': [v * (1 - alpha) for v in sparse['values']],
    }
    scaled_dense = [v * alpha for v in dense]
    return scaled_dense, scaled_sparse

dense_q, sparse_q = hybrid_scale(
    dense_query_embedding, bm25.encode_queries([query])[0], alpha=0.7
)
results = index.query(
    vector=dense_q,
    sparse_vector=sparse_q,
    top_k=7,
    include_metadata=True
)

Query expansion with Claude

Before running retrieval, ask Claude to generate 3–5 alternative phrasings of the user's query. Run separate retrievals for each phrasing, merge the results, and deduplicate. This technique, called query expansion, significantly improves recall for ambiguous or jargon-heavy queries, at the cost of one extra Claude API call per query.
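The merge-and-deduplicate step can be sketched as follows, assuming each retrieval pass returns chunk dicts shaped like the retrieve() output above; the expansion itself is one extra messages.create request asking Claude for rephrasings, omitted here:

```python
def merge_retrievals(result_lists: list) -> list:
    """Merge chunks from several retrieval passes, deduplicating by chunk identity."""
    best = {}
    for chunks in result_lists:
        for c in chunks:
            # A chunk is identified by its source document and position within it
            key = (c['source'], c['chunk_index'])
            # Keep the highest score each chunk achieved across all phrasings
            if key not in best or c['score'] > best[key]['score']:
                best[key] = c
    # Return unique chunks, best-scoring first
    return sorted(best.values(), key=lambda c: c['score'], reverse=True)
```

Keeping the maximum score per chunk means a chunk that matches any phrasing strongly still ranks high, which is exactly the recall boost query expansion is meant to provide.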

Re-ranking retrieved chunks

Pinecone returns chunks ranked by vector similarity, but similarity isn't always the same as relevance. A re-ranker model (Cohere's Rerank API or a local cross-encoder) scores each retrieved chunk against the original query and reorders them by actual relevance. This adds latency (~200ms) but reduces the noise in your context window and improves response quality for complex queries.
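The reordering itself is simple once the re-ranker has scored each (query, chunk) pair; a minimal sketch, with the scoring call (Cohere Rerank or a local cross-encoder) left as an input so only the wiring is shown:

```python
def rerank(chunks: list, rerank_scores: list) -> list:
    """Reorder retrieved chunks by re-ranker scores, highest first.

    rerank_scores[i] is the re-ranker's relevance score for chunks[i],
    e.g. from a cross-encoder scoring (query, chunk_text) pairs.
    """
    order = sorted(range(len(chunks)), key=lambda i: rerank_scores[i], reverse=True)
    return [chunks[i] for i in order]
```

After re-ranking, you can also truncate to the top 3–5 chunks, since the re-ranker's ordering is usually reliable enough to justify a smaller context window.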

Enterprise RAG Architecture

A production enterprise RAG system serving hundreds of users across a large document corpus looks substantially different from the tutorial above. These are the architectural differences that matter.

Multi-namespace organisation is the first priority. Pinecone namespaces let you isolate different knowledge bases within the same index (HR policies, legal documents, product specifications, customer support history), each searchable independently or together. This enables department-scoped search without managing multiple indexes. Access control at the namespace level is also more tractable than document-level ACLs.
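A sketch of namespace-scoped querying; the department-to-namespace mapping is a hypothetical example, while the namespace argument itself is standard Pinecone:

```python
# Hypothetical mapping from department to Pinecone namespace
NAMESPACES = {
    'hr': 'hr-policies',
    'legal': 'legal-docs',
    'product': 'product-specs',
}

def scoped_query(index, query_embedding, department=None, top_k=7):
    """Query one department's namespace, or the default namespace if none given."""
    namespace = NAMESPACES.get(department)
    kwargs = {'vector': query_embedding, 'top_k': top_k, 'include_metadata': True}
    if namespace:
        kwargs['namespace'] = namespace  # restrict search to this namespace
    return index.query(**kwargs)
```

Ingestion follows the same pattern: pass namespace= on each upsert so every document lands in the knowledge base it belongs to.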

Incremental ingestion with document versioning is the second critical requirement. When a document is updated, you need to delete the old vectors and ingest the new version, not re-ingest the entire knowledge base. Implement a document registry that tracks which document versions are currently indexed, and build a sync pipeline that runs on a schedule or triggers on document update events.
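A minimal registry sketch, assuming the registry lives in a local JSON file (the path is illustrative) and records the vector IDs written for each document so stale vectors can later be deleted by ID:

```python
import hashlib
import json
from pathlib import Path

REGISTRY_PATH = Path('registry.json')  # hypothetical location

def content_hash(file_path: str) -> str:
    """Hash a file's bytes so unchanged documents can be skipped entirely."""
    return hashlib.sha256(Path(file_path).read_bytes()).hexdigest()

def load_registry() -> dict:
    return json.loads(REGISTRY_PATH.read_text()) if REGISTRY_PATH.exists() else {}

def needs_reingest(registry: dict, source: str, new_hash: str) -> bool:
    """True if the document is new or its content changed since last ingestion."""
    return registry.get(source, {}).get('hash') != new_hash

def record_ingestion(registry: dict, source: str, new_hash: str, vector_ids: list):
    # Storing the vector IDs lets the sync job remove the previous version
    # with index.delete(ids=...) before upserting the new one
    registry[source] = {'hash': new_hash, 'vector_ids': vector_ids}
    REGISTRY_PATH.write_text(json.dumps(registry, indent=2))
```

Deleting by stored IDs is the safe route on serverless Pinecone indexes, where deletes by metadata filter are not generally available.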

Evaluation and monitoring come third. You need to measure retrieval quality and generation quality separately. Track retrieval recall (did we return the right chunks?), relevance score distributions, and generation quality (are answers grounded? are sources cited correctly?). Without measurement, you can't improve. Claude's extended thinking mode can also be used to improve reasoning quality on complex multi-hop queries where the answer requires synthesising across several document chunks.
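Retrieval recall is straightforward to compute once you have a labelled evaluation set of queries and their known-relevant chunk IDs; a sketch:

```python
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the known-relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# One of the two relevant chunks retrieved in the top 3
print(recall_at_k(['a', 'x', 'y'], {'a', 'c'}, k=3))  # → 0.5
```

Averaging this over a few dozen labelled queries, re-run after every chunking or threshold change, is usually enough to catch retrieval regressions before users do.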

Finally, cost architecture. A RAG query typically costs: one embedding call (~$0.0001), one Pinecone query (~$0.001), and one Claude API call (~$0.01–0.05 depending on context size and model). At 10,000 queries per day, that works out to roughly $100–500 per day in API costs alone. Use Claude Haiku for simple factual queries, Sonnet for analysis, and Opus only for high-stakes synthesis tasks. Read the prompt caching guide to cut costs on shared system prompts.
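Those per-query figures multiply out as follows; the unit costs are the rough assumptions above, not quoted prices:

```python
def daily_api_cost(queries: int, embed_usd: float = 0.0001,
                   pinecone_usd: float = 0.001, claude_usd: float = 0.02) -> float:
    """Rough daily API spend for a RAG workload at the given query volume."""
    # Per-query cost is embedding + vector search + generation
    return queries * (embed_usd + pinecone_usd + claude_usd)

# 10,000 queries/day at ~$0.0211 per query
print(round(daily_api_cost(10_000), 2))  # → 211.0
```

Note that the Claude call dominates, which is why routing simple queries to a cheaper model is the highest-leverage cost lever.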

Key Takeaways

  • Chunk size (400–600 tokens with 50-token overlap) is the most important tuning parameter in RAG
  • Use the same embedding model for both ingestion and query; never mix models in one index
  • Set a minimum relevance score threshold (0.65–0.75) to avoid injecting low-quality context
  • Hybrid search (vector + BM25) outperforms pure semantic search for enterprise knowledge bases with precise terminology
  • Measure retrieval quality and generation quality separately; they fail for different reasons


ClaudeImplementation Team

Claude Certified Architects building enterprise RAG systems and AI integrations from strategy to production.