
Building Production-Ready RAG Systems with AWS Bedrock

A comprehensive guide to building scalable Retrieval Augmented Generation systems using Amazon Bedrock, Claude, and vector databases.

PATHSDATA Team · January 10, 2025 · 15 min read

1. Introduction to RAG

Retrieval Augmented Generation (RAG) has emerged as the go-to architecture for building AI applications that need access to private or current information. Unlike fine-tuning, RAG allows you to ground LLM responses in your own data without the cost and complexity of model training.

Why RAG over Fine-tuning?

RAG is ideal when your data changes frequently, you need source attribution, or you want to avoid the cost and latency of fine-tuning. Fine-tuning is better for teaching the model new behaviors or specialized reasoning patterns.

Key Benefits of RAG

  • Access to private, proprietary data without fine-tuning
  • Real-time information retrieval (no knowledge cutoff)
  • Source attribution and explainability
  • Lower cost than fine-tuning or training custom models
  • Easy to update knowledge base without retraining

2. RAG Architecture Overview

A production RAG system consists of two main pipelines: the ingestion pipeline (processing and storing documents) and the retrieval pipeline (answering queries).

RAG Architecture on AWS

  • Ingestion pipeline: Documents (S3, APIs, databases) → Chunking (split & process) → Embeddings (Titan Embeddings) → Vector store (OpenSearch Serverless, or Pinecone / pgvector)
  • Query pipeline: User query (natural language) → Retrieval (semantic search) → Claude 3 on AWS Bedrock

3. Setting Up AWS Bedrock

Amazon Bedrock provides access to foundation models from Anthropic (Claude), Amazon (Titan), and others through a unified API. Here's how to set up Bedrock with LangChain:

Python · bedrock_setup.py
import boto3
from langchain_aws import ChatBedrock
from langchain_aws import BedrockEmbeddings

# Initialize Bedrock client
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1'
)

# Initialize Claude model
llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    client=bedrock_runtime,
    model_kwargs={
        "temperature": 0.1,
        "max_tokens": 4096
    }
)

# Initialize embeddings model
embeddings = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="amazon.titan-embed-text-v2:0"
)

Prerequisites

Ensure you have enabled model access in the AWS Bedrock console and have the appropriate IAM permissions (bedrock:InvokeModel).
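With model access enabled, a quick smoke test confirms both models respond (the prompts are illustrative):

# Quick smoke test: ask Claude a question through the LangChain wrapper
response = llm.invoke("Summarize Retrieval Augmented Generation in one sentence.")
print(response.content)

# Embed a sample string to confirm the embeddings model is reachable
vector = embeddings.embed_query("What is RAG?")
print(f"Embedding dimension: {len(vector)}")  # 1024 by default for Titan Embed Text v2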

Choosing the Right Model

Model | Best For | Context | Cost
Claude 3.5 Sonnet | Best balance of speed & quality | 200K tokens | $$
Claude 3 Opus | Complex reasoning tasks | 200K tokens | $$$$
Claude 3 Haiku | High-volume, simple queries | 200K tokens | $
Titan Embeddings v2 | Document embeddings | 8K tokens | $

4. Choosing a Vector Database

The vector database stores document embeddings and enables semantic search. Here are the top options for AWS deployments:

OpenSearch Serverless (best for enterprise production)

  • Pros: fully managed, hybrid search, native AWS integration
  • Cons: higher cost at scale

Pinecone (best for quick prototypes)

  • Pros: simple API, fast queries, good free tier
  • Cons: external service, data leaves AWS

pgvector on RDS (best for small datasets)

  • Pros: low cost, familiar SQL, single database
  • Cons: manual scaling, limited features

OpenSearch Serverless Setup

Python · vector_store.py
import boto3
from langchain_community.vectorstores import OpenSearchVectorSearch
from opensearchpy import RequestsHttpConnection
from requests_aws4auth import AWS4Auth

# Reuses the `embeddings` instance created in bedrock_setup.py

# AWS authentication for OpenSearch
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    'us-east-1',
    'aoss',
    session_token=credentials.token
)

# Initialize OpenSearch vector store
vector_store = OpenSearchVectorSearch(
    index_name="knowledge-base",
    embedding_function=embeddings,
    opensearch_url="https://your-collection.us-east-1.aoss.amazonaws.com",
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)
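Once the collection is reachable, indexing chunks and querying them uses the standard LangChain vector store interface. A minimal sketch, assuming `chunks` comes from the chunking step in the next section (the query text is illustrative):

# Index document chunks (embeddings are computed with Titan under the hood)
vector_store.add_documents(chunks)

# Retrieve the five most similar chunks for a user question
results = vector_store.similarity_search("How do I rotate IAM access keys?", k=5)

for doc in results:
    print(doc.metadata.get("source_file"), doc.page_content[:100])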

5. Document Chunking Strategies

Chunking is one of the most impactful decisions in RAG. Poor chunking leads to poor retrieval, regardless of how good your LLM is.

The Golden Rule of Chunking

Each chunk should contain enough context to answer a question on its own. If a chunk requires the previous or next chunk to make sense, your chunks are too small.

Recommended Chunking Strategy

Python · chunking.py
from datetime import datetime

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Optimized chunking strategy
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Optimal for most use cases
    chunk_overlap=200,      # 20% overlap for context preservation
    length_function=len,
    separators=[
        "\n\n",           # Paragraph breaks (highest priority)
        "\n",              # Line breaks
        ". ",               # Sentences
        ", ",               # Clauses
        " ",                # Words
        ""                  # Characters (fallback)
    ]
)

# Process documents
chunks = text_splitter.split_documents(documents)

# Add metadata for better retrieval
for i, chunk in enumerate(chunks):
    chunk.metadata["chunk_id"] = i
    chunk.metadata["source_file"] = document_name
    chunk.metadata["timestamp"] = datetime.now().isoformat()

Chunk Size Guidelines

Content Type | Chunk Size | Overlap
Technical documentation | 1000-1500 tokens | 200 tokens (20%)
Legal documents | 500-800 tokens | 100 tokens (15%)
Q&A / FAQs | Per question | None
Code files | Per function/class | Include imports

6. Advanced Retrieval Techniques

Basic semantic search gets you 70% of the way there. These advanced techniques can significantly improve retrieval quality:

1. Hybrid Search (Semantic + Keyword)

Combining vector similarity with traditional BM25 keyword search often outperforms either approach alone, especially for queries with specific terms or names.

Python · hybrid_search.py
# Assumes `embeddings` (Titan) and `opensearch_client` (an OpenSearch client
# authenticated with the AWS4Auth credentials shown earlier) are initialized.

# Hybrid search combining semantic + keyword search
def hybrid_search(query: str, k: int = 5):
    """
    Perform hybrid semantic + keyword search.
    The relative weighting of the two sub-queries is configured in the
    'hybrid-search-pipeline' search pipeline (normalization processor weights).
    """
    
    # Get query embedding
    query_embedding = embeddings.embed_query(query)
    
    # Hybrid search query
    search_query = {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    # Semantic search (vector)
                    {
                        "knn": {
                            "embedding": {
                                "vector": query_embedding,
                                "k": k
                            }
                        }
                    },
                    # Keyword search (BM25)
                    {
                        "match": {
                            "text": {
                                "query": query,
                                "boost": 1.0
                            }
                        }
                    }
                ]
            }
        }
    }

    results = opensearch_client.search(
        index="knowledge-base",
        body=search_query,
        # Named search pipeline that normalizes and combines the two result sets
        params={"search_pipeline": "hybrid-search-pipeline"}
    )
    
    return results["hits"]["hits"]

2. Query Expansion

Use the LLM to generate alternative phrasings of the user's query before searching:

Original Query

"How do I fix the login error?"

Expanded Queries

  • • "authentication failure troubleshooting"
  • • "login error resolution steps"
  • • "sign in problem fix"

3. Contextual Compression

After retrieval, use an LLM to extract only the relevant portions of each chunk. This reduces noise and allows you to include more documents in the context.
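LangChain provides a retriever wrapper for this pattern. A minimal sketch, reusing the Bedrock LLM and vector store configured earlier (the query and k value are illustrative):

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# LLM-based extractor that keeps only the query-relevant parts of each chunk
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 10}),
)

# Retrieve 10 chunks, then compress each one down to its relevant passages
compressed_docs = compression_retriever.invoke("How do I fix the login error?")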

4. Metadata Filtering

Add metadata to chunks (date, source, category) and filter before or after retrieval (a code sketch follows this list):

  • Filter by date for time-sensitive queries
  • Filter by source for domain-specific questions
  • Filter by access level for multi-tenant systems
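A minimal sketch at the OpenSearch query level, combining metadata filters with the kNN clause (field names follow the earlier examples and are illustrative; whether the filter is applied before or after the approximate kNN step depends on the engine):

query = "Why was my March invoice higher than usual?"  # illustrative
query_embedding = embeddings.embed_query(query)

# Restrict the kNN search by category and date using metadata fields
search_query = {
    "size": 5,
    "query": {
        "bool": {
            "filter": [
                {"term": {"metadata.category": "billing"}},
                {"range": {"metadata.timestamp": {"gte": "2024-01-01"}}}
            ],
            "must": [
                {"knn": {"embedding": {"vector": query_embedding, "k": 5}}}
            ]
        }
    }
}

results = opensearch_client.search(index="knowledge-base", body=search_query)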

7. Evaluation & Monitoring

You can't improve what you don't measure. Use RAGAS (Retrieval Augmented Generation Assessment) to evaluate your RAG pipeline:

Python · evaluation.py
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)

# Prepare evaluation dataset (RAGAS expects a Hugging Face Dataset)
eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,   # list of lists of retrieved chunk texts
    "ground_truth": expected_answers
})

# Run RAGAS evaluation
results = evaluate(
    dataset=eval_dataset,
    metrics=[
        faithfulness,        # Is answer faithful to context?
        answer_relevancy,    # Is answer relevant to question?
        context_precision,   # Are retrieved docs relevant?
        context_recall       # Are all relevant docs retrieved?
    ]
)

print(f"Faithfulness: {results['faithfulness']:.2f}")
print(f"Answer Relevancy: {results['answer_relevancy']:.2f}")
print(f"Context Precision: {results['context_precision']:.2f}")
print(f"Context Recall: {results['context_recall']:.2f}")

Key Metrics to Track

Metric | Target | What it measures
Faithfulness | > 0.85 | Is the answer grounded in the retrieved context?
Answer Relevancy | > 0.80 | Does the answer address the question?
Context Precision | > 0.75 | Are the retrieved docs relevant?
Context Recall | > 0.70 | Were all relevant docs retrieved?

8. Production Best Practices

Implement Caching

Cache embeddings for repeated queries. Use ElastiCache or DynamoDB to reduce latency and costs.
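A minimal sketch of query-embedding caching with DynamoDB (the table name and key schema are hypothetical; the cache key is a hash of the normalized query):

import hashlib
import json

import boto3

dynamodb = boto3.resource("dynamodb")
cache_table = dynamodb.Table("rag-embedding-cache")  # hypothetical table

def cached_embed_query(query: str) -> list[float]:
    """Return a cached embedding if available, otherwise compute and store it."""
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()

    item = cache_table.get_item(Key={"query_hash": key}).get("Item")
    if item:
        return json.loads(item["embedding"])

    vector = embeddings.embed_query(query)
    cache_table.put_item(Item={"query_hash": key, "embedding": json.dumps(vector)})
    return vector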

Set Up Guardrails

Use Bedrock Guardrails to filter harmful content, PII, and off-topic responses.
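Once a guardrail is created in the Bedrock console, it can be attached per invocation. A minimal sketch using the Converse API (the guardrail ID, version, and question variable are placeholders):

# Attach a Bedrock Guardrail when invoking the model via the Converse API
response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"role": "user", "content": [{"text": user_question}]}],
    guardrailConfig={
        "guardrailIdentifier": "your-guardrail-id",  # placeholder
        "guardrailVersion": "1"
    },
)
answer = response["output"]["message"]["content"][0]["text"]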

Monitor Costs

Track token usage per query. Set up CloudWatch alarms for unexpected spikes.
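Token usage is returned on every Converse call and can be published to CloudWatch as custom metrics for alarming. A minimal sketch (the namespace and metric names are illustrative):

cloudwatch = boto3.client("cloudwatch")

usage = response["usage"]  # token counts returned by the Converse API call above
cloudwatch.put_metric_data(
    Namespace="RAG/Bedrock",
    MetricData=[
        {"MetricName": "InputTokens", "Value": usage["inputTokens"], "Unit": "Count"},
        {"MetricName": "OutputTokens", "Value": usage["outputTokens"], "Unit": "Count"},
    ],
)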

Handle Failures Gracefully

Implement fallbacks: if retrieval fails, acknowledge the limitation rather than hallucinating.
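A minimal sketch of a fallback path when retrieval errors out or returns nothing useful:

FALLBACK_ANSWER = (
    "I couldn't find relevant information in the knowledge base for this question. "
    "Please rephrase it or contact support."
)

def answer_question(query: str) -> str:
    try:
        docs = vector_store.similarity_search(query, k=5)
    except Exception:
        # Retrieval backend unavailable: fail safely instead of answering unsupported
        return FALLBACK_ANSWER

    if not docs:
        return FALLBACK_ANSWER

    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.invoke(prompt).content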

Version Your Knowledge Base

Track document versions. Allow rollback if new documents degrade quality.

Implement Feedback Loops

Collect user feedback (thumbs up/down). Use it to identify retrieval failures and improve prompts.
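A minimal sketch of capturing thumbs up/down events together with the retrieved chunk IDs so retrieval failures can be traced later (the table and fields are hypothetical):

from datetime import datetime, timezone

import boto3

feedback_table = boto3.resource("dynamodb").Table("rag-feedback")  # hypothetical table

def record_feedback(query: str, answer: str, chunk_ids: list[int], helpful: bool) -> None:
    """Store one feedback event for later analysis of retrieval quality."""
    feedback_table.put_item(Item={
        "query": query,
        "answer": answer,
        "chunk_ids": chunk_ids,
        "helpful": helpful,
        "created_at": datetime.now(timezone.utc).isoformat(),
    })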

Conclusion

Building a production-ready RAG system requires careful attention to chunking, retrieval, and evaluation. AWS Bedrock provides a solid foundation with managed infrastructure, enterprise security, and access to state-of-the-art models like Claude.

Start simple with basic semantic search, measure your baseline, then iterate with advanced techniques like hybrid search and query expansion. The key is continuous evaluation and improvement based on real user queries.

Need Help Building Your RAG System?

PATHSDATA specializes in production-ready Generative AI solutions on AWS. Let's discuss your use case.

Tags: RAG · AWS Bedrock · Claude · Vector Database · LangChain · Generative AI

PATHSDATA Team

AWS Select Tier Consulting Partner

We help enterprises build production-ready AI solutions on AWS. Specializing in RAG systems, data platforms, and MLOps.