Building Production-Ready RAG Systems with AWS Bedrock
A comprehensive guide to building scalable Retrieval Augmented Generation systems using Amazon Bedrock, Claude, and vector databases.
1. Introduction to RAG
Retrieval Augmented Generation (RAG) has emerged as the go-to architecture for building AI applications that need access to private or current information. Unlike fine-tuning, RAG allows you to ground LLM responses in your own data without the cost and complexity of model training.
Why RAG over Fine-tuning?
RAG is ideal when your data changes frequently, you need source attribution, or you want to avoid the cost and latency of fine-tuning. Fine-tuning is better for teaching the model new behaviors or specialized reasoning patterns.
Key Benefits of RAG
- Access to private, proprietary data without fine-tuning
- Real-time information retrieval (no knowledge cutoff)
- Source attribution and explainability
- Lower cost than fine-tuning or training custom models
- Easy to update knowledge base without retraining
2. RAG Architecture Overview
A production RAG system consists of two main pipelines: the ingestion pipeline (processing and storing documents) and the retrieval pipeline (answering queries).
RAG Architecture on AWS (diagram)

- Ingestion pipeline: Documents (S3, APIs, DBs) → Chunking (split & process) → Embeddings (Titan Embeddings) → Vector store (OpenSearch Serverless, Pinecone, or pgvector)
- Query pipeline: User query (natural language) → Retrieval (semantic search) → Claude 3 on AWS Bedrock
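To make the flow concrete before diving into setup, here is a minimal, illustrative sketch of the two pipelines. The function names are placeholders, and text_splitter, vector_store, and llm are the objects configured in the sections that follow.

```python
# Illustrative sketch only: text_splitter, vector_store, and llm are set up in sections 3-5.

def ingest_documents(documents, text_splitter, vector_store):
    """Ingestion pipeline: split raw documents and index their embeddings."""
    chunks = text_splitter.split_documents(documents)
    vector_store.add_documents(chunks)  # embeds each chunk and stores the vectors

def answer_question(query, vector_store, llm, k=5):
    """Query pipeline: retrieve relevant chunks and ground the LLM's answer in them."""
    docs = vector_store.similarity_search(query, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.invoke(prompt).content
```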
3. Setting Up AWS Bedrock
Amazon Bedrock provides access to foundation models from Anthropic (Claude), Amazon (Titan), and others through a unified API. Here's how to set up Bedrock with LangChain:
```python
import boto3
from langchain_aws import ChatBedrock
from langchain_aws import BedrockEmbeddings

# Initialize Bedrock client
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1'
)

# Initialize Claude model
llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    client=bedrock_runtime,
    model_kwargs={
        "temperature": 0.1,
        "max_tokens": 4096
    }
)

# Initialize embeddings model
embeddings = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="amazon.titan-embed-text-v2:0"
)
```

Prerequisites
Ensure you have enabled model access in the AWS Bedrock console and have the appropriate IAM permissions (bedrock:InvokeModel).
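A quick way to confirm model access is to list the foundation models visible to your account with the Bedrock control-plane client. This check is optional, and the region used here is just an example.

```python
import boto3

# The 'bedrock' control-plane client (distinct from 'bedrock-runtime') lists available models
bedrock = boto3.client(service_name="bedrock", region_name="us-east-1")

model_ids = [m["modelId"] for m in bedrock.list_foundation_models()["modelSummaries"]]
print("anthropic.claude-3-sonnet-20240229-v1:0" in model_ids)  # True once access is granted
```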
Choosing the Right Model
| Model | Best For | Context | Cost |
|---|---|---|---|
| Claude 3.5 Sonnet | Best balance of speed & quality | 200K tokens | $$ |
| Claude 3 Opus | Complex reasoning tasks | 200K tokens | $$$$ |
| Claude 3 Haiku | High-volume, simple queries | 200K tokens | $ |
| Titan Embeddings v2 | Document embeddings | 8K tokens | $ |
4. Choosing a Vector Database
The vector database stores document embeddings and enables semantic search. Here are the top options for AWS deployments:
| Option | Best For | Pros | Cons |
|---|---|---|---|
| OpenSearch Serverless | Enterprise production | Fully managed, hybrid search, native AWS integration | Higher cost at scale |
| Pinecone | Quick prototypes | Simple API, fast queries, good free tier | External service; data leaves AWS |
| pgvector (RDS) | Small datasets | Low cost, familiar SQL, single database | Manual scaling, limited features |
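If you choose pgvector on RDS instead, a minimal setup sketch using the langchain_community integration might look like this; the connection string and collection name are placeholders, and it assumes the pgvector extension is enabled on the instance.

```python
from langchain_community.vectorstores import PGVector

# Placeholder connection string for an RDS PostgreSQL instance with the pgvector extension
CONNECTION_STRING = "postgresql+psycopg2://user:password@your-rds-endpoint:5432/ragdb"

pg_store = PGVector(
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings,   # Titan embeddings client from the Bedrock setup
    collection_name="knowledge-base",
)
```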
OpenSearch Serverless Setup
```python
import boto3
from langchain_community.vectorstores import OpenSearchVectorSearch
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

# AWS SigV4 authentication for OpenSearch Serverless (service name 'aoss')
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    'us-east-1',
    'aoss',
    session_token=credentials.token
)

# Initialize OpenSearch vector store
# (embeddings is the Titan embeddings client from the Bedrock setup above)
vector_store = OpenSearchVectorSearch(
    index_name="knowledge-base",
    embedding_function=embeddings,
    opensearch_url="https://your-collection.us-east-1.aoss.amazonaws.com",
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)
```

5. Document Chunking Strategies
Chunking is one of the most impactful decisions in RAG. Poor chunking leads to poor retrieval, regardless of how good your LLM is.
The Golden Rule of Chunking
Each chunk should contain enough context to answer a question on its own. If a chunk requires the previous or next chunk to make sense, your chunks are too small.
Recommended Chunking Strategy
```python
from datetime import datetime

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Optimized chunking strategy
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # ~1,000 characters works well for most use cases
    chunk_overlap=200,     # 20% overlap for context preservation
    length_function=len,   # size is measured in characters here
    separators=[
        "\n\n",  # Paragraph breaks (highest priority)
        "\n",    # Line breaks
        ". ",    # Sentences
        ", ",    # Clauses
        " ",     # Words
        ""       # Characters (fallback)
    ]
)

# Process documents
chunks = text_splitter.split_documents(documents)

# Add metadata for better retrieval
# (document_name is the name of the source file being processed)
for i, chunk in enumerate(chunks):
    chunk.metadata["chunk_id"] = i
    chunk.metadata["source_file"] = document_name
    chunk.metadata["timestamp"] = datetime.now().isoformat()
```

Chunk Size Guidelines
| Content Type | Chunk Size | Overlap |
|---|---|---|
| Technical documentation | 1000-1500 tokens | 200 tokens (20%) |
| Legal documents | 500-800 tokens | 100 tokens (15%) |
| Q&A / FAQs | Per question | None |
| Code files | Per function/class | Include imports |
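Once chunks carry their metadata, indexing them is a single call against the vector store from the previous section. A short sketch, assuming the vector_store and chunks objects from the snippets above; the sample query is only a sanity check.

```python
# Embed and index the prepared chunks (metadata is stored alongside each vector)
vector_store.add_documents(chunks)

# Sanity check: retrieve the top matches for a sample query
for doc in vector_store.similarity_search("How do I rotate my API keys?", k=3):
    print(doc.metadata["source_file"], "-", doc.page_content[:80])
```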
6. Advanced Retrieval Techniques
Basic semantic search gets you 70% of the way there. These advanced techniques can significantly improve retrieval quality:
1. Hybrid Search (Semantic + Keyword)
Combining vector similarity with traditional BM25 keyword search often outperforms either approach alone, especially for queries with specific terms or names.
```python
from opensearchpy import OpenSearch, RequestsHttpConnection

# Raw OpenSearch client for custom queries (reuses the awsauth object from section 4)
opensearch_client = OpenSearch(
    hosts=[{"host": "your-collection.us-east-1.aoss.amazonaws.com", "port": 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Hybrid search combining semantic + keyword search
def hybrid_search(query: str, k: int = 5):
    """
    Perform hybrid (semantic + keyword) search.
    The relative weighting of the two clauses is configured in the
    'hybrid-search-pipeline' normalization processor, not in this request body.
    """
    # Get query embedding (Titan embeddings client from section 3)
    query_embedding = embeddings.embed_query(query)

    # Hybrid search query
    search_query = {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    # Semantic search (vector)
                    {
                        "knn": {
                            "embedding": {
                                "vector": query_embedding,
                                "k": k
                            }
                        }
                    },
                    # Keyword search (BM25)
                    {
                        "match": {
                            "text": {
                                "query": query,
                                "boost": 1.0
                            }
                        }
                    }
                ]
            }
        }
    }

    # The named search pipeline (score normalization + weighting) must be
    # created beforehand; it is referenced here as a query parameter.
    results = opensearch_client.search(
        index="knowledge-base",
        body=search_query,
        params={"search_pipeline": "hybrid-search-pipeline"}
    )
    return results["hits"]["hits"]
```

2. Query Expansion
Use the LLM to generate alternative phrasings of the user's query before searching:
Original query: "How do I fix the login error?"

Expanded queries:
- "authentication failure troubleshooting"
- "login error resolution steps"
- "sign in problem fix"
3. Contextual Compression
After retrieval, use an LLM to extract only the relevant portions of each chunk. This reduces noise and allows you to include more documents in the context.
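LangChain ships a retriever wrapper for this pattern. Here is a sketch using the Claude client from section 3 as the extractor; the k value and the wrapped retriever are illustrative.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# The extractor prompts the LLM to keep only passages relevant to the query
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 10}),
)

# Returns trimmed-down documents instead of full chunks
compressed_docs = compression_retriever.invoke("How do I fix the login error?")
```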
4. Metadata Filtering
Add metadata to chunks (date, source, category) and filter before or after retrieval (a filtering sketch follows the list below):
- Filter by date for time-sensitive queries
- Filter by source for domain-specific questions
- Filter by access level for multi-tenant systems
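A sketch of pre-filtering with raw OpenSearch DSL, combining a knn clause with filters on the metadata fields added during chunking; the client is the one from the hybrid search example, and the field names and values are illustrative.

```python
query_embedding = embeddings.embed_query("What changed in the March release?")

search_body = {
    "size": 5,
    "query": {
        "bool": {
            # Semantic match on the embedding field
            "must": [{"knn": {"embedding": {"vector": query_embedding, "k": 5}}}],
            # Hard constraints on chunk metadata
            "filter": [
                # Exact match on the keyword sub-field created by dynamic mapping
                {"term": {"metadata.source_file.keyword": "release-notes.md"}},
                {"range": {"metadata.timestamp": {"gte": "2024-03-01"}}}
            ]
        }
    }
}

results = opensearch_client.search(index="knowledge-base", body=search_body)
```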
7. Evaluation & Monitoring
You can't improve what you don't measure. Use RAGAS (Retrieval Augmented Generation Assessment) to evaluate your RAG pipeline:
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)

# Prepare evaluation dataset (RAGAS expects a Hugging Face Dataset)
eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,   # list of retrieved chunks per question
    "ground_truth": expected_answers
})

# Run RAGAS evaluation
results = evaluate(
    dataset=eval_dataset,
    metrics=[
        faithfulness,       # Is answer faithful to context?
        answer_relevancy,   # Is answer relevant to question?
        context_precision,  # Are retrieved docs relevant?
        context_recall      # Are all relevant docs retrieved?
    ]
)

print(f"Faithfulness: {results['faithfulness']:.2f}")
print(f"Answer Relevancy: {results['answer_relevancy']:.2f}")
print(f"Context Precision: {results['context_precision']:.2f}")
print(f"Context Recall: {results['context_recall']:.2f}")
```

Key Metrics to Track
| Metric | Target | Question it answers |
|---|---|---|
| Faithfulness | > 0.85 | Is the answer grounded in the retrieved context? |
| Answer Relevancy | > 0.80 | Does the answer address the question? |
| Context Precision | > 0.75 | Are the retrieved docs relevant? |
| Context Recall | > 0.70 | Were all relevant docs retrieved? |
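These targets are easiest to enforce as an automated gate. Below is a small sketch that fails a CI run when the RAGAS scores from the snippet above drop below the targets; the thresholds mirror the table and should be tuned per application.

```python
# Minimum acceptable scores, mirroring the targets above
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.70,
}

failures = {
    metric: results[metric]
    for metric, minimum in THRESHOLDS.items()
    if results[metric] < minimum
}

if failures:
    raise SystemExit(f"RAG evaluation below target: {failures}")
```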
8. Production Best Practices
Implement Caching
Cache embeddings for repeated queries. Use ElastiCache or DynamoDB to reduce latency and costs.
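A minimal in-process sketch of the idea: memoize query embeddings keyed by a hash of the query text. A shared cache such as ElastiCache or a DynamoDB table follows the same pattern, with a network call in place of the dict.

```python
import hashlib

_embedding_cache = {}

def cached_embed_query(query: str):
    """Return a cached embedding when the same query text has been seen before."""
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embeddings.embed_query(query)
    return _embedding_cache[key]
```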
Set Up Guardrails
Use Bedrock Guardrails to filter harmful content, PII, and off-topic responses.
Monitor Costs
Track token usage per query. Set up CloudWatch alarms for unexpected spikes.
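For example, per-request token counts can be published to CloudWatch as custom metrics and alarmed on; the namespace and metric names here are arbitrary choices.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def record_token_usage(input_tokens: int, output_tokens: int) -> None:
    """Publish per-request token counts as custom CloudWatch metrics."""
    cloudwatch.put_metric_data(
        Namespace="RAG/Bedrock",
        MetricData=[
            {"MetricName": "InputTokens", "Value": input_tokens, "Unit": "Count"},
            {"MetricName": "OutputTokens", "Value": output_tokens, "Unit": "Count"},
        ],
    )
```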
Handle Failures Gracefully
Implement fallbacks: if retrieval fails, acknowledge the limitation rather than hallucinating.
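In practice this is a try/except around retrieval plus an explicit "no answer" path when nothing relevant comes back. A sketch using the pieces defined earlier; the fallback wording is illustrative.

```python
FALLBACK_MESSAGE = (
    "I couldn't find this in the knowledge base, so I'd rather not guess. "
    "Please rephrase the question or contact support."
)

def answer_with_fallback(query: str) -> str:
    try:
        docs = vector_store.similarity_search(query, k=5)
    except Exception:
        # Retrieval outage: degrade gracefully instead of answering without context
        return FALLBACK_MESSAGE
    if not docs:
        return FALLBACK_MESSAGE
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
    return llm.invoke(prompt).content
```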
Version Your Knowledge Base
Track document versions. Allow rollback if new documents degrade quality.
Implement Feedback Loops
Collect user feedback (thumbs up/down). Use it to identify retrieval failures and improve prompts.
Conclusion
Building a production-ready RAG system requires careful attention to chunking, retrieval, and evaluation. AWS Bedrock provides a solid foundation with managed infrastructure, enterprise security, and access to state-of-the-art models like Claude.
Start simple with basic semantic search, measure your baseline, then iterate with advanced techniques like hybrid search and query expansion. The key is continuous evaluation and improvement based on real user queries.
Need Help Building Your RAG System?
PATHSDATA specializes in production-ready Generative AI solutions on AWS. Let's discuss your use case.
PATHSDATA Team
AWS Select Tier Consulting Partner
We help enterprises build production-ready AI solutions on AWS. Specializing in RAG systems, data platforms, and MLOps.
