Building Production-Ready RAG Systems with AWS Bedrock
A comprehensive guide to building scalable Retrieval Augmented Generation systems using Amazon Bedrock, Claude, and vector databases.
1. Introduction to RAG
Retrieval Augmented Generation (RAG) has emerged as the go-to architecture for building AI applications that need access to private or current information. Unlike fine-tuning, RAG allows you to ground LLM responses in your own data without the cost and complexity of model training.
Why RAG over Fine-tuning?
RAG is ideal when your data changes frequently, you need source attribution, or you want to avoid the cost and latency of fine-tuning. Fine-tuning is better for teaching the model new behaviors or specialized reasoning patterns.
Key Benefits of RAG
- Access to private, proprietary data without fine-tuning
- Real-time information retrieval (no knowledge cutoff)
- Source attribution and explainability
- Lower cost than fine-tuning or training custom models
- Easy to update knowledge base without retraining
2. RAG Architecture Overview
A production RAG system consists of two main pipelines: the ingestion pipeline (processing and storing documents) and the retrieval pipeline (answering queries).
RAG Architecture on AWS (diagram)

- Ingestion pipeline: Documents (S3, APIs, DBs) → Chunking (split & process) → Embeddings (Titan Embeddings) → Vector store (OpenSearch Serverless, Pinecone, or pgvector)
- Query pipeline: User query (natural language) → Retrieval (semantic search) → Claude 3 on AWS Bedrock
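To make the flow concrete before diving into setup, here is a minimal, illustrative sketch of the two pipelines. The function names are placeholders, and text_splitter, vector_store, and llm are the objects configured in the sections that follow.

```python
# Illustrative sketch only: text_splitter, vector_store, and llm are set up in sections 3-5.

def ingest_documents(documents, text_splitter, vector_store):
    """Ingestion pipeline: split raw documents and index their embeddings."""
    chunks = text_splitter.split_documents(documents)
    vector_store.add_documents(chunks)  # embeds each chunk and stores the vectors

def answer_question(query, vector_store, llm, k=5):
    """Query pipeline: retrieve relevant chunks and ground the LLM's answer in them."""
    docs = vector_store.similarity_search(query, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.invoke(prompt).content
```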
3. Setting Up AWS Bedrock
Amazon Bedrock provides access to foundation models from Anthropic (Claude), Amazon (Titan), and others through a unified API. Here's how to set up Bedrock with LangChain:
```python
import boto3
from langchain_aws import ChatBedrock
from langchain_aws import BedrockEmbeddings

# Initialize Bedrock client
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1'
)

# Initialize Claude model
llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    client=bedrock_runtime,
    model_kwargs={
        "temperature": 0.1,
        "max_tokens": 4096
    }
)

# Initialize embeddings model
embeddings = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="amazon.titan-embed-text-v2:0"
)
```

Prerequisites
Ensure you have enabled model access in the AWS Bedrock console and have the appropriate IAM permissions (bedrock:InvokeModel).
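A quick way to confirm model access is to list the foundation models visible to your account with the Bedrock control-plane client. This check is optional, and the region used here is just an example.

```python
import boto3

# The 'bedrock' control-plane client (distinct from 'bedrock-runtime') lists available models
bedrock = boto3.client(service_name="bedrock", region_name="us-east-1")

model_ids = [m["modelId"] for m in bedrock.list_foundation_models()["modelSummaries"]]
print("anthropic.claude-3-sonnet-20240229-v1:0" in model_ids)  # True once access is granted
```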
Choosing the Right Model
| Model | Best For | Context | Cost |
|---|---|---|---|
| Claude 3.5 Sonnet | Best balance of speed & quality | 200K tokens | $$ |
| Claude 3 Opus | Complex reasoning tasks | 200K tokens | $$$$ |
| Claude 3 Haiku | High-volume, simple queries | 200K tokens | $ |
| Titan Embeddings v2 | Document embeddings | 8K tokens | $ |
4. Choosing a Vector Database
The vector database stores document embeddings and enables semantic search. Here are the top options for AWS deployments:
| Option | Best For | Pros | Cons |
|---|---|---|---|
| OpenSearch Serverless | Enterprise production | Fully managed, hybrid search, native AWS integration | Higher cost at scale |
| Pinecone | Quick prototypes | Simple API, fast queries, good free tier | External service; data leaves AWS |
| pgvector (RDS) | Small datasets | Low cost, familiar SQL, single database | Manual scaling, limited features |
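If you choose pgvector on RDS instead, a minimal setup sketch using the langchain_community integration might look like this; the connection string and collection name are placeholders, and it assumes the pgvector extension is enabled on the instance.

```python
from langchain_community.vectorstores import PGVector

# Placeholder connection string for an RDS PostgreSQL instance with the pgvector extension
CONNECTION_STRING = "postgresql+psycopg2://user:password@your-rds-endpoint:5432/ragdb"

pg_store = PGVector(
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings,   # Titan embeddings client from the Bedrock setup
    collection_name="knowledge-base",
)
```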
OpenSearch Serverless Setup
```python
import boto3
from langchain_community.vectorstores import OpenSearchVectorSearch
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

# AWS SigV4 authentication for OpenSearch Serverless (service name 'aoss')
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    'us-east-1',
    'aoss',
    session_token=credentials.token
)

# Initialize OpenSearch vector store
# (embeddings is the Titan embeddings client from the Bedrock setup above)
vector_store = OpenSearchVectorSearch(
    index_name="knowledge-base",
    embedding_function=embeddings,
    opensearch_url="https://your-collection.us-east-1.aoss.amazonaws.com",
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)
```

5. Document Chunking Strategies
Chunking is one of the most impactful decisions in RAG. Poor chunking leads to poor retrieval, regardless of how good your LLM is.
The Golden Rule of Chunking
Each chunk should contain enough context to answer a question on its own. If a chunk requires the previous or next chunk to make sense, your chunks are too small.
Recommended Chunking Strategy
```python
from datetime import datetime

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Optimized chunking strategy
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # ~1,000 characters works well for most use cases
    chunk_overlap=200,     # 20% overlap for context preservation
    length_function=len,   # size is measured in characters here
    separators=[
        "\n\n",  # Paragraph breaks (highest priority)
        "\n",    # Line breaks
        ". ",    # Sentences
        ", ",    # Clauses
        " ",     # Words
        ""       # Characters (fallback)
    ]
)

# Process documents
chunks = text_splitter.split_documents(documents)

# Add metadata for better retrieval
# (document_name is the name of the source file being processed)
for i, chunk in enumerate(chunks):
    chunk.metadata["chunk_id"] = i
    chunk.metadata["source_file"] = document_name
    chunk.metadata["timestamp"] = datetime.now().isoformat()
```

Chunk Size Guidelines
| Content Type | Chunk Size | Overlap |
|---|---|---|
| Technical documentation | 1000-1500 tokens | 200 tokens (20%) |
| Legal documents | 500-800 tokens | 100 tokens (15%) |
| Q&A / FAQs | Per question | None |
| Code files | Per function/class | Include imports |
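Once chunks carry their metadata, indexing them is a single call against the vector store from the previous section. A short sketch, assuming the vector_store and chunks objects from the snippets above; the sample query is only a sanity check.

```python
# Embed and index the prepared chunks (metadata is stored alongside each vector)
vector_store.add_documents(chunks)

# Sanity check: retrieve the top matches for a sample query
for doc in vector_store.similarity_search("How do I rotate my API keys?", k=3):
    print(doc.metadata["source_file"], "-", doc.page_content[:80])
```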
6. Advanced Retrieval Techniques
Basic semantic search gets you 70% of the way there. These advanced techniques can significantly improve retrieval quality:
1. Hybrid Search (Semantic + Keyword)
Combining vector similarity with traditional BM25 keyword search often outperforms either approach alone, especially for queries with specific terms or names.
```python
from opensearchpy import OpenSearch, RequestsHttpConnection

# Raw OpenSearch client for custom queries (reuses the awsauth object from section 4)
opensearch_client = OpenSearch(
    hosts=[{"host": "your-collection.us-east-1.aoss.amazonaws.com", "port": 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Hybrid search combining semantic + keyword search
def hybrid_search(query: str, k: int = 5):
    """
    Perform hybrid (semantic + keyword) search.
    The relative weighting of the two clauses is configured in the
    'hybrid-search-pipeline' normalization processor, not in this request body.
    """
    # Get query embedding (Titan embeddings client from section 3)
    query_embedding = embeddings.embed_query(query)

    # Hybrid search query
    search_query = {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    # Semantic search (vector)
                    {
                        "knn": {
                            "embedding": {
                                "vector": query_embedding,
                                "k": k
                            }
                        }
                    },
                    # Keyword search (BM25)
                    {
                        "match": {
                            "text": {
                                "query": query,
                                "boost": 1.0
                            }
                        }
                    }
                ]
            }
        }
    }

    # The named search pipeline (score normalization + weighting) must be
    # created beforehand; it is referenced here as a query parameter.
    results = opensearch_client.search(
        index="knowledge-base",
        body=search_query,
        params={"search_pipeline": "hybrid-search-pipeline"}
    )
    return results["hits"]["hits"]
```

2. Query Expansion
Use the LLM to generate alternative phrasings of the user's query before searching:
Original query: "How do I fix the login error?"

Expanded queries:
- "authentication failure troubleshooting"
- "login error resolution steps"
- "sign in problem fix"
3. Contextual Compression
After retrieval, use an LLM to extract only the relevant portions of each chunk. This reduces noise and allows you to include more documents in the context.
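LangChain ships a retriever wrapper for this pattern. Here is a sketch using the Claude client from section 3 as the extractor; the k value and the wrapped retriever are illustrative.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# The extractor prompts the LLM to keep only passages relevant to the query
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 10}),
)

# Returns trimmed-down documents instead of full chunks
compressed_docs = compression_retriever.invoke("How do I fix the login error?")
```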
4. Metadata Filtering
Add metadata to chunks (date, source, category) and filter before or after retrieval (a filtering sketch follows the list below):
- Filter by date for time-sensitive queries
- Filter by source for domain-specific questions
- Filter by access level for multi-tenant systems
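A sketch of pre-filtering with raw OpenSearch DSL, combining a knn clause with filters on the metadata fields added during chunking; the client is the one from the hybrid search example, and the field names and values are illustrative.

```python
query_embedding = embeddings.embed_query("What changed in the March release?")

search_body = {
    "size": 5,
    "query": {
        "bool": {
            # Semantic match on the embedding field
            "must": [{"knn": {"embedding": {"vector": query_embedding, "k": 5}}}],
            # Hard constraints on chunk metadata
            "filter": [
                # Exact match on the keyword sub-field created by dynamic mapping
                {"term": {"metadata.source_file.keyword": "release-notes.md"}},
                {"range": {"metadata.timestamp": {"gte": "2024-03-01"}}}
            ]
        }
    }
}

results = opensearch_client.search(index="knowledge-base", body=search_body)
```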
7. Evaluation & Monitoring
You can't improve what you don't measure. Use RAGAS (Retrieval Augmented Generation Assessment) to evaluate your RAG pipeline:
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)

# Prepare evaluation dataset (RAGAS expects a Hugging Face Dataset)
eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,   # list of retrieved chunks per question
    "ground_truth": expected_answers
})

# Run RAGAS evaluation
results = evaluate(
    dataset=eval_dataset,
    metrics=[
        faithfulness,       # Is answer faithful to context?
        answer_relevancy,   # Is answer relevant to question?
        context_precision,  # Are retrieved docs relevant?
        context_recall      # Are all relevant docs retrieved?
    ]
)

print(f"Faithfulness: {results['faithfulness']:.2f}")
print(f"Answer Relevancy: {results['answer_relevancy']:.2f}")
print(f"Context Precision: {results['context_precision']:.2f}")
print(f"Context Recall: {results['context_recall']:.2f}")
```

Key Metrics to Track
| Metric | Target | Question it answers |
|---|---|---|
| Faithfulness | > 0.85 | Is the answer grounded in the retrieved context? |
| Answer Relevancy | > 0.80 | Does the answer address the question? |
| Context Precision | > 0.75 | Are the retrieved docs relevant? |
| Context Recall | > 0.70 | Were all relevant docs retrieved? |
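These targets are easiest to enforce as an automated gate. Below is a small sketch that fails a CI run when the RAGAS scores from the snippet above drop below the targets; the thresholds mirror the table and should be tuned per application.

```python
# Minimum acceptable scores, mirroring the targets above
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.70,
}

failures = {
    metric: results[metric]
    for metric, minimum in THRESHOLDS.items()
    if results[metric] < minimum
}

if failures:
    raise SystemExit(f"RAG evaluation below target: {failures}")
```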
8. Production Best Practices
Implement Caching
Cache embeddings for repeated queries. Use ElastiCache or DynamoDB to reduce latency and costs.
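A minimal in-process sketch of the idea: memoize query embeddings keyed by a hash of the query text. A shared cache such as ElastiCache or a DynamoDB table follows the same pattern, with a network call in place of the dict.

```python
import hashlib

_embedding_cache = {}

def cached_embed_query(query: str):
    """Return a cached embedding when the same query text has been seen before."""
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embeddings.embed_query(query)
    return _embedding_cache[key]
```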
Set Up Guardrails
Use Bedrock Guardrails to filter harmful content, PII, and off-topic responses.
Monitor Costs
Track token usage per query. Set up CloudWatch alarms for unexpected spikes.
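For example, per-request token counts can be published to CloudWatch as custom metrics and alarmed on; the namespace and metric names here are arbitrary choices.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def record_token_usage(input_tokens: int, output_tokens: int) -> None:
    """Publish per-request token counts as custom CloudWatch metrics."""
    cloudwatch.put_metric_data(
        Namespace="RAG/Bedrock",
        MetricData=[
            {"MetricName": "InputTokens", "Value": input_tokens, "Unit": "Count"},
            {"MetricName": "OutputTokens", "Value": output_tokens, "Unit": "Count"},
        ],
    )
```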
Handle Failures Gracefully
Implement fallbacks: if retrieval fails, acknowledge the limitation rather than hallucinating.
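In practice this is a try/except around retrieval plus an explicit "no answer" path when nothing relevant comes back. A sketch using the pieces defined earlier; the fallback wording is illustrative.

```python
FALLBACK_MESSAGE = (
    "I couldn't find this in the knowledge base, so I'd rather not guess. "
    "Please rephrase the question or contact support."
)

def answer_with_fallback(query: str) -> str:
    try:
        docs = vector_store.similarity_search(query, k=5)
    except Exception:
        # Retrieval outage: degrade gracefully instead of answering without context
        return FALLBACK_MESSAGE
    if not docs:
        return FALLBACK_MESSAGE
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
    return llm.invoke(prompt).content
```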
Version Your Knowledge Base
Track document versions. Allow rollback if new documents degrade quality.
Implement Feedback Loops
Collect user feedback (thumbs up/down). Use it to identify retrieval failures and improve prompts.
Conclusion
Building a production-ready RAG system requires careful attention to chunking, retrieval, and evaluation. AWS Bedrock provides a solid foundation with managed infrastructure, enterprise security, and access to state-of-the-art models like Claude.
Start simple with basic semantic search, measure your baseline, then iterate with advanced techniques like hybrid search and query expansion. The key is continuous evaluation and improvement based on real user queries.
Need Help Building Your RAG System?
PATHSDATA specializes in production-ready Generative AI solutions on AWS. Let's discuss your use case.
PATHSDATA Team
AWS Select Tier Consulting Partner
We help enterprises build production-ready AI solutions on AWS. Specializing in RAG systems, data platforms, and MLOps.
