# How to Build a Production RAG System in 2026
Retrieval Augmented Generation (RAG) is the technique that lets you ask an LLM questions about your own documents, codebase, or knowledge base — and get accurate, cited answers.
This guide covers building a production RAG system from scratch. By the end, you'll understand every component and why it matters.
## What is RAG?
Without RAG, LLMs can only answer questions about information they were trained on (knowledge cutoff). RAG solves this by:
- Ingesting your documents into a vector database
- Retrieving relevant chunks when a question is asked
- Augmenting the LLM prompt with those chunks
- Generating an answer grounded in your specific documents
The LLM becomes a reasoning engine on top of your data.
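The whole loop can be sketched framework-free. The word-overlap "retriever" below is a toy stand-in for the embedding-based search built in the later steps; everything else (chunk store, prompt augmentation) is the real shape of the pipeline:

```python
import re

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by shared words with the query (toy stand-in for vector search)."""
    q_words = set(re.findall(r"\w+", query.lower()))
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(re.findall(r"\w+", c.lower()))),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Augment the LLM prompt with the retrieved chunks."""
    return "Answer using ONLY this context:\n\n" + "\n\n".join(context) + f"\n\nQuestion: {query}"

chunks = [
    "Refund policy: refunds are accepted within 30 days of purchase.",
    "Password policy: passwords must be reset every 90 days.",
]
top = retrieve("What is the refund policy?", chunks, k=1)
prompt = build_prompt("What is the refund policy?", top)
# `prompt` is what gets sent to the LLM in the generation step.
```

Swapping the toy `retrieve` for embedding similarity search is the substance of Steps 2 through 4.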
## Architecture Overview

```
[Documents] → [Chunking] → [Embedding] → [Vector DB]
                                             ↓
[User Query] → [Query Embedding] → [Similarity Search] → [Top K Chunks]
                                                              ↓
                                          [LLM Prompt + Chunks] → [Answer]
```
## Step 1: Document Ingestion

### Chunking Strategy
How you split documents dramatically impacts quality:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Good default chunking
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # characters per chunk (len below counts characters, not tokens)
    chunk_overlap=50,     # overlap prevents context breaks at chunk boundaries
    length_function=len,  # swap in a tokenizer-based function to size by tokens
    separators=["\n\n", "\n", ". ", " "],  # respect document structure
)

docs = splitter.split_documents(raw_documents)
```
Chunking rules of thumb:
- Code: chunk by function/class, not by character count
- Prose: 512–1024 tokens with 10% overlap
- Tables/structured data: keep rows together
- Short documents (< 500 tokens): don't chunk at all
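To see why overlap matters, here is a minimal fixed-size chunker, a toy version of what `RecursiveCharacterTextSplitter` does once separators are exhausted:

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size windows; each window repeats the last
    `overlap` characters of the previous one, so content that straddles
    a boundary appears intact in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Without the overlap, a sentence split at position 4 would be cut in half across two chunks and neither chunk would match a query about it well.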
Always attach metadata to chunks:
```python
from datetime import datetime

for chunk in docs:
    chunk.metadata.update({
        "source": file_path,
        "section": extract_heading(chunk.page_content),
        "date_indexed": datetime.now().isoformat(),
        "content_type": "documentation",  # or "code", "blog", "policy"
    })
```
## Step 2: Embeddings
Embeddings convert text to vectors for semantic search.
### Choose an Embedding Model
| Model | Dimensions | Best For | Cost |
|---|---|---|---|
| text-embedding-3-small | 1536 | General use | $0.02/1M tokens |
| text-embedding-3-large | 3072 | High accuracy | $0.13/1M tokens |
| nomic-embed-text (local) | 768 | Privacy/cost | Free |
| mxbai-embed-large (local) | 1024 | Best local | Free |
```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import OllamaEmbeddings

# Cloud (OpenAI)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Local (Ollama, free); use this instead of the OpenAI line above
embeddings = OllamaEmbeddings(model="nomic-embed-text")
```
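Similarity between embedding vectors is typically measured with cosine similarity. A dependency-free sketch; the 3-dimensional vectors here are toy stand-ins for real embedding output (768 to 3072 dimensions, per the table above):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||); 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: semantically close texts get nearby vectors.
cat = [0.9, 0.1, 0.0]
dog = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]
print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))  # True
```

This is the comparison the vector database runs at scale in the next step.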
## Step 3: Vector Database

### Choosing a Vector DB
| Database | Best For | Setup |
|---|---|---|
| Chroma | Local development | In-memory, zero config |
| Pinecone | Production cloud | Managed, scalable |
| Weaviate | Hybrid search | Self-hosted or cloud |
| pgvector | Already using PostgreSQL | No new service |
| Qdrant | High performance self-hosted | Docker |
```python
# Chroma (local dev)
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# pgvector (if you already have PostgreSQL)
from langchain_postgres import PGVector

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection="postgresql+psycopg://user:pass@localhost:5432/db",
)
```
## Step 4: Retrieval

### Basic Similarity Search
```python
query = "How do I reset a user's password?"
relevant_docs = vectorstore.similarity_search(query, k=5)
```
### Hybrid Search (Better for Production)
Combine semantic search with keyword search:
```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Keyword-based retrieval
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 5

# Semantic retrieval
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine (60% semantic, 40% keyword); weights align with the retrievers list
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.4, 0.6]
)
```
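`EnsembleRetriever` merges the two ranked lists with weighted Reciprocal Rank Fusion (RRF). A dependency-free sketch of that scoring; the constant `c=60` is the conventional RRF default, assumed here:

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum(weight / (c + rank)) over the
    lists it appears in, so docs ranked highly by multiple retrievers win."""
    scores: dict[str, float] = {}
    for ranked, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_b", "doc_c"]
semantic_hits = ["doc_b", "doc_d", "doc_a"]
fused = weighted_rrf([bm25_hits, semantic_hits], weights=[0.4, 0.6])
# doc_b ranks first: it appears near the top of both lists.
```

Rank-based fusion sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.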
### Re-ranking
After retrieval, re-rank chunks by relevance to reduce noise:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# CrossEncoderReranker takes a cross-encoder model object, not a model name
reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-large"),
    top_n=3,
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=ensemble_retriever
)
```
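Unlike the bi-encoder embeddings used for retrieval, a cross-encoder scores each (query, chunk) pair jointly, which is slower but more accurate; re-ranking then just sorts by that score and keeps `top_n`. A sketch with a word-overlap stand-in for the real model:

```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Score every (query, chunk) pair and keep the top_n chunks.
    A real cross-encoder runs a transformer over the concatenated pair;
    here a word-overlap stand-in plays that role."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

def overlap_score(query: str, chunk: str) -> float:
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = [
    "reset your password via settings",
    "billing and invoices",
    "password reset policy details",
]
top = rerank("password reset", chunks, overlap_score, top_n=2)
```

The point of the second pass: the cheap retriever casts a wide net (k=10 combined above), and the expensive scorer prunes it to the few chunks worth putting in the prompt.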
## Step 5: Generation
```python
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

llm = ChatAnthropic(model="claude-3-7-sonnet-20250219")

prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant. Answer the question using ONLY the provided context.
If the answer isn't in the context, say "I don't have information about that."

Context:
{context}

Question: {question}

Provide a clear, concise answer with citations to specific sections when possible.
""")

def format_docs(docs):
    return "\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    )

# Chain it together
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)

answer = rag_chain.invoke("What is the password reset policy?")
```
## Step 6: Evaluation
Never deploy a RAG system without evaluating it.
```python
# Use RAGAS for automated evaluation
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

# Create test dataset
test_cases = [
    {
        "question": "What is the refund policy?",
        "answer": rag_chain.invoke("What is the refund policy?").content,
        "contexts": [doc.page_content for doc in compression_retriever.invoke("What is the refund policy?")],
        "ground_truth": "Refunds are available within 30 days of purchase."
    },
    # ... more test cases
]

# evaluate() expects a datasets.Dataset, not a plain list
results = evaluate(Dataset.from_list(test_cases), metrics=[faithfulness, answer_relevancy, context_recall])
print(results)
# → e.g. faithfulness: 0.92, answer_relevancy: 0.88, context_recall: 0.85
```
## Production Checklist

## Next Steps