RAG Explained: The 5 Steps That Make LLMs Smarter
A beginner-friendly breakdown of RAG's five core steps: from document preprocessing and chunking to embeddings, vector databases, and how LLMs use retrieved context to generate accurate answers.
What Is RAG?
RAG stands for Retrieval-Augmented Generation. It's a technique that allows LLMs to answer questions using your own documents and data, not just what they learned during training.
Before we dive into how it works, let me clarify four key concepts:
- RAG: The overall pipeline that combines retrieval with generation
- Embedding: A vector representation of text that captures its meaning
- Vector DB: A database specifically designed to store and search embeddings
- LLM: The language model that generates the final answer
Now let's walk through the five steps that make RAG work.
Step 1: Document Preprocessing
This is where we prepare our documents for the system. We clean up the text, remove unnecessary formatting, and get everything ready for the next stage.
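As a rough illustration, preprocessing might look something like the sketch below. The function and the regular expressions are purely illustrative assumptions; the right cleanup depends entirely on what your source documents look like.
# A minimal preprocessing sketch for HTML-like input (illustrative only)
import re

def preprocess(raw_text):
    text = re.sub(r"<[^>]+>", " ", raw_text)  # drop HTML tags
    text = re.sub(r"\s+", " ", text)          # collapse whitespace and newlines
    return text.strip()

clean = preprocess("<p>RAG   combines retrieval\n with generation.</p>")
# Result: 'RAG combines retrieval with generation.'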
Step 2: Chunking
Chunking means breaking documents into smaller pieces of a specific length.
For example, imagine you have a 20-sentence passage. You could chunk it so that each sentence becomes one chunk—giving you 20 chunks total. The size of your chunks matters because:
- Too large: the LLM gets a lot of irrelevant information along with the relevant part
- Too small: each chunk loses the surrounding context it needs to make sense
# Simple chunking example: one sentence per chunk
text = "Sentence 1. Sentence 2. Sentence 3."
chunks = [s.strip() for s in text.split(".") if s.strip()]
# Result: ['Sentence 1', 'Sentence 2', 'Sentence 3']
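Splitting on sentences is the simplest possible strategy. Many real pipelines instead use fixed-size chunks with a small overlap, so information near a boundary isn't cut off. Here is one hedged sketch; the sizes are arbitrary and measured in characters rather than tokens.
# Fixed-size chunking with overlap (sizes here are illustrative)
def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some shared context between chunks
    return chunks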
Step 3: Creating Embeddings
Embeddings are vectors: numerical representations of text that capture meaning. We create them with dedicated embedding models, which are trained specifically to turn text into vectors rather than to generate text.
Popular embedding models include:
- text-embedding-3-small
- text-embedding-3-large
- gemini-embedding-001
These models convert each chunk into a vector. Chunks with similar meanings will have similar vectors.
# Conceptual example
chunk = "Machine learning is a subset of AI"
embedding = embedding_model.encode(chunk)
# Result: [0.234, -0.521, 0.832, ...] (typically hundreds to a few thousand numbers)
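If you want to run this for real, one common option is the open-source sentence-transformers library; the model name below is just one popular small model, not the only choice.
# Runnable sketch with sentence-transformers (pip install sentence-transformers)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small, widely used embedding model
chunks = [
    "Machine learning is a subset of AI",
    "Neural networks are used in machine learning",
    "Paris is the capital of France",
]
embeddings = model.encode(chunks)
print(embeddings.shape)  # (3, 384): one 384-dimensional vector per chunk

# Similar meanings produce similar vectors, which shows up as higher cosine similarity
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low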
Step 4: Storing in a Vector Database
A vector database is a database designed specifically to store embeddings alongside the original text and any metadata.
The power of vector databases is that they can quickly find chunks that are semantically similar to a query—even if the exact words don't match.
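To make this concrete, here is an in-memory sketch using FAISS, an open-source similarity-search library, together with the embedding model from the previous step. A dedicated vector database (Pinecone, Weaviate, Qdrant, pgvector, and so on) plays the same role in production; everything below is illustrative.
# Storing and searching embeddings with FAISS (pip install faiss-cpu sentence-transformers)
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = ["Machine learning is a subset of AI", "Paris is the capital of France"]

embeddings = np.asarray(model.encode(chunks), dtype="float32")
index = faiss.IndexFlatL2(embeddings.shape[1])  # exact nearest-neighbor search by L2 distance
index.add(embeddings)                           # store one vector per chunk

query = np.asarray(model.encode(["What is machine learning?"]), dtype="float32")
distances, ids = index.search(query, 1)         # retrieve the single closest chunk
print(chunks[ids[0][0]])                        # -> 'Machine learning is a subset of AI'
The search returns the positions of the nearest stored vectors, which you then map back to the original chunks and whatever metadata you kept with them.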
Step 5: LLM + Context
Here's where everything comes together:
- User asks a query: "What is machine learning?"
- We convert the query into an embedding
- Vector database retrieves the most relevant chunks
- We feed the LLM both the user query AND the retrieved chunks
- LLM generates an answer based on the actual content from your documents
# The flow at query time
user_query = "What is machine learning?"
query_embedding = embedding_model.encode(user_query)
relevant_chunks = vector_db.search(query_embedding, top_k=3)
context = "\n\n".join(relevant_chunks)  # combine the retrieved chunks into one context block
prompt = f"""
Context: {context}
User Question: {user_query}
Answer based on the context above:
"""
answer = llm.generate(prompt)
Now the LLM has the relevant information and can give an answer grounded in your actual documents instead of making things up.
Why This Matters
This is the basic RAG pipeline. There are many different types of RAG architectures emerging—advanced variations with query rewriting, multi-step retrieval, and hybrid approaches—but they all build on these five fundamental steps.
Once you understand this foundation, exploring more complex RAG patterns becomes much easier.