RAG Explained: The 5 Steps That Make LLMs Smarter
A beginner-friendly breakdown of RAG's five core steps: from document preprocessing and chunking to embeddings, vector databases, and how LLMs use retrieved context to generate accurate answers.
What Is RAG?
RAG stands for Retrieval-Augmented Generation. It's a technique that allows LLMs to answer questions using your own documents and data, not just what they learned during training.
Before we dive into how it works, let me clarify four key concepts:
- RAG: The overall pipeline that combines retrieval with generation
- Embedding: A vector representation of text that captures its meaning
- Vector DB: A database specifically designed to store and search embeddings
- LLM: The language model that generates the final answer
Now let's walk through the five steps that make RAG work.
Step 1: Document Preprocessing
This is where we prepare our documents for the system. We clean up the text, remove unnecessary formatting, and get everything ready for the next stage.
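As a rough illustration, preprocessing might look something like the sketch below. The function and the regular expressions are purely illustrative assumptions; the right cleanup depends entirely on what your source documents look like.
# A minimal preprocessing sketch for HTML-like input (illustrative only)
import re

def preprocess(raw_text):
    text = re.sub(r"<[^>]+>", " ", raw_text)  # drop HTML tags
    text = re.sub(r"\s+", " ", text)          # collapse whitespace and newlines
    return text.strip()

clean = preprocess("<p>RAG   combines retrieval\n with generation.</p>")
# Result: 'RAG combines retrieval with generation.'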
Step 2: Chunking
Chunking means breaking documents into smaller pieces of a specific length.
For example, imagine you have a 20-sentence passage. You could chunk it so that each sentence becomes one chunk—giving you 20 chunks total. The size of your chunks matters because:
- Too large: the LLM gets a lot of irrelevant information along with the relevant part
- Too small: each chunk loses the surrounding context it needs to make sense
# Simple chunking example: one sentence per chunk
text = "Sentence 1. Sentence 2. Sentence 3."
chunks = [s.strip() for s in text.split(".") if s.strip()]
# Result: ['Sentence 1', 'Sentence 2', 'Sentence 3']
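Splitting on sentences is the simplest possible strategy. Many real pipelines instead use fixed-size chunks with a small overlap, so information near a boundary isn't cut off. Here is one hedged sketch; the sizes are arbitrary and measured in characters rather than tokens.
# Fixed-size chunking with overlap (sizes here are illustrative)
def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some shared context between chunks
    return chunks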
Step 3: Creating Embeddings
Embeddings are vectors: numerical representations of text that capture meaning. We create them with dedicated embedding models, which are trained specifically to turn text into vectors rather than to generate text.
Popular embedding models include:
- text-embedding-3-small
- text-embedding-3-large
- gemini-embedding-001
These models convert each chunk into a vector. Chunks with similar meanings will have similar vectors.
# Conceptual example
chunk = "Machine learning is a subset of AI"
embedding = embedding_model.encode(chunk)
# Result: [0.234, -0.521, 0.832, ...] (typically hundreds to a few thousand numbers)
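If you want to run this for real, one common option is the open-source sentence-transformers library; the model name below is just one popular small model, not the only choice.
# Runnable sketch with sentence-transformers (pip install sentence-transformers)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small, widely used embedding model
chunks = [
    "Machine learning is a subset of AI",
    "Neural networks are used in machine learning",
    "Paris is the capital of France",
]
embeddings = model.encode(chunks)
print(embeddings.shape)  # (3, 384): one 384-dimensional vector per chunk

# Similar meanings produce similar vectors, which shows up as higher cosine similarity
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low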
Step 4: Storing in a Vector Database
A vector database is a database designed specifically to store embeddings alongside the original text and any metadata.
The power of vector databases is that they can quickly find chunks that are semantically similar to a query—even if the exact words don't match.
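To make this concrete, here is an in-memory sketch using FAISS, an open-source similarity-search library, together with the embedding model from the previous step. A dedicated vector database (Pinecone, Weaviate, Qdrant, pgvector, and so on) plays the same role in production; everything below is illustrative.
# Storing and searching embeddings with FAISS (pip install faiss-cpu sentence-transformers)
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = ["Machine learning is a subset of AI", "Paris is the capital of France"]

embeddings = np.asarray(model.encode(chunks), dtype="float32")
index = faiss.IndexFlatL2(embeddings.shape[1])  # exact nearest-neighbor search by L2 distance
index.add(embeddings)                           # store one vector per chunk

query = np.asarray(model.encode(["What is machine learning?"]), dtype="float32")
distances, ids = index.search(query, 1)         # retrieve the single closest chunk
print(chunks[ids[0][0]])                        # -> 'Machine learning is a subset of AI'
The search returns the positions of the nearest stored vectors, which you then map back to the original chunks and whatever metadata you kept with them.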
Step 5: LLM + Context
Here's where everything comes together:
- User asks a query: "What is machine learning?"
- We convert the query into an embedding
- Vector database retrieves the most relevant chunks
- We feed the LLM both the user query AND the retrieved chunks
- LLM generates an answer based on the actual content from your documents
# The flow at query time
user_query = "What is machine learning?"
query_embedding = embedding_model.encode(user_query)
relevant_chunks = vector_db.search(query_embedding, top_k=3)
context = "\n\n".join(relevant_chunks)  # combine the retrieved chunks into one context block
prompt = f"""
Context: {context}
User Question: {user_query}
Answer based on the context above:
"""
answer = llm.generate(prompt)
Now the LLM has the relevant information and can give an answer grounded in your actual documents instead of making things up.
Why This Matters
This is the basic RAG pipeline. There are many different types of RAG architectures emerging—advanced variations with query rewriting, multi-step retrieval, and hybrid approaches—but they all build on these five fundamental steps.
Once you understand this foundation, exploring more complex RAG patterns becomes much easier.