Understand RAG in One Article: The Complete Technical Flow from Indexing to Retrieval-Augmented Generation

Introduction

Have you noticed that large language models are a bit like top students taking a closed-book exam? They can talk about almost anything, but the moment you ask about something they never learned, they start making things up.

Ask for your company’s revenue from last year, and it guesses. Ask for your product’s support policy, and it guesses. Ask about an industry regulation released three days ago, and it still guesses.

This is not because the model wants to deceive you. Its knowledge is effectively frozen at the moment training ends. For anything beyond that, it can only infer.

So is there a way to turn the model from “closed-book” into “open-book”, letting it look things up before answering?

Yes. That approach is RAG.

What Is RAG? One-Sentence Version

RAG = retrieve first, generate second.

When a user asks a question, the system does not immediately hand it to the model. It first searches a knowledge base, finds content related to the question, and then sends both the question and the retrieved material to the model. The model uses those references to produce the final answer.

A rough analogy: if you ask a lawyer to analyze a contract, they usually do not answer purely from memory. They first look up the relevant clauses and precedents, then explain the conclusion with those materials in hand. RAG gives a model that same “look it up first” ability.

The benefits are straightforward:

Answers are grounded instead of improvised
Knowledge can be updated continuously without retraining the model
Private data can be used, such as internal company documents and manuals

When Do You Really Need RAG?

Not every use case needs RAG. If you want a model to write poetry, translate text, or polish copy, the model alone is often enough. But RAG becomes critical when your scenario has the following traits:

The knowledge is private. Internal product docs, operations manuals, and customer service records were never part of the model’s training data. Without RAG, the model knows nothing about them.

The knowledge changes over time. Laws are revised, financial policies shift, and new vulnerabilities appear constantly. A model’s training data has a cutoff date, so anything after that is outside its native memory.

The cost of mistakes is high. In medicine, law, and finance, a wrong answer can have serious consequences. Relying only on the model’s memory is too risky. You need a reliable source of truth behind the answer.

The Two Main Pipelines of RAG

To understand RAG, the most important thing is to separate its two timelines:

Offline stage: build the index. Before any user asks a question, you organize your documents and turn them into a searchable knowledge base. This is called index construction.

Online stage: answer the question. After a user asks something, the system retrieves relevant content from the knowledge base in real time and feeds it to the model for answer generation. This is called retrieval-generation.

These two pipelines happen at different times, but they work together. Let’s break them down one by one.
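To make the two timelines concrete, here is a minimal sketch. The names `toy_embed`, `build_index`, and `retrieve` are hypothetical, and the word-count "embedding" is only a stand-in for a real embedding model. The prompt assembly and model call that would follow retrieval are left out.

```python
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def build_index(chunks: list[str]) -> list[tuple[Counter, str]]:
    """Offline stage: embed every chunk once, keep vector and text together."""
    return [(toy_embed(chunk), chunk) for chunk in chunks]

def retrieve(question: str, index: list[tuple[Counter, str]], top_k: int = 2) -> list[str]:
    """Online stage: embed the question and return the best-matching chunks."""
    q = toy_embed(question)
    # Shared-word count stands in for vector similarity in this toy version.
    ranked = sorted(index, key=lambda item: sum((q & item[0]).values()), reverse=True)
    return [text for _, text in ranked[:top_k]]

# Offline: runs once, before any question arrives.
index = build_index([
    "Refunds are accepted within 30 days of purchase.",
    "Docker can be installed from the official packages.",
])

# Online: runs for every question.
evidence = retrieve("How many days do I have for refunds?", index, top_k=1)
```

The index is built once and reused. Only `retrieve` runs per question, which is why the online stage can stay fast.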

Offline Stage: How Documents Become a Searchable Knowledge Base

Suppose you have a pile of PDFs, Word files, web pages, and Markdown docs. Can you just throw all of them directly at a model? Not in practice. The documents are too large, too messy, and too inconsistent in format.

So you need a processing pipeline to convert raw documents into a searchable knowledge base. In most systems, that pipeline has four steps:

Step 1: Cleaning and Standardization

Raw documents come in all kinds of forms. PDFs may contain garbled text, Word files may carry hidden formatting, and web pages often include ads and navigation noise. This step cleans that up by removing useless characters, normalizing encoding, and converting content into a consistent structure.

It is like turning scattered lecture slides, notes, and screenshots into a clean study guide before an exam. The quality of that guide directly affects how efficient the next steps will be.

This is also one of the most overlooked yet most critical parts of a RAG system. If your data quality is poor, later optimization will not save you.
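As a rough illustration of the cleaning step, the sketch below normalizes text that has already been extracted from a file. The function name `clean_text` and the specific rules are assumptions. Real pipelines usually add format-specific handling, such as PDF extraction or stripping HTML navigation.

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Normalize one document's extracted text before chunking."""
    # NFKC folds compatibility forms (full-width letters, ligatures,
    # non-breaking spaces) into their ordinary equivalents.
    text = unicodedata.normalize("NFKC", raw)
    # Use one line-ending convention.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop control characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs, and trim spaces around line breaks.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" *\n *", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Keeping blank lines as paragraph boundaries matters here, because the chunking step can use them later.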

Step 2: Chunking

A single document can easily be tens of thousands of words long. You cannot send the whole thing to the model every time. Context windows are limited, and most of the content will be irrelevant to the question anyway.

So documents are split into smaller pieces, or chunks. At retrieval time, the system only sends the few chunks that are most relevant.

Chunking sounds easy, but it has real trade-offs. If chunks are too large, one chunk may mix several topics together, which hurts retrieval precision. If chunks are too small, important context gets broken apart and meaning may be lost.

Common strategies include:

splitting by fixed length
splitting by paragraph or section boundaries
using semantic models to decide where a chunk should end

There is no universally best chunk size. It depends on the structure and style of your documents.
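The first two strategies can be sketched in a few lines. The function names and default sizes are illustrative assumptions, and so is the overlap between neighboring fixed-length chunks. Overlap is a common way to soften the "context gets broken apart" problem, but it is not a requirement.

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters.

    Neighboring windows share `overlap` characters.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end
    return chunks

def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole paragraphs (separated by blank lines) into chunks.

    Chunks stay within max_chars, except that a single paragraph longer
    than max_chars is kept whole rather than cut.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

For example, `chunk_fixed("abcdefghij", size=4, overlap=1)` yields three windows in which each one starts with the last character of the previous one.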

Step 3: Embedding

This is the “magic” part at the core of RAG.

Once chunking is done, each text chunk is passed through an embedding model, which converts it into a long list of numbers, called a vector. You do not need to know the model internals to understand the key idea: texts with similar meaning end up close to each other in vector space.

“The weather is great today” and “It is sunny outside” may share few exact words, but they mean similar things, so their vectors will be close together. On the other hand, “The weather is great today” and “How do I install Docker?” will be far apart.

That is the purpose of embeddings. They turn human meaning into mathematical distance that machines can calculate.

The most common similarity metric is cosine similarity, which measures how closely two vectors point in the same direction, regardless of their length. Other metrics, such as Euclidean distance or dot product, are also used depending on the scenario.
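The three metrics are short enough to write out by hand. The vectors below are made up for illustration and have only three dimensions. Real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by both lengths; 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hand-made vectors, not output from any real model.
weather_today = [0.9, 0.1, 0.0]    # "The weather is great today"
sunny_outside = [0.8, 0.2, 0.1]    # "It is sunny outside"
install_docker = [0.0, 0.1, 0.9]   # "How do I install Docker?"

similar = cosine_similarity(weather_today, sunny_outside)
unrelated = cosine_similarity(weather_today, install_docker)
```

Here `similar` comes out close to 1 and `unrelated` close to 0. When all vectors are normalized to length 1, cosine similarity and dot product give the same ranking.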

Step 4: Store Everything in a Vector Database

After embedding, each chunk now has a vector representation. Those vectors need a dedicated storage engine: a vector database.

Why not just use a plain MySQL table? Because traditional relational databases are built around exact matches and range filters, such as finding rows where a name equals “Alice”. Vector databases are optimized for nearest-neighbor similarity search, such as finding the five chunks most semantically similar to a query.

Each stored record usually contains three parts:

the vector itself for similarity search
the original text to feed back into the model later
metadata such as file name, section, timestamp, or permission scope for filtering and traceability
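The three-part record maps naturally onto a small data class. This is a sketch with hypothetical names (`ChunkRecord`, `filter_by_metadata`) and made-up sample data. Each vector database has its own schema and filter syntax.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    vector: list[float]   # used for similarity search
    text: str             # fed back into the model later
    metadata: dict[str, str] = field(default_factory=dict)  # filtering and traceability

def filter_by_metadata(records: list[ChunkRecord], **required: str) -> list[ChunkRecord]:
    """Keep only records whose metadata matches every key=value pair."""
    return [
        r for r in records
        if all(r.metadata.get(key) == value for key, value in required.items())
    ]

records = [
    ChunkRecord([0.1, 0.9], "Refunds are accepted within 30 days.",
                {"file": "policy.md", "scope": "public"}),
    ChunkRecord([0.8, 0.2], "Escalate outages to the on-call lead.",
                {"file": "runbook.md", "scope": "internal"}),
]

# Restrict the search space before similarity search, e.g. by permission scope.
public_only = filter_by_metadata(records, scope="public")
```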

Common choices include Milvus, Pinecone, and Qdrant. If you already rely on PostgreSQL, pgvector is also a practical option.

At this point, the offline indexing stage is done. The knowledge base is ready, and the system can move on to online question answering.

Online Stage: What Happens After the User Asks a Question

Once a user submits a question, the system usually has to complete the full “find evidence + write answer” process in milliseconds to seconds. A common pipeline has five steps:

1. Embed the Question

The user’s question is also text, so the system converts it into a vector using the same embedding model used in the indexing stage. That “same model” detail matters. If the vectors are produced in different embedding spaces, they are no longer directly comparable.
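One cheap safeguard is to record which model built the index and refuse queries embedded with anything else. The class below is a hypothetical sketch, and the model names are placeholders, not real models.

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    """Remembers which embedding model produced the stored vectors."""
    model_name: str
    dimension: int
    vectors: list[list[float]] = field(default_factory=list)

    def check_query(self, query_vector: list[float], query_model: str) -> None:
        """Raise if the query was embedded in a different space than the index."""
        if query_model != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, "
                f"query embedded with {query_model!r}"
            )
        if len(query_vector) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} dimensions, got {len(query_vector)}"
            )

index = VectorIndex(model_name="embed-model-v1", dimension=3)
index.check_query([0.1, 0.2, 0.3], "embed-model-v1")  # same model: passes silently
```

The model-name check matters more than the dimension check. Two different models can produce vectors of the same length that still live in unrelated spaces.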

2. Similarity Retrieval

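The system uses the question vector to search the vector database for the closest matches. When the results go straight to the model, a Top 3 to Top 5 is common. When a re-ranking step follows, the first pass usually pulls a larger candidate set and leaves the narrowing for later. This is the initial recall stage. The goal is to pull in as many potentially relevant candidates as possible without missing key evidence.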
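Conceptually, similarity retrieval is "score every stored vector against the query and keep the best k". The sketch below does exactly that with a brute-force scan over hand-made vectors. Vector databases typically replace the scan with an approximate nearest-neighbor index so it scales, but the result has the same shape.

```python
import heapq
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], index: list[tuple[list[float], str]], k: int = 5):
    """Return the k best (score, text) pairs, highest score first."""
    scored = ((cosine_similarity(query_vec, vec), text) for vec, text in index)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

# Illustrative vectors only, not produced by a real embedding model.
index = [
    ([0.9, 0.1, 0.0], "chunk about weather"),
    ([0.0, 0.2, 0.9], "chunk about Docker"),
    ([0.7, 0.3, 0.1], "chunk about sunshine"),
]
hits = top_k([1.0, 0.0, 0.0], index, k=2)
```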

However, pure vector search has a weakness: it captures semantic similarity well, but it can miss exact keyword requirements. For example, if a user asks for the fix to “CVE-2024-1234”, vector search may return chunks about vulnerability remediation in general without returning the exact CVE entry.

That is why many production systems use hybrid retrieval, combining vector search with keyword-based methods such as BM25. This gives you both semantic coverage and lexical precision.
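One simple way to merge the two result lists is reciprocal rank fusion (RRF). It is my choice for this illustration, not the only option. It needs only the rank positions, so the vector scores and BM25 scores never have to be put on the same scale. The chunk IDs below are invented, and `k=60` is a commonly used default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Vector search surfaces general remediation content first and the exact entry last.
vector_hits = ["general-remediation", "patch-process", "cve-2024-1234"]
# Keyword search matches the literal identifier.
keyword_hits = ["cve-2024-1234", "release-notes"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

The exact CVE entry appears in both lists, so it rises to the top of `fused` even though vector search alone ranked it third.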

3. Re-ranking

The first retrieval pass usually returns a mixed bag. Some results are excellent, some are merely related. Re-ranking applies a more precise, usually slower model to score and reorder those candidates so the strongest evidence moves to the top.

You can think of initial recall as the preliminary round and re-ranking as the final round. By the end of both, the model receives much cleaner context.
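The shape of a re-ranking step is "score each (question, candidate) pair, sort, keep the best few". In the sketch below the scorer is a deliberately crude token-overlap function that stands in for the slower, more precise model a real system would plug in. The function names and candidate passages are made up.

```python
import re

def overlap_score(question: str, passage: str) -> float:
    """Toy relevance score: fraction of question tokens found in the passage."""
    q_tokens = set(re.findall(r"[a-z0-9-]+", question.lower()))
    p_tokens = set(re.findall(r"[a-z0-9-]+", passage.lower()))
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0

def rerank(question: str, candidates: list[str], score_fn=overlap_score, keep: int = 3) -> list[str]:
    """Reorder first-pass candidates by score_fn and keep the strongest ones."""
    ranked = sorted(candidates, key=lambda c: score_fn(question, c), reverse=True)
    return ranked[:keep]

candidates = [
    "General guidance on vulnerability remediation.",
    "CVE-2024-1234 is fixed by upgrading to the patched release.",
    "How to install Docker on Linux.",
]
best = rerank("What is the fix for CVE-2024-1234?", candidates, keep=1)
```

Because `score_fn` is a parameter, the toy scorer can be swapped for a real re-ranking model without changing the surrounding pipeline.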

4. Assemble the Prompt

The selected chunks are combined with the user’s original question to form an augmented prompt, something like this:

Please answer the question based on the following reference materials:

Reference 1: ...
Reference 2: ...

User question: What is the fix for CVE-2024-1234?

Prompt design also affects answer quality directly. Even with the same retrieval results, different instructions can lead to very different outputs.
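The template above is simple enough to generate with a small helper. The function name `build_prompt` and the optional `extra_instruction` parameter are assumptions added for illustration. The extra instruction shows one way the wording can be varied without touching retrieval.

```python
from typing import Optional

def build_prompt(question: str, references: list[str],
                 extra_instruction: Optional[str] = None) -> str:
    """Combine retrieved chunks and the user's question into one prompt."""
    lines = ["Please answer the question based on the following reference materials:"]
    if extra_instruction:
        lines.append(extra_instruction)
    lines.append("")
    for number, reference in enumerate(references, start=1):
        lines.append(f"Reference {number}: {reference}")
    lines.append("")
    lines.append(f"User question: {question}")
    return "\n".join(lines)

prompt = build_prompt(
    "What is the fix for CVE-2024-1234?",
    ["...", "..."],  # retrieved chunk texts go here
)
```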

5. Let the Model Generate the Final Answer

Finally, the augmented prompt is sent to the language model. Because the answer is generated with reference material in context, the risk of unsupported fabrication is much lower.

Of course, if retrieval itself goes wrong, the model can still answer incorrectly. That is why a good RAG system is never just “pick a model and ship it”. You also have to tune the data, chunking strategy, and retrieval pipeline.

If your index stores metadata such as source links, file names, or section headers, the system can also include citations in the final answer, which further improves trustworthiness.
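A minimal sketch of that idea: take the metadata of the chunks that were placed in the prompt and append a de-duplicated source list to the answer. The field names `file` and `section`, the sample values, and the output format are assumptions.

```python
def with_citations(answer: str, used_chunks: list[dict[str, str]]) -> str:
    """Append a numbered, de-duplicated source list built from chunk metadata."""
    seen = set()
    sources = []
    for chunk in used_chunks:
        key = (chunk["file"], chunk["section"])
        if key not in seen:
            seen.add(key)
            sources.append(f"[{len(sources) + 1}] {chunk['file']}, {chunk['section']}")
    if not sources:
        return answer
    return answer + "\n\nSources:\n" + "\n".join(sources)

cited = with_citations(
    "Upgrade to the patched release.",
    [
        {"file": "advisories.md", "section": "CVE-2024-1234"},
        {"file": "advisories.md", "section": "CVE-2024-1234"},  # same source twice
        {"file": "upgrade-guide.md", "section": "Patching"},
    ],
)
```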

Summary

Here is a compact recap of the two RAG pipelines:

Stage	Timing	What Happens
Index Construction	Offline, before the user asks	document cleaning → chunking → embedding → vector database storage
Retrieval-Generation	Online, when the user asks	query embedding → similarity retrieval → re-ranking → prompt assembly → answer generation

The core idea is simple: do not make the model take a closed-book exam. Give it an open-book workflow.

RAG is not a silver bullet. If recall quality is poor, chunks are cut at the wrong granularity, or the re-ranker performs badly, final answer quality will still suffer. In real systems, the biggest quality differences often come from data quality, chunking strategy, and retrieval tuning.

RAG itself is also evolving. Since 2025, newer patterns such as Agentic RAG and GraphRAG have gained traction. But the core logic has not changed. Once you understand indexing and retrieval-generation, every new RAG variant becomes easier to reason about.
