TL;DR
RAG (Retrieval-Augmented Generation) is a pattern where an LLM answers questions using your own documents instead of relying on its training data alone. You split your documents into chunks, convert each chunk into an embedding (a numerical vector), store those vectors in a vector database, and at query time retrieve the most relevant chunks to feed into the LLM prompt alongside the user question. The result: answers grounded in your documents, with citations and fewer hallucinations.
The Problem RAG Solves
Large language models are trained on public internet data with a fixed cutoff date. They do not know about your internal documents, your product catalog, or your company policies. When you ask a question they cannot answer from training data, they either refuse or, worse, fabricate a confident-sounding answer. This is the hallucination problem.
RAG solves this by giving the model access to external knowledge at inference time. Instead of retraining the model (which is expensive and slow), you retrieve relevant context from a knowledge base and inject it directly into the prompt. The model generates its answer grounded in that context. According to Techment Technology's 2026 analysis of RAG architectures, this approach reduces hallucinations by 40-60% compared to base model responses while keeping the knowledge base updatable without model retraining.
How Embeddings Work
The foundation of RAG is the embedding: a numerical representation of text that captures its semantic meaning. When you embed the sentence "How do I reset my password?" and the sentence "Steps to change your login credentials," they end up as vectors that are close together in high-dimensional space, even though they share almost no words.
An embedding model (such as bge-base-en-v1.5) takes a chunk of text and outputs a fixed-length array of floating-point numbers. bge-base-en-v1.5 produces 768 dimensions; other models use other sizes. The dimensions encode the text's meaning jointly, so a single dimension is rarely interpretable on its own. Two chunks about similar topics will have high cosine similarity between their vectors. Two unrelated chunks will have low similarity.
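To make cosine similarity concrete, here is a minimal sketch in plain Python. The three 4-dimensional vectors are hand-written, hypothetical values chosen for illustration; they are not real output from bge-base-en-v1.5, which would produce 768 numbers per text.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (assumed values, not model output).
reset_password = [0.9, 0.1, 0.0, 0.2]      # "How do I reset my password?"
change_credentials = [0.8, 0.2, 0.1, 0.3]  # "Steps to change your login credentials"
pizza_recipe = [0.0, 0.9, 0.8, 0.1]        # an unrelated text

related = cosine_similarity(reset_password, change_credentials)
unrelated = cosine_similarity(reset_password, pizza_recipe)
print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```

With these toy values the related pair scores close to 1 and the unrelated pair scores close to 0, which is the behavior a real embedding model aims for.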
"How do I reset my password?"
|
v
[Embedding Model]
|
v
[0.023, -0.118, 0.445, ... 768 dimensions]
|
v
Store in vector database
This is the indexing phase. You run it once for your entire document set, and then again incrementally whenever documents change.
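The indexing phase can be sketched roughly as follows. Everything here is an assumption made for illustration: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model (it captures word overlap, not meaning), `chunk_text` uses a simple word-window split, and the "vector database" is a plain Python list.

```python
import hashlib
import math

DIM = 64  # toy size for this sketch; bge-base-en-v1.5 would produce 768

def toy_embed(text, dim=DIM):
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def chunk_text(text, max_words=50, overlap=10):
    """Split text into overlapping word windows (hypothetical chunking policy)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_words - overlap):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def index_document(doc_id, text, store):
    """Embed every chunk of one document, replacing any stale entries for it."""
    store[:] = [entry for entry in store if entry["doc_id"] != doc_id]
    for i, chunk in enumerate(chunk_text(text)):
        store.append({
            "doc_id": doc_id,
            "chunk_id": i,
            "text": chunk,
            "vector": toy_embed(chunk),
        })

store = []
index_document("faq", " ".join(f"word{i}" for i in range(120)), store)
first_pass = len(store)
# Incremental update: re-indexing a changed document replaces its old chunks.
index_document("faq", "To reset your password open account settings", store)
second_pass = len(store)
```

Deleting a document's old entries before re-inserting is one simple way to handle the incremental case; a real vector database would typically offer its own upsert or delete-by-id operations.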
Vector Search: Finding What Matters
When a user asks a question, the same embedding model converts their query into a vector. The vector database then performs a similarity search: conceptually, it scores the query vector against the stored document vectors and returns the top-k most similar chunks. At scale, an index lets the database avoid comparing against every stored vector.
Cloudflare Vectorize, for example, uses hierarchical navigable small world (HNSW) indexing to perform this search in milliseconds, even across millions of vectors. The search is approximate, trading a small amount of precision for dramatic speed gains. In practice, the top 3-5 results usually contain the relevant context.
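For comparison, the exact (brute-force) version of top-k search is small enough to sketch in a few lines. This is not how Vectorize or any approximate index works internally; it is the baseline that such indexes approximate. The entries and their 3-dimensional vectors are hypothetical toy values.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vector, entries, k=3):
    """Score every entry against the query and keep the k best (exact search)."""
    scored = ((cosine_similarity(query_vector, e["vector"]), e) for e in entries)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

# Hypothetical stored chunks with made-up vectors.
entries = [
    {"id": "reset-password", "vector": [0.9, 0.1, 0.0]},
    {"id": "billing", "vector": [0.1, 0.9, 0.1]},
    {"id": "change-login", "vector": [0.8, 0.3, 0.1]},
    {"id": "shipping", "vector": [0.0, 0.2, 0.9]},
]

results = top_k([1.0, 0.0, 0.0], entries, k=2)
result_ids = [entry["id"] for _, entry in results]
print(result_ids)
```

Brute force costs one similarity computation per stored vector, which is fine for a few thousand chunks and is the reason approximate indexes matter at millions.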
Morphik's 2026 analysis of retrieval strategies identifies three main approaches: dense retrieval (pure vector similarity, best for semantic matching), sparse retrieval (keyword-based like BM25, best for exact terms), and hybrid retrieval (combining both). Production RAG systems increasingly use the hybrid approach to handle both conceptual questions and specific keyword lookups.
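One common way to combine a dense ranking with a sparse one is reciprocal rank fusion (RRF). The sketch below is an illustration of that general technique, not necessarily the method Morphik describes; the chunk IDs and both rankings are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs; each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for one query, best match first.
dense_ranking = ["chunk-a", "chunk-b", "chunk-c"]   # from vector similarity
sparse_ranking = ["chunk-c", "chunk-a", "chunk-d"]  # from a keyword scorer such as BM25

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Chunks that appear in both lists rise to the top, while a chunk found by only one retriever still survives in the merged list. RRF works on ranks alone, so the two retrievers' raw scores never need to be put on the same scale.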
The Retrieval + Generation Pipeline
With the relevant chunks retrieved, the full RAG pipeline looks like this:
[User Question]
      |
      v
[Embed Query] ----> [Vector DB: Similarity Search]
                              |
                              v
                    [Top-K Chunks Retrieved]
                              |
                              v
                    [Construct Prompt]
                      "Given these documents:"
                      + retrieved chunks
                      + "Answer this question:"
                      + user query
                              |
                              v
                    [LLM Generates Answer]
                              |
                              v
                    [Response with Citations]
The prompt is the critical integration point. It typically includes a system instruction ("Answer only based on the provided documents. If you cannot answer, say so."), the retrieved document chunks with source metadata, and the user question. The model sees the documents as context and generates an answer that references them. A well-built RAG system returns not just the answer but the specific source passages that support it.
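A minimal sketch of that prompt assembly, assuming each retrieved chunk is a dict with hypothetical `source` and `text` keys. The numbering scheme and layout are illustrative choices, and the call to the LLM itself is left out.

```python
SYSTEM_INSTRUCTION = (
    "Answer only based on the provided documents. "
    "If you cannot answer, say so."
)

def build_prompt(question, chunks):
    """Assemble instruction, numbered source chunks, and the user question."""
    lines = [SYSTEM_INSTRUCTION, "", "Given these documents:"]
    for n, chunk in enumerate(chunks, start=1):
        # The [n] label gives the model something concrete to cite.
        lines.append(f"[{n}] (source: {chunk['source']}) {chunk['text']}")
    lines += ["", "Answer this question:", question]
    return "\n".join(lines)

# Hypothetical retrieved chunks.
retrieved = [
    {"source": "account-help.md", "text": "Open Settings, then choose Reset password."},
    {"source": "security-policy.md", "text": "Passwords must be at least 12 characters."},
]

prompt = build_prompt("How do I reset my password?", retrieved)
print(prompt)
```

Keeping the source label next to each chunk is what lets the response carry citations: the application can map any `[n]` in the answer back to the passage it came from.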
RAG vs Fine-Tuning: When to Use Which
These two approaches solve different problems, and choosing the wrong one is a common mistake.
Use RAG when:
- Your data changes frequently. Product catalogs, support docs, policy manuals. RAG lets you update the knowledge base without touching the model.
- You need citations. RAG can point to the exact source document. A fine-tuned model on its own cannot reliably attribute an answer to a source.
- You have a large, diverse document set. Thousands of documents across different topics. RAG retrieves only what is relevant per query.
- You want to start fast. RAG requires no model training. You can go from documents to working Q&A in hours.
Use fine-tuning when:
- You need the model to adopt a specific style or tone. A customer service voice, a legal writing style, a brand personality.
- The task is narrow and well-defined. Classification, extraction, or formatting tasks where the model needs to learn a pattern, not recall facts.
- Latency is critical. Fine-tuned models produce answers in a single call. RAG adds embedding and retrieval steps before generation.
In practice, many production systems combine both: a fine-tuned model for tone and task structure, with RAG for factual grounding. The Techment 2026 analysis notes that this hybrid approach is becoming the default architecture for enterprise knowledge applications.
See It in Action
The RAG and Knowledge Base demo lets you ask questions against a seeded knowledge base and watch the retrieval + generation pipeline execute in real time. You can see which document chunks are retrieved, how they score, and how the LLM uses them to construct an answer.
Sources & Further Reading
- Techment Technology: Retrieval-Augmented Generation (RAG) in 2026. Architecture patterns, hallucination reduction metrics, hybrid RAG + fine-tuning approaches.
- Morphik: Retrieval Strategies for RAG. Dense vs sparse vs hybrid retrieval, re-ranking, and chunk optimization techniques.
- Cloudflare: Vectorize Documentation. HNSW indexing, vector similarity search, embedding storage and querying at the edge.
- Lewis et al.: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020). The original RAG paper from Facebook AI Research.
- Cloudflare Workers AI: bge-base-en-v1.5. The embedding model used in the demo: 768-dimensional vectors, optimized for retrieval tasks.