What Is RAG? A Visual Explanation

TL;DR

RAG (Retrieval-Augmented Generation) is a pattern where an LLM answers questions using your own documents instead of relying on its training data alone. You split your documents into chunks, convert them into embeddings (numerical vectors), store those vectors in a database, and at query time retrieve the most relevant chunks to feed into the LLM prompt alongside the user question. The result: answers grounded in your documents, with citations, and far fewer hallucinations.

The Problem RAG Solves

Large language models are trained on public internet data with a fixed cutoff date. They do not know about your internal documents, your product catalog, or your company policies. When you ask a question they cannot answer from training data, they either refuse or, worse, fabricate a confident-sounding answer. This is the hallucination problem.

RAG solves this by giving the model access to external knowledge at inference time. Instead of retraining the model (which is expensive and slow), you retrieve relevant context from a knowledge base and inject it directly into the prompt. The model generates its answer grounded in that context. According to Techment Technology's 2026 analysis of RAG architectures, this approach reduces hallucinations by 40-60% compared to base model responses while keeping the knowledge base updatable without model retraining.

How Embeddings Work

The foundation of RAG is the embedding: a numerical representation of text that captures its semantic meaning. When you embed the sentence "How do I reset my password?" and the sentence "Steps to change your login credentials," they end up as vectors that are close together in high-dimensional space, even though they share almost no words.

An embedding model (such as bge-base-en-v1.5) takes a chunk of text and outputs a fixed-length array of floating-point numbers. For bge-base-en-v1.5 that array has 768 dimensions; other models use other sizes. No single dimension has a human-readable meaning on its own; the meaning is carried by the direction of the vector as a whole. Two chunks about similar topics will have high cosine similarity between their vectors. Two unrelated chunks will have low similarity.
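Cosine similarity itself is a short formula: the dot product of two vectors divided by the product of their lengths. The following is a minimal sketch, not the demo's code. The three 4-dimensional vectors are made up for illustration; a real model would output hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, invented for this example only.
reset_password = [0.9, 0.1, 0.0, 0.2]
change_login = [0.8, 0.2, 0.1, 0.3]
pizza_recipe = [0.0, 0.1, 0.9, 0.1]

similar = cosine_similarity(reset_password, change_login)
unrelated = cosine_similarity(reset_password, pizza_recipe)
print(f"related pair: {similar:.3f}, unrelated pair: {unrelated:.3f}")
```

The related pair scores close to 1 and the unrelated pair close to 0, which is the property the retrieval step relies on.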

  "How do I reset my password?"
       |
       v
  [Embedding Model]
       |
       v
  [0.023, -0.118, 0.445, ... 768 dimensions]
       |
       v
  Store in vector database

This is the indexing phase. You run it once for your entire document set, and then again incrementally whenever documents change.
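A rough sketch of the indexing phase, including the incremental re-index when a document changes, might look like the following. Everything here is an assumption made for illustration: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, the word-window chunker and its sizes are arbitrary, and a plain dict plays the role of the vector database.

```python
import math
import re
import zlib

DIMENSIONS = 16  # toy size; the model named in the article outputs 768


def toy_embed(text):
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vector = [0.0] * DIMENSIONS
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vector[zlib.crc32(word.encode("utf-8")) % DIMENSIONS] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def chunk_words(text, size=8, overlap=2):
    """Split text into overlapping windows of `size` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def index_document(store, doc_id, text):
    """Drop any chunks previously stored for doc_id, then store fresh ones."""
    for key in [k for k in store if k[0] == doc_id]:
        del store[key]
    for position, chunk in enumerate(chunk_words(text)):
        store[(doc_id, position)] = {"text": chunk, "vector": toy_embed(chunk)}


store = {}
# Initial indexing run.
index_document(
    store,
    "password-help",
    "To reset your password open the account settings page and choose "
    "the reset option then follow the emailed link",
)
# The document changed, so only this document is re-indexed.
index_document(
    store,
    "password-help",
    "To reset your password open settings and follow the emailed link",
)
print(len(store), "chunks stored for password-help")
```

Keying each stored chunk by document ID and position is one simple way to make re-indexing a single document cheap; real vector databases expose their own upsert and delete operations for this.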

Vector Search: Finding What Matters

When a user asks a question, the same embedding model converts their query into a vector. The vector database then performs a similarity search: conceptually, it compares the query vector against the stored document vectors and returns the top-k most similar chunks. At scale, an index lets the database skip most of those comparisons.
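The simplest form of this search is an exhaustive scan: score every stored vector against the query and keep the k best. The sketch below does exactly that over a handful of made-up 3-dimensional vectors. The chunk IDs and values are hypothetical, and a production vector database would use an index rather than a full scan.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, stored, k=3):
    """Return the k (score, chunk_id) pairs most similar to the query, best first."""
    scored = (
        (cosine_similarity(query_vector, vector), chunk_id)
        for chunk_id, vector in stored.items()
    )
    return heapq.nlargest(k, scored)


# Hypothetical stored vectors, invented for this example.
stored = {
    "reset-password": [0.9, 0.1, 0.0],
    "change-email": [0.7, 0.3, 0.1],
    "refund-policy": [0.1, 0.9, 0.2],
    "office-hours": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]

results = top_k(query, stored, k=2)
for score, chunk_id in results:
    print(f"{chunk_id}: {score:.3f}")
```

The scan costs one similarity computation per stored vector, which is fine for thousands of chunks and is the cost that approximate indexes are designed to avoid.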

Cloudflare Vectorize, for example, uses hierarchical navigable small world (HNSW) indexing to perform this search in milliseconds, even across millions of vectors. The search is approximate: it can occasionally miss a true nearest neighbor, trading a small amount of recall for a large gain in speed. In practice, the top 3-5 results usually contain the relevant context.

Morphik's 2026 analysis of retrieval strategies identifies three main approaches: dense retrieval (pure vector similarity, best for semantic matching), sparse retrieval (keyword-based like BM25, best for exact terms), and hybrid retrieval (combining both). Production RAG systems increasingly use the hybrid approach to handle both conceptual questions and specific keyword lookups.
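One common way to combine a dense ranking and a sparse ranking is reciprocal rank fusion (RRF). The source does not say which fusion method any given system uses, so treat this as one illustrative option. The two input rankings and the chunk IDs are made up, and `k=60` is simply the conventional default constant for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each ID earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical output of a vector search (dense) and a BM25 search (sparse).
dense_ranking = ["reset-password", "change-login", "error-e4012"]
sparse_ranking = ["error-e4012", "reset-password", "billing-faq"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Because RRF uses only rank positions, it avoids having to put cosine scores and BM25 scores on a common scale. Chunks that both retrievers rank highly rise to the top of the fused list.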

The Retrieval + Generation Pipeline

With the relevant chunks retrieved, the full RAG pipeline looks like this:

  [User Question]
       |
       v
  [Embed Query] ----> [Vector DB: Similarity Search]
                              |
                              v
                       [Top-K Chunks Retrieved]
                              |
                              v
                    [Construct Prompt]
                    "Given these documents:"
                    + retrieved chunks
                    + "Answer this question:"
                    + user query
                              |
                              v
                         [LLM Generates Answer]
                              |
                              v
                    [Response with Citations]

The prompt is the critical integration point. It typically includes a system instruction ("Answer only based on the provided documents. If you cannot answer, say so."), the retrieved document chunks with source metadata, and the user question. The model sees the documents as context and generates an answer that references them. A well-built RAG system returns not just the answer but the specific source passages that support it.
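A minimal sketch of that prompt assembly follows. The system instruction and the "Given these documents" and "Answer this question" framing come from the text above. The `build_prompt` function, the chunk dictionary shape, the file names, and the bracketed numbering are assumptions for illustration, not the demo's actual format.

```python
SYSTEM_INSTRUCTION = (
    "Answer only based on the provided documents. "
    "If you cannot answer, say so."
)


def build_prompt(question, chunks):
    """Assemble one prompt string from retrieved chunks and the user question.

    Each chunk is a dict with 'source' and 'text' keys (a hypothetical shape).
    Numbering the chunks gives the model something concrete to cite.
    """
    lines = [SYSTEM_INSTRUCTION, "", "Given these documents:"]
    for number, chunk in enumerate(chunks, start=1):
        lines.append(f"[{number}] (source: {chunk['source']}) {chunk['text']}")
    lines += ["", "Answer this question:", question]
    return "\n".join(lines)


# Hypothetical retrieved chunks.
retrieved = [
    {"source": "account-help.md", "text": "Open Settings and choose Reset password."},
    {"source": "security-policy.md", "text": "Reset links expire after 24 hours."},
]

prompt = build_prompt("How do I reset my password?", retrieved)
print(prompt)
```

Carrying the source name alongside each chunk is what lets the application show the supporting passages next to the generated answer.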

RAG vs Fine-Tuning: When to Use Which

These two approaches solve different problems, and choosing the wrong one is a common mistake.

Use RAG when:

Your data changes frequently. Product catalogs, support docs, policy manuals. RAG lets you update the knowledge base without touching the model.
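You need citations. RAG can point to the exact source document. A fine-tuned model alone cannot do this reliably, because what it learned is stored in its weights rather than in retrievable passages.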
You have a large, diverse document set. Thousands of documents across different topics. RAG retrieves only what is relevant per query.
You want to start fast. RAG requires no model training. You can go from documents to working Q&A in hours.

Use fine-tuning when:

You need the model to adopt a specific style or tone. A customer service voice, a legal writing style, a brand personality.
The task is narrow and well-defined. Classification, extraction, or formatting tasks where the model needs to learn a pattern, not recall facts.
Latency is critical. Fine-tuned models produce answers in a single call. RAG adds an embedding step and a retrieval step before generation.

In practice, many production systems combine both: a fine-tuned model for tone and task structure, with RAG for factual grounding. The Techment 2026 analysis notes that this hybrid approach is becoming the default architecture for enterprise knowledge applications.

See It in Action

The RAG and Knowledge Base demo lets you ask questions against a seeded knowledge base and watch the retrieval + generation pipeline execute in real time. You can see which document chunks are retrieved, how they score, and how the LLM uses them to construct an answer.


Sources & Further Reading

Techment Technology: Retrieval-Augmented Generation (RAG) in 2026. Architecture patterns, hallucination reduction metrics, hybrid RAG + fine-tuning approaches.
Morphik: Retrieval Strategies for RAG. Dense vs sparse vs hybrid retrieval, re-ranking, and chunk optimization techniques.
Cloudflare: Vectorize Documentation. HNSW indexing, vector similarity search, embedding storage and querying at the edge.
Lewis et al.: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020). The original RAG paper from Facebook AI Research.
Cloudflare Workers AI: bge-base-en-v1.5. The embedding model used in the demo: 768-dimensional vectors, optimized for retrieval tasks.