
What is Retrieval-Augmented Generation (RAG)?

RAG gives large language models accurate, grounded access to information they were never trained on — your internal documents, your product database, your policy library. This guide explains how it works, when to use it, and how to build it correctly.

What RAG is

Retrieval-Augmented Generation is a pattern for grounding LLM responses in a specific body of knowledge at inference time. Instead of asking a model to rely on what it learned during training — which is static, potentially outdated, and scoped to public internet data — RAG retrieves relevant passages from a curated knowledge base and includes them in the prompt before the model generates a response.

The result is a system that can answer questions about your internal documentation, your product catalog, your legal contracts, or any proprietary corpus — accurately, with attribution, and without baking sensitive data into a model that is expensive and slow to retrain. RAG is not a single technology. It is an architectural pattern composed of several components that need to be designed and tuned together.

The term was introduced in a 2020 paper by Lewis et al. at Facebook AI Research, but the practical implementation has evolved significantly. Today, RAG systems range from simple retrieval pipelines using a single vector database to sophisticated agentic systems that perform multi-hop reasoning across heterogeneous data sources, rerank results, and verify citations before responding.

How RAG works: the full pipeline

A production RAG system has two phases: an offline indexing phase that processes and stores your knowledge base, and an online retrieval-and-generation phase that runs at query time. Getting both right is what separates a working demo from a system that performs reliably in production.

1. Document ingestion and chunking

Your source documents — PDFs, HTML pages, Markdown files, database records, Confluence pages, Notion exports — are loaded and split into chunks. Chunking strategy is one of the most consequential decisions in a RAG system. Chunks that are too large dilute retrieval precision: when you retrieve a 3,000-token passage to answer a specific question, most of that context is noise. Chunks that are too small lose the surrounding context that gives a sentence its meaning.

Common strategies include fixed-size chunking with overlap (simple, predictable), recursive character splitting (respects natural text boundaries), semantic chunking (splits at meaningful topic transitions), and document-structure-aware chunking (preserves headers, tables, and code blocks). The right strategy depends on your document types. Technical documentation benefits from structure-aware chunking. Dense prose benefits from semantic chunking. Tables often need to be handled as standalone units.
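As an illustration, the simplest of these strategies, fixed-size chunking with overlap, takes only a few lines. The sizes below are arbitrary demonstration values; production systems typically measure chunks in tokens, not characters:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Overlapping windows reduce the chance that a sentence's meaning
    is severed at a chunk boundary.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(chr(65 + i % 26) for i in range(500))  # 500 chars of sample text
chunks = chunk_text(text, chunk_size=200, overlap=50)
# Produces 4 chunks; each consecutive pair shares a 50-character overlap.
```

The overlap is what preserves context across boundaries: a sentence cut at the end of one chunk reappears intact at the start of the next.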

2. Embedding

Each chunk is passed through an embedding model that converts text into a dense vector — a list of floating-point numbers that encodes the semantic meaning of the text in a high-dimensional space. Chunks with similar meaning end up close together in this space. This is what makes semantic search possible: you can find chunks that are conceptually related to a query even when they share no exact keywords.
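The "close together" intuition is just vector geometry, measured with cosine similarity. A toy illustration with hand-made 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions, and the values below are invented for demonstration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings: "dog" and "puppy" point in similar directions,
# "invoice" does not.
dog = [0.9, 0.1, 0.05]
puppy = [0.85, 0.15, 0.1]
invoice = [0.05, 0.2, 0.95]

cosine_similarity(dog, puppy)    # close to 1.0
cosine_similarity(dog, invoice)  # much lower
```

A real system never hand-crafts these vectors; the embedding model produces them, and the geometry does the rest.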

Common embedding models include OpenAI text-embedding-3-small and text-embedding-3-large, Cohere Embed, and open-source models like BGE and E5. The choice of embedding model affects retrieval quality, cost, and latency. Larger models produce more discriminative embeddings but cost more per token. For most production workloads, text-embedding-3-small offers a strong quality-to-cost ratio. For high-stakes retrieval — legal, medical, or compliance documents — the larger model or a domain-fine-tuned embedder is worth the cost.

3. Vector store

The resulting vectors are stored in a vector database alongside their source text and metadata. At query time, this database performs approximate nearest-neighbor search to find the vectors (and therefore chunks) closest to the query vector. The leading vector databases include Pinecone (managed, easy to scale), Weaviate (open-source, supports hybrid search), Qdrant (open-source, high-performance), and pgvector (PostgreSQL extension, ideal when you already run Postgres and want to minimize infrastructure complexity). Chroma is popular for local development and prototyping.

Metadata filtering is a critical capability to evaluate when choosing a vector store. In most production systems, you do not want to search across all documents — you want to search within a specific project, a specific time range, a specific document type, or documents belonging to a specific tenant. A vector store that supports filtered similarity search lets you apply these constraints without sacrificing retrieval quality.
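To make the idea concrete, here is a brute-force, in-memory sketch of filtered similarity search. Real vector databases use approximate nearest-neighbor indexes instead of scanning every vector, and the 2-dimensional vectors, tenant names, and chunk texts below are made up for illustration:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def filtered_search(index: list[dict], query_vec: list[float], tenant: str, k: int = 2) -> list[dict]:
    """Apply the metadata filter first, then rank only the surviving
    chunks by similarity to the query vector."""
    candidates = [e for e in index if e["meta"]["tenant"] == tenant]
    candidates.sort(key=lambda e: cosine(e["vector"], query_vec), reverse=True)
    return candidates[:k]

index = [
    {"text": "Refund policy v2", "vector": [0.9, 0.1],   "meta": {"tenant": "acme"}},
    {"text": "Refund policy v1", "vector": [0.8, 0.2],   "meta": {"tenant": "acme"}},
    {"text": "Refund policy",    "vector": [0.95, 0.05], "meta": {"tenant": "globex"}},
]

results = filtered_search(index, [1.0, 0.0], tenant="acme")
# Only "acme" chunks are returned, even though the "globex" chunk
# is the closest match overall -- the tenant boundary is enforced.
```

This pre-filtering behavior is exactly what to test for when evaluating a vector store: a database that filters after retrieval can silently return fewer than k results or leak across tenant boundaries.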

4. Retrieval

At query time, the user's question is embedded using the same model used during indexing. The resulting query vector is used to search the vector database for the top-k most similar chunks. These chunks — typically three to ten, depending on context window size and relevance threshold — are assembled into a context block that is prepended to the user's question before being sent to the LLM.

Hybrid search improves retrieval quality significantly in most real-world deployments. Pure vector search excels at semantic similarity but can miss exact-match queries — product codes, proper names, technical identifiers. Combining vector search with BM25 keyword search (the algorithm behind traditional full-text search) and using a reranker to merge and re-score the combined result set consistently outperforms either method alone. Cohere Rerank and cross-encoders like BGE Reranker are commonly used for this step.
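One common way to merge the keyword and vector result lists before reranking is reciprocal rank fusion (RRF). A minimal sketch, with made-up document IDs; the constant k=60 is the conventional damping value from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs.

    Each document scores 1 / (k + rank) in every list it appears in,
    so documents ranked highly by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc_intro", "doc_sku_123", "doc_faq"]
vector_hits = ["doc_intro", "doc_pricing", "doc_sku_123"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
# "doc_intro" wins: it is ranked first by both retrievers.
```

The fused list would then typically be passed to a cross-encoder reranker for final scoring against the query.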

5. Generation

The retrieved context and the user's question are combined into a prompt and passed to the LLM. The model generates a response grounded in the retrieved passages. A well-designed system prompt instructs the model to answer only from the provided context, to acknowledge when it cannot find the answer in the retrieved documents, and to cite its sources. This instruction-following behavior is what prevents the model from hallucinating when retrieval fails — which it will, on some fraction of queries, in any real-world deployment.
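A sketch of how that prompt assembly might look. The system-prompt wording here is illustrative, not canonical; it should be tuned for your model and domain:

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> tuple[str, str]:
    """Assemble a system prompt and user message for grounded generation.

    Each chunk is numbered so the model can cite it, and the system
    prompt tells the model to refuse rather than guess.
    """
    system = (
        "Answer using ONLY the numbered context passages below. "
        "Cite passages as [1], [2], etc. If the answer is not in the "
        "context, say you could not find it -- do not guess."
    )
    context = "\n\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, start=1)
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return system, user

system, user = build_grounded_prompt(
    "What is the refund window?",
    [{"source": "policy.md", "text": "Refunds are accepted within 30 days."}],
)
# `system` and `user` are then sent to whatever chat-completion API you use.
```

Numbering the passages is what makes citations checkable downstream: you can verify that every `[n]` in the answer refers to a passage that actually supports the claim.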

Why RAG beats fine-tuning for most use cases

Fine-tuning trains new weights into a model by running gradient updates on a dataset of examples. It is the right tool when you need a model to produce a specific style, follow a particular output format consistently, or perform a task where the behavior pattern itself — not the knowledge — is what needs to change. It is the wrong tool when you need a model to know specific, frequently updated facts about your business.

The fundamental problem with fine-tuning for knowledge injection is that models do not reliably memorize facts from training data the way a database stores records. They blend knowledge probabilistically across their weights, which means a fine-tuned model may confidently produce answers that blend your training data with its pretraining knowledge in ways that are difficult to predict or audit. RAG, by contrast, makes the knowledge source explicit and inspectable — you can log what was retrieved, verify it is correct, and trace any answer back to its source document.

Fine-tuning is also expensive to maintain. Every time your knowledge base changes — a new policy, an updated product spec, an amended contract — you need to retrain. RAG updates are as simple as re-indexing the changed documents. For knowledge bases that change weekly or daily, RAG is not just more accurate — it is the only operationally viable option.

The two techniques are not mutually exclusive. Production systems sometimes combine them: a fine-tuned model that follows the output format and tone your organization requires, retrieving grounded context via RAG. But if you are choosing between the two for a knowledge retrieval use case, RAG should be your default.

Real-world RAG use cases

Internal knowledge bases

Engineering teams, legal departments, and operations teams accumulate enormous amounts of documentation that is practically unsearchable. A RAG system over internal wikis, runbooks, and policies lets employees ask natural-language questions and get grounded answers — reducing the time spent chasing down the right person to ask.

Customer support

Support agents augmented by RAG can access product documentation, past ticket resolutions, and policy documents in real time during a conversation. Fully automated support bots use RAG to answer common questions without hallucinating features that do not exist or policies that have changed.

Document search and review

Legal teams, compliance officers, and researchers working across large document collections use RAG to ask questions across hundreds of contracts, filings, or reports — getting specific answers with citations rather than keyword matches that still require manual reading.

Sales and procurement

Sales teams use RAG over product documentation and pricing sheets to generate accurate proposals. Procurement teams use it to search vendor contracts and identify obligations. In both cases, accuracy and attribution matter — the system cannot afford to confabulate terms.

When NOT to use RAG

RAG is not the right answer for every AI use case. Understanding its limitations prevents you from applying it where it will disappoint you.

Tasks requiring deep reasoning, not knowledge retrieval

If the task requires multi-step mathematical reasoning, code generation from scratch, or complex logical inference — and not domain knowledge — RAG adds complexity without improving performance. The bottleneck is the model's reasoning capability, not what it can retrieve.

Highly structured data queries

If users need to query structured data — sales figures, inventory counts, account balances — a Text-to-SQL system that generates and executes database queries is almost always better than vectorizing your database records. RAG works on unstructured text; SQL databases are purpose-built for structured data retrieval.
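The contrast is easy to see in code. In a Text-to-SQL system, an LLM translates the user's question into a SQL query; below, the query is hand-written to stand in for that generated output, and the table and figures are invented for illustration:

```python
import sqlite3

# A toy sales table standing in for a real analytics database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 1200.0), ("west", 800.0), ("east", 300.0)],
)

# User question: "What were total sales in the east region?"
# In a real system this SQL string would come from the LLM.
generated_sql = "SELECT SUM(amount) FROM sales WHERE region = 'east'"
(total,) = conn.execute(generated_sql).fetchone()
# total == 1500.0 -- an exact aggregate that no similarity search
# over vectorized rows could reliably produce.
```

Retrieval returns text that looks relevant; SQL computes the actual answer. For aggregates, joins, and filters over structured records, that difference is decisive.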

Sub-second latency requirements

A RAG pipeline adds latency: you pay for the embedding step, the vector search, the optional reranking, and then the generation call. For applications where the response must arrive in under 200ms — real-time autocomplete, trading systems, interactive games — RAG in its standard form is usually too slow. Caching common queries can help, but there are architectural limits.

Knowledge bases under ~20 documents

For very small knowledge bases, fitting everything into a single large context window with a long-context model like GPT-4o or Claude 3.5 Sonnet is often simpler and more accurate than building a retrieval pipeline. RAG's value is in making large, heterogeneous corpora searchable — at small scales, the infrastructure overhead rarely justifies itself.

The RAG technical stack

No single tool covers the full RAG pipeline. Production systems are composed of a document loader, a chunking library, an embedding model, a vector store, an optional reranker, and an LLM. The orchestration layer connects them.

Orchestration frameworks

LangChain provides a composable set of abstractions — document loaders, text splitters, retrievers, chains — that wire together the RAG pipeline components. It is the most widely used RAG framework and has the largest ecosystem of integrations. LlamaIndex (formerly GPT Index) is optimized specifically for RAG use cases and offers more sophisticated indexing strategies out of the box, including hierarchical indexing and knowledge graphs. For teams that want maximum control, building without a framework using the raw APIs of the embedding model and vector database is a valid and often cleaner choice for simpler pipelines.

Vector databases

Pinecone is the dominant managed option — easy to start, scales well, and handles metadata filtering cleanly. Weaviate offers an open-source alternative with native support for hybrid search. pgvector extends PostgreSQL with vector similarity search, making it the low-overhead choice for teams already running Postgres who want to avoid adding another managed service. Qdrant is gaining adoption for its performance and Rust-based architecture. Chroma is the standard choice for local development and research prototypes.

Evaluation

Ragas (Retrieval-Augmented Generation Assessment) is the most widely used evaluation framework for RAG systems. It measures faithfulness (does the answer follow from the retrieved context?), answer relevancy (does the answer address the question?), context precision (is the retrieved context relevant?), and context recall (did the system retrieve enough of the relevant information?). Evaluating these metrics separately is what allows you to diagnose whether a system failure is a retrieval problem or a generation problem.
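To see why separating the metrics matters, here are simplified versions of the two retrieval-side metrics in plain Python. These assume you already know which chunks are relevant (a labeled evaluation set); the real Ragas library instead uses LLM judgments to estimate relevance, so treat this as a conceptual sketch:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that were retrieved."""
    if not relevant:
        return 1.0
    return sum(1 for c in relevant if c in retrieved) / len(relevant)

retrieved = ["chunk_a", "chunk_b", "chunk_c", "chunk_d"]
relevant = {"chunk_a", "chunk_c", "chunk_e"}

context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant -> 0.5
context_recall(retrieved, relevant)     # 2 of 3 relevant were retrieved -> ~0.67
```

Low precision with high recall points to a retriever that drowns the model in noise; high precision with low recall points to missing chunks — two different fixes, which is exactly why the metrics are kept separate.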

We build RAG systems.

From the chunking strategy and embedding pipeline to the vector database schema, retrieval logic, and production deployment — we design and ship RAG systems that work accurately and hold up under real usage. Talk to us about what you are trying to build.

Our RAG services

AR Data Intelligence Solutions Inc. · AI-augmented delivery across AI, Blockchain, and Decentralized Tech · Stouffville, Ontario, Canada

©2026 AR Data Intelligence Solutions, Inc. All Rights Reserved.