Definition

What is RAG (Retrieval-Augmented Generation)?

A technique that retrieves relevant documents or data at inference time and injects them into the model's context window before generating a response.

RAG gives LLMs access to information they were never trained on, or that changes faster than realistic retraining cycles. The model stops being a sealed box of frozen facts and becomes a reasoning layer on top of a live document index. In practice, quality depends less on which model you pick than on how documents are chunked, retrieved, and assembled into the prompt.

  • Combines a retriever (typically semantic search over a document index) with a generator (the LLM).
  • Grounds answers in source material, reducing hallucinations and enabling citations back to the source.
  • Works on private data or data that changes faster than retraining cycles.
  • Scopes cleanly per user or tenant by filtering the index at query time.
  • Quality depends more on chunking and retrieval than on model choice.

How RAG works

RAG has three stages:

  • Index: documents are chunked, embedded into a vector space, and stored in a retrieval index.
  • Retrieve: the user’s query is embedded the same way and the top-k most similar chunks are pulled.
  • Generate: those chunks are inserted into the prompt and the model answers against them.
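
The pipeline is short enough to sketch end to end. Below is a minimal, dependency-free Python version; the bag-of-words `embed()` is a toy stand-in for a real embedding model, and the plain list stands in for a vector store.

```python
from math import sqrt

# --- Index: chunk documents, embed each chunk, store the pairs ---
def chunk(doc: str, size: int = 40) -> list[str]:
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> dict[str, float]:
    # Toy bag-of-words "embedding" so the sketch runs without a model;
    # a real pipeline calls an embedding model here.
    vec: dict[str, float] = {}
    for token in text.lower().split():
        vec[token] = vec.get(token, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "RAG retrieves relevant documents at inference time and injects them into the prompt.",
    "Fine-tuning bakes knowledge into model weights and is slow to update.",
]
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]

# --- Retrieve: embed the query the same way, take the top-k chunks ---
query = "How does RAG keep model answers fresh?"
q_vec = embed(query)
top_k = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:3]

# --- Generate: assemble the chunks into the prompt and call the LLM ---
context = "\n\n".join(text for text, _ in top_k)
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
# `prompt` now goes to whichever LLM plays the generator role.
```

A production system swaps `embed()` for a real embedding model and the list for a vector store, but the shape of the pipeline is the same.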

Most RAG failures are retrieval failures, not generation failures. One survey of RAG systems found that roughly 70 percent of retrieved passages did not directly contain the information needed to answer the query: the retriever returned relevant-looking documents, but the assembled context wasn’t sufficient. Chunking strategy, embedding quality, and result re-ranking usually matter more than the choice of foundation model.
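
One common mitigation is to over-retrieve and re-rank: pull a generous candidate set by similarity, then re-score each candidate against the query with a stronger model and keep only the best few. A minimal sketch, with a trivial word-overlap scorer standing in for the learned cross-encoder a real system would use:

```python
def rerank(query: str, candidates: list[str], keep: int = 3) -> list[str]:
    # Stand-in scorer: production re-rankers are cross-encoders that
    # read query and passage together and predict relevance.
    def score(passage: str) -> float:
        q_tokens = set(query.lower().split())
        p_tokens = set(passage.lower().split())
        return len(q_tokens & p_tokens) / max(len(q_tokens), 1)

    return sorted(candidates, key=score, reverse=True)[:keep]

# Retrieve broadly (say, top-20 by vector similarity), then keep the
# best 3 after re-scoring, so the assembled context stays focused.
```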

Why RAG matters

LLMs have fixed training cutoffs and cannot see private or proprietary data. Fine-tuning is expensive and slow to update. RAG fills this gap: plug any document corpus into any model, update the index in minutes, and return answers that are traceable to their source.

Three properties make RAG the default enterprise pattern:

  • Freshness. Re-indexing takes minutes; retraining takes days. Anything downstream of a changing source of truth pays a compounding accuracy tax under fine-tuning.
  • Provenance. Answers carry citations back to the underlying passages. Regulated industries and customer-facing applications can audit where a claim came from.
  • Scope. The same index can serve different users by filtering on tenant, role, or access grant, which makes per-user data isolation practical (see the filter sketch after this list).
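
In code, per-tenant scoping is usually just a metadata filter applied before similarity scoring. A minimal sketch, assuming each indexed chunk carries a `tenant_id` field (the field name is an assumption for illustration):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    tenant_id: str          # assumed metadata field for illustration
    embedding: list[float]  # assumed unit-normalized, so dot product ~ cosine

def retrieve(index: list[Chunk], query_embedding: list[float],
             tenant_id: str, k: int = 5) -> list[Chunk]:
    # Filter first: chunks outside the caller's tenant are never even
    # scored, so isolation does not depend on ranking behavior.
    visible = [c for c in index if c.tenant_id == tenant_id]
    dot = lambda c: sum(a * b for a, b in zip(query_embedding, c.embedding))
    return sorted(visible, key=dot, reverse=True)[:k]
```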

Common misconceptions about RAG

  • “RAG means vector search.” Vector similarity is one retrieval strategy. Keyword search (BM25), hybrid dense-plus-sparse, and graph-based retrieval are all valid RAG backends. Many production systems use hybrid approaches (see the fusion sketch after this list).
  • “Bigger context windows kill RAG.” Long context helps, but accuracy still degrades as input length grows (context rot). Stuffing a million tokens of loosely relevant text still produces worse answers than a focused, well-retrieved prompt.
  • “RAG eliminates hallucinations.” It reduces them when retrieval succeeds. When retrieval fails, the model still generates an answer, often a plausible but ungrounded one. Citations are only as honest as the retriever that produced them.
  • “RAG is a product.” It’s a pattern. A production RAG system is really indexing, retrieval, re-ranking, and prompting glue, each with its own failure modes.
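
For the hybrid case in the first point above, a common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF): each retriever votes for a document by rank position, and the votes are summed. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # `rankings`: each inner list holds doc IDs ordered best-first by
    # one retriever (e.g. BM25, dense vectors). The constant k dampens
    # the weight of any single list's top result; 60 is the value from
    # the original RRF paper and a common default.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused = reciprocal_rank_fusion([bm25_ids, vector_ids])
```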

RAG and Wire

Wire handles the index and retrieve stages of RAG. When you upload files to a container, Wire chunks them, generates embeddings, and exposes the index through the wire_search tool. Your agent keeps the generator role: it queries the container at inference time like any retrieval backend. You get the RAG pattern without standing up a vector database, chunking pipeline, or embeddings infrastructure.
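
In agent code, the division of labor can look something like the sketch below. Only the wire_search tool name comes from this page; the signature, argument names, and response shape are assumptions for illustration, not Wire's documented interface.

```python
def wire_search(container: str, query: str, top_k: int = 5) -> list[dict]:
    """Stub for the tool Wire exposes; signature assumed, not documented."""
    ...

def build_prompt(user_question: str) -> str:
    hits = wire_search(container="support-docs", query=user_question)
    context = "\n\n".join(hit["text"] for hit in hits)  # field name assumed
    # The returned prompt goes to whichever LLM plays the generator role.
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {user_question}"
    )
```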

Frequently asked questions

Common questions about RAG (Retrieval-Augmented Generation).

Is RAG the same as semantic search?
No. Semantic search is the retrieval step. RAG adds a generation step on top, so the LLM produces a written answer grounded in retrieved passages rather than returning a list of documents.
Does RAG require a vector database?
No. Vector databases are the most common implementation because embeddings capture semantic similarity well, but keyword search (BM25), hybrid search, and graph retrieval are all valid RAG backends.
When should I use RAG versus fine-tuning?
RAG is the default when knowledge changes, needs citations, or varies per user. Fine-tuning is the right tool for style, output format, and narrow high-volume behavior. The two compose: fine-tune for behavior, retrieve for facts.
Why does RAG sometimes produce wrong answers even with retrieval?
Retrieval can pull relevant-looking but unhelpful passages, or pull the right content in fragments that prevent synthesis. Research suggests around 70 percent of retrieved passages don't directly contain the answer. Chunking strategy, re-ranking, and prompt construction matter more than you might expect.
How is RAG different from a context window?
The context window is where retrieved data lands. RAG is how it gets there. They work together: RAG decides what goes in, the context window decides how much fits.
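
A minimal sketch of that division of labor: pack the retrieved chunks, best-first, until a token budget is exhausted. The four-characters-per-token estimate is a rough heuristic; real systems use the model's own tokenizer.

```python
def pack_context(chunks: list[str], budget_tokens: int) -> str:
    # `chunks` is assumed to arrive best-first from retrieval.
    packed: list[str] = []
    used = 0
    for chunk in chunks:
        cost = max(1, len(chunk) // 4)  # ~4 chars/token, rough estimate
        if used + cost > budget_tokens:
            break
        packed.append(chunk)
        used += cost
    return "\n\n".join(packed)
```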

Put context into practice

Create your first context container and connect it to your AI tools in minutes.

Create Your First Container