RAG

Definition

Retrieval-Augmented Generation (RAG) is an AI technique in which relevant documents are retrieved first and then passed to a language model, so that the generated response is grounded in that retrieved content rather than in training data alone.

Frequently Asked Questions

What's the difference between RAG and fine-tuning?
RAG looks up relevant information at question time (from documents, databases, or search indexes) and then has the model write an answer using that retrieved context. Fine-tuning changes the model’s behavior by training it on examples ahead of time. Use RAG when you need up-to-date, source-grounded answers; use fine-tuning when you need consistent style, formatting, or task behavior and the knowledge doesn’t change often.
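
The retrieve-then-generate flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the word-count cosine similarity stands in for a real embedding model and vector index, the sample documents and the names `retrieve` and `build_prompt` are hypothetical, and the final call to a language model is left out because it depends on the provider.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=2):
    """Return up to k documents most similar to the question.

    Assumption: word-count similarity is a toy stand-in for embeddings.
    Documents with zero similarity are dropped.
    """
    query_vec = Counter(tokenize(question))
    scored = [(cosine(query_vec, Counter(tokenize(doc))), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(question, context_chunks):
    """Assemble the prompt that would be sent to the language model."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(context_chunks))
    return (
        "Answer using only the context below. Cite sources by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical knowledge base, for illustration only.
DOCS = [
    "The refund window is 30 days from purchase.",
    "Support is open Monday to Friday.",
    "Shipping to Canada takes 5 to 7 business days.",
]

question = "What is the refund window?"
top_chunks = retrieve(question, DOCS, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

Fine-tuning has no equivalent step at question time: the knowledge would already be in the model weights, which is why updating it requires retraining rather than editing `DOCS`.
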
When should I use RAG?
Use RAG when answers must be based on specific, changing, or proprietary information, such as product docs, policies, tickets, contracts, or internal wikis. It’s especially useful when you need citations, want to reduce hallucinations, or can’t include sensitive data in model training. If your task is mostly general reasoning with no private knowledge, plain prompting may be enough.
How much does RAG cost?
RAG cost depends on (1) LLM usage (input/output tokens), (2) embedding generation for documents and queries, (3) vector storage and search (index size, query volume, latency tier), and (4) ingestion/ETL (parsing, chunking, re-indexing). Costs rise with more documents, higher query volume, larger context windows, and frequent re-embedding. Common ways to control cost are to retrieve fewer, higher-quality chunks, to cache results, and to use smaller models for retrieval steps while reserving larger models for final answers.
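
The four cost components above can be combined into a rough monthly estimate. The sketch below is only a back-of-the-envelope model under stated assumptions: every price is a placeholder supplied by the caller (no real vendor pricing is implied), vector storage and ingestion are treated as flat monthly fees, and the names `RagCostInputs` and `estimate_monthly_cost` are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass
class RagCostInputs:
    queries_per_month: int
    chunks_per_query: int           # retrieved chunks placed in the prompt
    tokens_per_chunk: int
    question_tokens: int
    output_tokens: int
    corpus_tokens: int              # total tokens in the indexed documents
    reembed_fraction: float         # share of the corpus re-embedded each month
    price_llm_input_per_1k: float   # placeholder prices, not real quotes
    price_llm_output_per_1k: float
    price_embedding_per_1k: float
    vector_store_monthly: float     # assumed flat fee for storage and search
    ingestion_monthly: float        # assumed flat fee for parsing and chunking


def estimate_monthly_cost(c):
    """Return a per-component cost breakdown plus a total."""
    # (1) LLM usage: each prompt holds the question plus the retrieved chunks.
    prompt_tokens = c.queries_per_month * (
        c.question_tokens + c.chunks_per_query * c.tokens_per_chunk
    )
    completion_tokens = c.queries_per_month * c.output_tokens
    llm = (
        prompt_tokens / 1000 * c.price_llm_input_per_1k
        + completion_tokens / 1000 * c.price_llm_output_per_1k
    )
    # (2) Embeddings: re-embedded documents plus one embedding per query.
    embedded_tokens = (
        c.corpus_tokens * c.reembed_fraction
        + c.queries_per_month * c.question_tokens
    )
    embedding = embedded_tokens / 1000 * c.price_embedding_per_1k
    breakdown = {
        "llm": llm,
        "embedding": embedding,
        "vector_store": c.vector_store_monthly,  # (3)
        "ingestion": c.ingestion_monthly,        # (4)
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Placeholder workload and prices, chosen only to make the arithmetic easy.
baseline = RagCostInputs(
    queries_per_month=1000, chunks_per_query=4, tokens_per_chunk=250,
    question_tokens=50, output_tokens=200, corpus_tokens=1_000_000,
    reembed_fraction=0.25, price_llm_input_per_1k=1.0,
    price_llm_output_per_1k=2.0, price_embedding_per_1k=0.5,
    vector_store_monthly=20.0, ingestion_monthly=5.0,
)
fewer_chunks = replace(baseline, chunks_per_query=2)

print(estimate_monthly_cost(baseline))
print(estimate_monthly_cost(fewer_chunks))
```

Comparing the two printed breakdowns shows the effect of retrieving fewer chunks: only the LLM input portion shrinks, while the embedding, storage, and ingestion components stay the same.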

Category: ai-ml

Difficulty: advanced
