RAG
Definition
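Retrieval-Augmented Generation (RAG): a technique that retrieves relevant documents at query time and passes them to a language model as context, so the generated response is grounded in that source material rather than in the model's training data alone.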
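The retrieve-then-generate flow can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-sentence knowledge base, the function names, and the use of bag-of-words cosine similarity in place of the learned embeddings and vector index a production system would typically use. The final call to a language model is left out; the sketch stops at the grounded prompt that would be sent to it.

```python
import math
import re
from collections import Counter

# Hypothetical mini knowledge base; a real system would index actual documents.
DOCS = [
    "Refunds are issued within 14 days of a return request.",
    "Premium subscribers get priority support by chat.",
    "Passwords can be reset from the account settings page.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]

def build_prompt(query, docs, k=1):
    """Assemble a prompt that places the retrieved context before the question."""
    hits = retrieve(query, docs, k)
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(hits))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

prompt = build_prompt("How do I reset my password?", DOCS)
print(prompt)
```

The numbered context entries are one simple way to let the model cite which retrieved passage supports its answer.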
Use Cases
- Klarna: Customer support assistant that answers questions using internal help-center and policy content to reduce human agent workload. — Klarna publicly described deploying an AI assistant that uses retrieval over company knowledge (e.g., FAQs and support articles) to ground responses, with LLM-based generation on top of retrieved content and continuous updates as documentation changes. (Klarna reported the assistant handled a large share of customer service interactions and improved response speed while reducing support workload.)
- Morgan Stanley: Internal assistant for financial advisors to quickly find and summarize firm research and documents for client interactions. — Morgan Stanley publicly described using GPT-based models with retrieval over internal document repositories so responses are grounded in approved content and can cite sources for compliance and trust. (Morgan Stanley reported improved advisor productivity by reducing time spent searching across documents and accelerating access to relevant research.)
- Duolingo: AI-powered learning features that generate explanations and help while referencing course content and learning context to keep responses aligned with curriculum. — Duolingo has publicly discussed using LLMs in product features; a common approach for such features is retrieval of relevant lesson content/context before generation to keep outputs consistent with learning materials. (Duolingo reported increased engagement and expanded premium offerings tied to AI features.)
Frequently Asked Questions
- What's the difference between RAG and fine-tuning?
- RAG looks up relevant information at question time (from documents, databases, or search indexes) and then has the model write an answer using that retrieved context. Fine-tuning changes the model’s behavior by training it on examples ahead of time. Use RAG when you need up-to-date, source-grounded answers; use fine-tuning when you need consistent style, formatting, or task behavior and the knowledge doesn’t change often.
- When should I use RAG?
- Use RAG when answers must be based on specific, changing, or proprietary information—like product docs, policies, tickets, contracts, or internal wikis. It’s especially useful when you need citations, want to reduce hallucinations, or can’t put sensitive knowledge into model training. If your task is mostly general reasoning with no private knowledge, plain prompting may be enough.
- How much does RAG cost?
- RAG cost depends on (1) LLM usage (input/output tokens), (2) embedding generation for documents and queries, (3) vector storage and search (index size, query volume, latency tier), and (4) ingestion/ETL (parsing, chunking, re-indexing). Costs rise with more documents, higher query volume, larger context windows, and frequent re-embedding. Common ways to control cost are to retrieve fewer, higher-quality chunks, cache results for repeated queries, and use smaller models for intermediate steps such as query rewriting or reranking while reserving larger models for final answers.
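The cost components in the last answer can be turned into a rough back-of-the-envelope estimate. The sketch below is illustrative only: the function name, the parameters, and every price and volume in the example call are hypothetical placeholders, not any provider's actual rates. It also leaves out the compute cost of parsing and chunking during ingestion.

```python
def monthly_rag_cost(
    queries,                # queries per month
    prompt_tokens,          # input tokens per query (question + retrieved chunks)
    output_tokens,          # generated tokens per query
    query_embed_tokens,     # tokens embedded per query
    reembedded_doc_tokens,  # document tokens (re-)embedded per month
    price_in,               # USD per 1M LLM input tokens (hypothetical)
    price_out,              # USD per 1M LLM output tokens (hypothetical)
    price_embed,            # USD per 1M embedding tokens (hypothetical)
    vector_db_fee,          # flat monthly USD fee for vector storage/search
):
    """Estimate monthly RAG spend in USD, broken down by component."""
    llm = queries * (prompt_tokens * price_in + output_tokens * price_out) / 1_000_000
    embedding = (
        (queries * query_embed_tokens + reembedded_doc_tokens) * price_embed / 1_000_000
    )
    return {
        "llm": llm,
        "embedding": embedding,
        "vector_db": vector_db_fee,
        "total": llm + embedding + vector_db_fee,
    }

# Hypothetical workload and prices, chosen only to make the arithmetic easy to follow.
baseline = monthly_rag_cost(
    queries=100_000, prompt_tokens=2_000, output_tokens=300,
    query_embed_tokens=20, reembedded_doc_tokens=50_000_000,
    price_in=1.0, price_out=4.0, price_embed=0.1, vector_db_fee=50.0,
)

# Same workload, but retrieving fewer chunks halves the prompt size.
leaner = monthly_rag_cost(
    queries=100_000, prompt_tokens=1_000, output_tokens=300,
    query_embed_tokens=20, reembedded_doc_tokens=50_000_000,
    price_in=1.0, price_out=4.0, price_embed=0.1, vector_db_fee=50.0,
)

print(baseline)
print(leaner)
```

With these placeholder figures, LLM tokens dominate the bill (roughly 320 of about 375 USD per month), and halving the retrieved context cuts the total by about 100 USD. This is why retrieving fewer, better chunks is usually the first lever to pull.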
Category: ai-ml
Difficulty: advanced