Course 15 · Lesson 2 of 8

RAG Architecture & Its Failure Modes

Retrieval-Augmented Generation (RAG) is the workhorse pattern for grounding an LLM in your own knowledge. Most teams build the happy path and then get burned by the retrieval half, which is where RAG most often fails.

The pipeline

Chunk + embed: docs → vectors
Retrieve: top-k relevant chunks
Augment: stuff into the prompt
Generate: answer + citations
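
The four stages above can be sketched end to end in a few lines. This is a minimal illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store, and every name here (`chunk`, `embed`, `retrieve`, `augment`, the sample `docs`) is hypothetical. The generate step is left as a comment because it depends on whichever LLM client you use.

```python
import math
import re
from collections import Counter

def chunk(text, size=12):
    """Split text into chunks of at most `size` words (toy fixed-size chunking)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy embedding: bag-of-words counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, index, k=2):
    """Return the k indexed chunks most similar to the query."""
    q = embed(query)
    return sorted(index, key=lambda item: cosine(q, item["vector"]), reverse=True)[:k]

def augment(query, hits):
    """Build a prompt that carries the retrieved chunks and their ids."""
    context = "\n".join(f"[{h['id']}] {h['text']}" for h in hits)
    return (
        "Answer using only the context below and cite chunk ids.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

# Sample corpus (made up for illustration).
docs = {
    "refunds": "Refunds are issued within 14 days of purchase.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
    "support": "Support is available by email on weekdays.",
}

# Chunk + embed: docs -> vectors
index = [
    {"id": f"{name}#{i}", "text": c, "vector": embed(c)}
    for name, text in docs.items()
    for i, c in enumerate(chunk(text))
]

question = "How long does shipping take?"
hits = retrieve(question, index, k=1)   # Retrieve
prompt = augment(question, hits)        # Augment
# Generate: send `prompt` to your LLM of choice and return its answer with the chunk ids.
```

Keeping each stage as its own function means the retrieval step can be inspected and tested without calling a model at all.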

Where RAG actually breaks

Retrieval miss: Right answer exists but the wrong chunks were fetched. Garbage in → confident garbage out.
Chunking: Chunks too big (noise) or too small (lost context) wreck relevance.
Stale index: Source changed; embeddings didn’t. Answers from the past.
No grounding check: Model answers from training data, ignoring (or contradicting) the retrieved context.
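
Two of these failure modes have cheap mechanical guards. The sketch below shows one common mitigation for each, under stated assumptions: overlapping word windows so a sentence split at a chunk boundary still appears whole in a neighbouring chunk, and a content hash stored at indexing time so changed sources can be detected and re-embedded. The function names, the word-based sizing, and the choice of SHA-256 are illustrative, not prescribed by the lesson.

```python
import hashlib

def chunk_with_overlap(text, size=50, overlap=10):
    """Split text into word windows of `size`, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def content_hash(text):
    """Fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_documents(sources, indexed_hashes):
    """Return ids whose content changed since indexing, or that were never indexed."""
    return sorted(
        doc_id
        for doc_id, text in sources.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )

# Hashes recorded when the index was built (sample data).
indexed_hashes = {
    "pricing": content_hash("Pro plan costs $20 per month."),
    "refunds": content_hash("Refunds are issued within 14 days."),
}
# The pricing page was edited after indexing.
sources = {
    "pricing": "Pro plan costs $25 per month.",
    "refunds": "Refunds are issued within 14 days.",
}

needs_reembedding = stale_documents(sources, indexed_hashes)
windows = chunk_with_overlap("a b c d e f g h i j", size=4, overlap=1)
```

Running the staleness check on a schedule, or on every source update, keeps the embeddings from drifting behind the documents they represent.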

Evaluate retrieval separately

Measure retrieval (did we fetch the right chunks?) independently from generation (did we answer well from them?). Most “the LLM is dumb” bugs are actually retrieval misses.
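
One way to measure the retrieval half on its own is recall@k over a small hand-labeled set: for each query, what fraction of the chunks known to be relevant show up in the top k results? The sketch below assumes you have such labels; the chunk ids, queries, and the `fake_results` lookup that stands in for a real retriever are all made up for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each labeled query needs at least one relevant chunk")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def mean_recall_at_k(labeled, retrieve_fn, k):
    """Average recall@k over a labeled query set. No LLM call is involved."""
    scores = [recall_at_k(retrieve_fn(ex["query"]), ex["relevant"], k) for ex in labeled]
    return sum(scores) / len(scores)

# Hand-labeled queries: which chunks *should* come back (sample data).
labeled = [
    {"query": "refund window", "relevant": {"refunds#0"}},
    {"query": "shipping time", "relevant": {"shipping#0", "shipping#1"}},
]

# Stand-in for a real retriever: ranked chunk ids per query.
fake_results = {
    "refund window": ["refunds#0", "support#0"],
    "shipping time": ["support#0", "shipping#1", "shipping#0"],
}

mean_recall = mean_recall_at_k(labeled, fake_results.__getitem__, k=2)
print(f"recall@2 = {mean_recall:.2f}")  # prints "recall@2 = 0.75"
```

If this number is low, no amount of prompt tuning will fix the answers; if it is high and answers are still poor, the problem is on the generation side.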

Design for citations

Return the source chunks alongside the answer. Citations make the system verifiable and debuggable, and they turn a black box into something you can audit.
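
A response shape that carries its sources might look like the sketch below. The `Citation` and `RagAnswer` classes, the `[chunk_id]` marker convention in the answer text, and the audit helper are assumptions chosen for illustration; the point is only that the answer and the chunks it relied on travel together, so a marker that points at nothing can be caught automatically.

```python
import re
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Citation:
    chunk_id: str
    source: str
    text: str

@dataclass
class RagAnswer:
    answer: str
    citations: list[Citation] = field(default_factory=list)

def build_answer(answer_text, hits):
    """Bundle the generated text with the chunks that were placed in the prompt."""
    citations = [Citation(h["id"], h["source"], h["text"]) for h in hits]
    return RagAnswer(answer=answer_text, citations=citations)

def unknown_citation_ids(response):
    """Return [id] markers in the answer that match no returned citation."""
    cited = re.findall(r"\[([^\]]+)\]", response.answer)
    known = {c.chunk_id for c in response.citations}
    return [cid for cid in cited if cid not in known]

# Sample retrieved chunk and two candidate model outputs.
hits = [{
    "id": "shipping#0",
    "source": "shipping.md",
    "text": "Standard shipping takes 3 to 5 business days.",
}]
good = build_answer("Shipping takes 3 to 5 business days [shipping#0].", hits)
bad = build_answer("Shipping is free [pricing#2].", hits)
```

Here `unknown_citation_ids(good)` is empty, while `unknown_citation_ids(bad)` flags `pricing#2`, a chunk that was never retrieved. That is a cheap signal that the model may have answered from somewhere other than the provided context.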

Takeaway

RAG = chunk→retrieve→augment→generate, and it fails mostly at retrieval. Evaluate retrieval separately, keep the index fresh, and return citations so the system is auditable.