Evaluating RAG pipelines with the RAG triad


Retrieval-Augmented Generation (RAG) has emerged as a dominant framework for feeding Large Language Models (LLMs) context beyond the scope of their training data, enabling them to ground their answers in that context and hallucinate less.

However, designing an effective RAG pipeline can be challenging. You need to answer questions such as the following (see the configuration sketch after the list):

  1. How should you parse and chunk text documents for vector embedding? What chunk size and overlap size should you use?
  2. What vector embedding model should you use?
  3. What retrieval method should you use to fetch the relevant context? How many documents should you retrieve by default? Does the retriever actually manage to retrieve the relevant documents?
  4. Does the generator actually generate content that is in line with the retrieved context? What parameters (model, prompt template, temperature) work best?
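To make these choices concrete, here is a minimal sketch of what a pipeline configuration might look like in Python. The field names, model identifiers, and defaults are purely illustrative placeholders, not recommendations; each one is a knob you may want to vary and evaluate.

```python
from dataclasses import dataclass


# Hypothetical configuration object that makes the design questions above
# explicit. Field names and defaults are illustrative, not prescriptive.
@dataclass
class RAGConfig:
    # 1. Parsing and chunking
    chunk_size: int = 512        # tokens (or characters) per chunk
    chunk_overlap: int = 64      # overlap between consecutive chunks

    # 2. Embedding
    embedding_model: str = "all-MiniLM-L6-v2"    # placeholder embedding model

    # 3. Retrieval
    retrieval_method: str = "dense"              # e.g. dense, BM25, hybrid
    top_k: int = 5                               # documents returned per query

    # 4. Generation
    generator_model: str = "gpt-4o-mini"         # placeholder chat model
    temperature: float = 0.0                     # low temperature for grounded answers
    prompt_template: str = (
        "Answer the question using only the context below.\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


config = RAGConfig()
print(config)
```

Every field in this sketch is a candidate variable in an evaluation experiment: change one, re-run the pipeline, and compare the results.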

The only way to objectively answer these questions is to measure how well the RAG pipeline works, but what exactly do you measure, and how do you measure it? This is the topic I’ll cover here.


See also