RAG evaluation
Practical evaluations for improving RAG performance
You can set this evaluation up using our API configuration.
Retrieval-augmented generation (RAG) is a technique for enhancing LLM prompts by retrieving relevant context from external datasets and including it in the prompt. This evaluation example is meant to protect against low-quality retrievals, hallucinations, and other risks that come with relying on arbitrary web documents. To do that, we'll verify both the retrieval step and the correctness of the LLM output.
Risks with RAG
- Low-Quality Retrievals: The effectiveness of RAG depends on the quality of the retrieved context documents. The model can only answer as well as the passages it's given. For example, a legal advice AI that retrieves outdated case law is likely to output inaccurate advice.
- Hallucination and Fabrication: RAG models frequently generate plausible but incorrect statements by over-relying on the language model's priors instead of the retrieved context. Consider a travel planning AI that fabricates details about non-existent tourist attractions; users who plan trips around those details end up with a damaged experience.
- Safety and Control Risks: Conditioning text generation on arbitrary web documents means RAG models can inadvertently spread harmful content. For example, a health information AI might accidentally source content from unreliable websites, promoting dangerous health myths. It's imperative to address these safety and control risks to uphold the integrity of the app.
How to evaluate RAG
- Evaluate Document Retrieval: Continuously monitor and improve the retrieval step using metrics such as relevance scoring, so that the most relevant documents are pulled from the vector store (a minimal sketch of one such check follows this list).
- Compare LLM Output: Conduct thorough testing of various prompts and models to assess the correctness and relevance of the LLM output. Use Velvet to compare outputs across different scenarios, ensuring that the model provides accurate and contextually appropriate responses.
- End-to-End Quality Evaluation: Test the entire RAG pipeline from document retrieval to LLM output generation to ensure overall performance and quality.
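As a concrete example of relevance scoring, a simple recall-at-k check measures how many hand-labeled expected documents actually show up in the top results. This is a minimal sketch; the retrieval function and labeled queries are hypothetical stand-ins, not part of the Velvet API:

from typing import Callable

def recall_at_k(
    retrieve: Callable[[str, int], list[str]],  # (query, k) -> retrieved document URLs/ids
    labeled_queries: dict[str, set[str]],       # query -> expected document URLs/ids
    k: int = 10,
) -> float:
    """Fraction of expected documents that appear in the top-k retrieval results."""
    hits, total = 0, 0
    for query, expected in labeled_queries.items():
        results = set(retrieve(query, k))
        hits += len(expected & results)
        total += len(expected)
    return hits / total if total else 0.0

Tracking a metric like this over time makes it easy to spot regressions in the retrieval step before they show up as bad LLM answers.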
Example
Evaluating Document Retrieval
Document retrieval is the first step in a RAG pipeline. It is essential to evaluate this step in isolation to ensure the most relevant documents are fetched.
Here’s an example of how to call an API to test document retrieval:
{
  "providers": [
    {
      "id": "https",
      "config": {
        "url": "https://example-document-retrieval.com/query",
        "method": "POST",
        "headers": {
          "Content-Type": "application/json"
        },
        "body": {
          "query": "{{query}}"
        }
      }
    }
  ]
}
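Before wiring this provider into an evaluation, it can help to hit the endpoint directly and inspect what comes back. Here is a minimal sketch using Python's requests library; the URL is the placeholder from the config above, and the response shape depends entirely on your own retrieval service:

import requests

# Sanity-check the retrieval endpoint directly before running evaluations.
# Swap the placeholder URL for your own retrieval service.
response = requests.post(
    "https://example-document-retrieval.com/query",
    headers={"Content-Type": "application/json"},
    json={"query": "Top AI startups with most funding"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # inspect the returned documents / URLs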
Document Retrieval Configuration
Set up an evaluation that runs live document retrieval against the vector database. Create a config with test cases to verify that the expected documents appear in the retrieval results.
Here’s an example config of a document retrieval evaluation:
{
  "name": "RAG Document Retrieval Test",
  "prompts": [
    "{{query}}"
  ],
  "providers": [
    {
      "id": "https",
      "config": {
        "url": "https://example-document-retrieval.com/query",
        "method": "POST",
        "headers": {
          "Content-Type": "application/json"
        },
        "body": {
          "query": "{{query}}"
        }
      }
    }
  ],
  "tests": [
    {
      "vars": {
        "query": "Top AI startups with most funding"
      },
      "assert": [
        {
          "type": "contains-all",
          "value": [
            "https://www.crunchbase.com/organization/anthropic",
            "https://www.crunchbase.com/organization/openai",
            "https://www.crunchbase.com/organization/databricks"
          ]
        }
      ]
    },
    {
      "vars": {
        "query": "AI startup founders with a dog"
      },
      "assert": [
        {
          "type": "contains-all",
          "value": [
            "https://www.linkedin.com/in/philippekahn",
            "https://www.linkedin.com/in/juliannaelamb"
          ]
        }
      ]
    }
  ]
}
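The contains-all assertion passes only when every listed value appears somewhere in the provider's output. If you want to reproduce that check outside the evaluation harness (for example, in a unit test), the logic is roughly the following sketch, assuming the retrieval response has been serialized to a string:

def contains_all(output: str, expected_values: list[str]) -> bool:
    """Rough local equivalent of a contains-all assertion: every expected
    value must appear as a substring of the serialized provider output."""
    return all(value in output for value in expected_values)

expected = [
    "https://www.crunchbase.com/organization/anthropic",
    "https://www.crunchbase.com/organization/openai",
    "https://www.crunchbase.com/organization/databricks",
]
# `retrieval_output` stands in for the stringified response from your retrieval provider.
retrieval_output = '{"documents": ["https://www.crunchbase.com/organization/openai"]}'
print(contains_all(retrieval_output, expected))  # False until all three URLs are present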
Evaluating LLM Output and End-to-End Quality
Once you're confident in the retrieval step, evaluate the LLM's performance. Set up a prompt and test cases that check the LLM output for correctness and relevance.
Example prompt:
You are a chatbot that answers questions.
Respond to this query in one statement: {{query}}
Here is some context that you can use to write your response: {{context}}
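Conceptually, the evaluation renders the {{query}} and {{context}} variables into this prompt and sends it to each model under test. A minimal sketch of that step with the OpenAI Python SDK follows; the variable values and model choice are just examples, not a required setup:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are a chatbot that answers questions.\n"
    "Respond to this query in one statement: {query}\n"
    "Here is some context that you can use to write your response: {context}"
)

def answer(query: str, context: str, model: str = "gpt-4o-mini") -> str:
    # Render the template, then ask the model for a single-statement answer.
    prompt = PROMPT_TEMPLATE.format(query=query, context=context)
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

# Example, using the same query/context pair as the end-to-end config below:
# print(answer("What is the top AI startup with the most funding?", context_string))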
End-to-End Configuration
Here’s an example RAG experiment configuration:
{
  "name": "RAG Experiment",
  "description": "This is an example RAG experiment that evaluates RAG context metrics across two OpenAI models.",
  "prompts": [
    "You are a chatbot that answers questions.\nRespond to this query in one statement: {{query}}\nHere is some context that you can use to write your response: {{context}}"
  ],
  "providers": [
    { "id": "openai:gpt-4o" },
    { "id": "openai:gpt-4o-mini" }
  ],
  "tests": [
    {
      "vars": {
        "query": "What is the top AI startup with the most funding?",
        "context": "Adept: AI model developer, $415 M, 2022, San Francisco, California, United States; Anduril Industries: Defense software and hardware, $2.8 B, 2017, Costa Mesa, California, United States; Anthropic: AI model developer, $7.7 B, 2020, San Francisco, California, United States; Databricks: Data storage and analytics, $4 B, 2013, San Francisco, California, United States; Glean: Enterprise search engine, $360 M, 2019, Palo Alto, California, United States; Hugging Face: Library for AI models and datasets, $395 M, 2016, New York, New York, United States; Insitro: Drug discovery and development, $643 M, 2018, San Francisco, California, United States; Notion: Productivity software, $330 M, 2013, San Francisco, California, United States; OpenAI: AI model developer, $11.3 B, 2015, San Francisco, California, United States; Scale AI: Data labeling and software, $600 M, 2016, San Francisco, California, United States."
      },
      "assert": [
        {
          "type": "icontains",
          "value": "openai"
        },
        {
          "type": "factuality",
          "value": "OpenAI with $11.3 billion raised as of 2023"
        },
        {
          "type": "answer-relevance",
          "threshold": 0.8
        },
        {
          "type": "similar",
          "value": "top AI startup with the most funding is OpenAI, with $11.3 billion raised",
          "threshold": 0.8
        }
      ]
    }
  ]
}
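The similar assertion compares the model's answer against a reference string by embedding both and measuring how close they are. One way such a check could be implemented is cosine similarity over OpenAI embeddings; this sketch is an illustration of the idea, not necessarily how the actual grader works, and the embedding model is just an example:

import math
from openai import OpenAI

client = OpenAI()

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similar(output: str, reference: str, threshold: float = 0.8) -> bool:
    # Embed both strings and pass if their cosine similarity clears the threshold.
    embeddings = client.embeddings.create(
        model="text-embedding-3-small",
        input=[output, reference],
    ).data
    return cosine_similarity(embeddings[0].embedding, embeddings[1].embedding) >= threshold

# Example, mirroring the "similar" assertion above:
# similar(model_output, "top AI startup with the most funding is OpenAI, with $11.3 billion raised")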
By following this approach, you can ensure the quality of your RAG pipeline and improve it continuously.
For a fully functioning demo experiment, refer to the RAG experiment example in the Velvet demo space.
Summary
Retrieval-augmented generation (RAG) is a powerful way to enhance LLM prompts with external data, but it comes with real challenges: low-quality retrievals, hallucinations, and the risks of relying on arbitrary web content. Addressing them takes a comprehensive strategy that combines rigorous evaluation of document retrieval, careful comparison of LLM outputs, and thorough end-to-end quality assessments.