LLM-based evaluations reference
Machine learning-based evaluations of LLM outputs
Overview
This page contains a library of LLM-based metrics you can leverage:
- Similarity Metrics: similar
- Rubric and Grading Metrics: llm-rubric, model-graded-closedqa
- Relevance and Context Metrics: answer-relevance, context-faithfulness, context-recall, context-relevance
- Factual Accuracy: factuality
- Selection Metrics: select-best
Use these metrics to measure and improve the quality of your LLM outputs.
LLM-based assertion types
| Assertion Type | Method |
|---|---|
| similar | Embedding cosine similarity is above a threshold |
| llm-rubric | LLM output matches a given rubric, using a language model to grade the output |
| answer-relevance | Ensure that the LLM output is related to the original query |
| context-faithfulness | Ensure that the LLM output uses the context |
| context-recall | Ensure that the ground truth appears in the context |
| context-relevance | Ensure that the context is relevant to the original query |
| factuality | LLM output adheres to the given facts, using the Factuality method from OpenAI evals |
| model-graded-closedqa | LLM output adheres to given criteria, using the Closed QA method from OpenAI evals |
| select-best | Compare multiple outputs for a test case and pick the best one |
See examples below:
llm-rubric / model-graded-closedqa
{
  "assert": [
    {
      "type": "llm-rubric", // or model-graded-closedqa
      "value": "Is funny"
    }
  ]
}
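model-graded-closedqa uses the same structure; the value states the criterion the output must satisfy. A minimal sketch (the criterion text below is only an illustration):
{
  "assert": [
    {
      "type": "model-graded-closedqa",
      "value": "Answers the question without mentioning a competitor"
    }
  ]
}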
factuality
{
  "assert": [
    {
      "type": "factuality",
      "value": "Sacramento is the capital of California"
    }
  ]
}
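For example, with the assertion above, an output like "Sacramento is California's capital city" would be graded as consistent with the given fact, while an output naming a different city as the capital would fail.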
similar
The similar assertion checks whether the LLM output is semantically similar to the expected value by comparing embeddings against a cosine similarity threshold. By default, embeddings are computed with OpenAI's text-embedding-3-large model.
{
  "assert": [
    {
      "type": "similar",
      "value": "The expected output",
      "threshold": 0.8
    }
  ]
}
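With this configuration, the assertion passes when the cosine similarity between the embeddings of the model's output and "The expected output" is at least 0.8.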
select-best
The select-best assertion type compares multiple outputs for a test case and selects the one that best meets the specified criterion.
{
  "prompts": [
    "Write a tweet about {{topic}}",
    "Write a very concise, funny tweet about {{topic}}"
  ],
  "providers": [
    "openai:gpt-4"
  ],
  "tests": [
    {
      "vars": {
        "topic": "bananas"
      },
      "assert": [
        {
          "type": "select-best",
          "value": "choose the funniest tweet"
        }
      ]
    },
    {
      "vars": {
        "topic": "nyc"
      },
      "assert": [
        {
          "type": "select-best",
          "value": "choose the tweet that contains the most facts"
        }
      ]
    }
  ]
}
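Here, both prompt variants are run for each test case, and the output that best satisfies the value criterion (the funniest tweet, or the tweet with the most facts) is selected as the winner.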
RAG-based metrics
RAG metrics require variables named context and query. You must also set the threshold property on each assertion in your test (all scores are normalized between 0 and 1).
{
  "name": "RAG Experiment",
  "description": "This is an example RAG experiment that evaluates RAG context metrics across two OpenAI models.",
  "prompts": [
    "You are a chatbot that answers questions about the Velvet product.\nRespond to this query in one statement: {{query}}\nHere is some context that you can use to write your response: {{context}}"
  ],
  "providers": [
    "openai:gpt-4o",
    "openai:gpt-4o-mini"
  ],
  "tests": [
    {
      "vars": {
        "query": "How many lines of code do I need to get started with Velvet?",
        "context": "Velvet is a proxy to warehouse every LLM request to your database. Use the gateway to optimize usage and cost, run experiments, and generate datasets for fine-tuning. Just 2 lines of code to get started. Every request will be stored to your database."
      },
      "assert": [
        {
          "type": "factuality",
          "value": "2 lines"
        },
        {
          "type": "answer-relevance",
          "threshold": 0.8
        },
        {
          "type": "context-recall",
          "threshold": 0.8,
          "value": "2 lines of code to get started"
        },
        {
          "type": "context-relevance",
          "threshold": 0.8
        },
        {
          "type": "context-faithfulness",
          "threshold": 0.8
        }
      ]
    }
  ]
}
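In this example, factuality checks the answer against the stated fact ("2 lines"), context-recall checks that the ground truth ("2 lines of code to get started") appears in the provided context, and answer-relevance, context-relevance, and context-faithfulness score, respectively, how relevant the answer is to the query, how relevant the context is to the query, and how faithfully the answer uses the context, each against a 0.8 threshold.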