LLM-based evaluations reference

Machine-learning-based evaluations of LLM outputs

Overview

This page contains a library of LLM-based metrics you can leverage:

  • Similarity Metrics: similar
  • Rubric and Grading Metrics: llm-rubric, model-graded-closedqa
  • Relevance and Context Metrics: answer-relevance, context-faithfulness, context-recall, context-relevance
  • Factual Accuracy: factuality
  • Selection Metrics: select-best

Use these metrics to automatically grade, compare, and validate your LLM outputs.
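Each metric is applied as an entry in a test's assert array. For orientation, a minimal experiment config has this shape (the prompt, provider, and rubric below are illustrative placeholders; full examples follow later on this page):

{
  "prompts": [
    "Answer this question: {{query}}"
  ],
  "providers": [
    "openai:gpt-4o"
  ],
  "tests": [
    {
      "vars": {
        "query": "What is Velvet?"
      },
      "assert": [
        {
          "type": "llm-rubric",
          "value": "Answers the question directly"
        }
      ]
    }
  ]
}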

LLM-based Types

| Assertion Type | Method |
| --- | --- |
| similar | Embedding cosine similarity is above a threshold |
| llm-rubric | LLM output matches a given rubric, using a language model to grade the output |
| answer-relevance | Ensures the LLM output is related to the original query |
| context-faithfulness | Ensures the LLM output uses the context |
| context-recall | Ensures the ground truth appears in the context |
| context-relevance | Ensures the context is relevant to the original query |
| factuality | LLM output adheres to the given facts, using the Factuality method from OpenAI Evals |
| model-graded-closedqa | LLM output adheres to given criteria, using the Closed QA method from OpenAI Evals |
| select-best | Compares multiple outputs for a test case and picks the best one |

See examples below:

llm-rubric / model-graded-closedqa

Both assertion types grade the LLM output against a plain-language criterion:

{
  "assert": [
    {
      "type": "llm-rubric",
      "value": "Is funny"
    }
  ]
}
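To use the Closed QA grading method from OpenAI Evals instead, set the type to model-graded-closedqa. The criterion below is illustrative:

{
  "assert": [
    {
      "type": "model-graded-closedqa",
      "value": "Does not mention any competitor products"
    }
  ]
}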

factuality

The factuality assertion checks that the LLM output is factually consistent with a reference statement, using the Factuality method from OpenAI Evals:

{
  "assert": [
    {
      "type": "factuality",
      "value": "Sacramento is the capital of California"
    }
  ]
}
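In a full test, the reference statement is paired with a prompt that should elicit it. Here is a minimal sketch; the prompt and provider are placeholders, following the config shape used elsewhere on this page:

{
  "prompts": [
    "What is the capital of {{state}}?"
  ],
  "providers": [
    "openai:gpt-4o-mini"
  ],
  "tests": [
    {
      "vars": {
        "state": "California"
      },
      "assert": [
        {
          "type": "factuality",
          "value": "Sacramento is the capital of California"
        }
      ]
    }
  ]
}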

similar

The similar assertion embeds both the LLM output and the expected value, then checks that their cosine similarity meets the given threshold. Cosine similarity is the dot product of the two embedding vectors divided by the product of their magnitudes: identical embeddings score 1, while unrelated ones score near 0.

By default, embeddings are computed with OpenAI's text-embedding-3-large model.

{
  "assert": [
    {
      "type": "similar",
      "value": "The expected output",
      "threshold": 0.8
    }
  ]
}

select-best

The select-best assertion compares multiple outputs for the same test case and picks the one that best meets the specified criterion. In the example below, each test runs two prompt variants and the assertion selects between the resulting tweets:

{
  "prompts": [
    "Write a tweet about {{topic}}",
    "Write a very concise, funny tweet about {{topic}}"
  ],
  "providers": [
    "openai:gpt-4"
  ],
  "tests": [
    {
      "vars": {
        "topic": "bananas"
      },
      "assert": [
        {
          "type": "select-best",
          "value": "choose the funniest tweet"
        }
      ]
    },
    {
      "vars": {
        "topic": "nyc"
      },
      "assert": [
        {
          "type": "select-best",
          "value": "choose the tweet that contains the most facts"
        }
      ]
    }
  ]
}

RAG-based metrics

RAG metrics require variables named context and query. You must also set the threshold property on each RAG assertion; all scores are normalized between 0 and 1, and an assertion passes only if its score reaches the threshold.

{
  "name": "RAG Experiment",
  "description":"This is an example RAG experiment that evaluates RAG context metrics across two openai models.",
  "prompts": [
    "You are a chatbot that answers questions about the Velvet product.\nRespond to this query in one statement: {{query}}\nHere is some context that you can use to write your response: {{context}}"
  ],
  "providers": [
    "openai:gpt-4o",
    "openai:gpt-4o-mini"
  ],
  "tests": [
    {
      "vars": {
        "query": "How many lines of code do I need to get started with Velvet?",
        "context": "Velvet is a proxy to warehouse every LLM request to your database. Use the gateway to optimize usage and cost, run experiments, and generate datasets for fine-tuning. Just 2 lines of code to get started. Every request will be stored to your database."
      },
      "assert": [
        {
          "type": "factuality",
          "value": "2 lines"
        },
        {
          "type": "answer-relevance",
          "threshold": 0.8
        },
        {
          "type": "context-recall",
          "threshold": 0.8,
          "value": "2 lines of code to get started"
        },
        {
          "type": "context-relevance",
          "threshold": 0.8
        },
        {
          "type": "context-faithfulness",
          "threshold": 0.8
        }
      ]
    }
  ]
}