Deterministic evals

Reference guide to logical evaluations on LLM output

This page contains a library of different metrics and validations you can leverage, including but not limited to:

  • Text Validation: contains, contains-all, contains-any
  • Performance Metrics: cost, latency, perplexity
  • Equality and Fuzzy Match: equals, levenshtein, regex
  • JSON Validation: is-json, contains-json
  • Custom Logic: javascript

By leveraging these evaluation metrics, you can enhance the reliability, performance, and accuracy of your LLM outputs and ensure they meet your specific goals.


Deterministic Types

Assertion Type    Returns true if...
contains          output contains the substring
contains-all      output contains all of the listed substrings
contains-any      output contains any of the listed substrings
contains-json     output contains valid JSON (optional JSON schema validation)
cost              inference cost is below a threshold
equals            output matches exactly
icontains         output contains the substring, case insensitive
icontains-all     output contains all of the listed substrings, case insensitive
icontains-any     output contains any of the listed substrings, case insensitive
is-json           output is valid JSON (optional JSON schema validation)
javascript        the provided JavaScript function validates the output
latency           latency is below a threshold (milliseconds)
levenshtein       Levenshtein distance is below a threshold
perplexity        perplexity is below a threshold
regex             output matches the regex

Contains

The contains assertion checks if the LLM output contains the expected value.

Example:

{
  "assert": [
    {
      "type": "contains",
      "value": "The expected substring"
    }
  ]
}

The icontains assertion is the same, except that it ignores case:

{
  "assert": [
    {
      "type": "icontains",
      "value": "The expected substring"
    }
  ]
}
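
As a rough illustration of the difference between the two checks, here is a minimal sketch. The helper names contains and icontains are hypothetical stand-ins chosen to mirror the assertion types; this is not the evaluator's actual implementation.

```javascript
// Hypothetical helpers mirroring the two assertion types (illustration only).
function contains(output, value) {
  // Case-sensitive substring check.
  return output.includes(value);
}

function icontains(output, value) {
  // Lowercase both sides so the comparison ignores case.
  return output.toLowerCase().includes(value.toLowerCase());
}

const output = "Here is the expected substring you asked for.";
console.log(contains(output, "The expected substring"));  // false: capital "T" does not match
console.log(icontains(output, "The expected substring")); // true
```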


Contains-All

The contains-all assertion checks if the LLM output contains all of the specified values.

Example:

{
  "assert": [
    {
      "type": "contains-all",
      "value": [
        "Value 1",
        "Value 2",
        "Value 3"
      ]
    }
  ]
}


Contains-Any

The contains-any assertion checks if the LLM output contains at least one of the specified values.

Example:

{
  "assert": [
    {
      "type": "contains-any",
      "value": [
        "Value 1",
        "Value 2",
        "Value 3"
      ]
    }
  ]
}

For case-insensitive matching, use icontains-any in place of contains-any, and icontains-all in place of contains-all.
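
The list-based checks can be pictured as "every value matches" versus "at least one value matches". The sketch below assumes plain substring matching; the function names and the ignoreCase option are hypothetical and only stand in for the four assertion types.

```javascript
// Hypothetical helpers (illustration only, not the evaluator's code).
function normalize(text, ignoreCase) {
  return ignoreCase ? text.toLowerCase() : text;
}

// contains-all / icontains-all: every listed value must appear in the output.
function containsAll(output, values, { ignoreCase = false } = {}) {
  const haystack = normalize(output, ignoreCase);
  return values.every((v) => haystack.includes(normalize(v, ignoreCase)));
}

// contains-any / icontains-any: at least one listed value must appear.
function containsAny(output, values, { ignoreCase = false } = {}) {
  const haystack = normalize(output, ignoreCase);
  return values.some((v) => haystack.includes(normalize(v, ignoreCase)));
}

const output = "Value 1 and value 3 are present.";
console.log(containsAll(output, ["Value 1", "Value 3"]));                       // false
console.log(containsAll(output, ["Value 1", "Value 3"], { ignoreCase: true })); // true
console.log(containsAny(output, ["Value 2", "Value 1"]));                       // true
```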


Contains-JSON

The contains-json assertion checks if the LLM output contains a valid JSON structure.

Example:

{
  "assert": [
    {
      "type": "contains-json"
    }
  ]
}

You may optionally set a value as a JSON schema in order to validate the JSON contents:

{
  "assert": [
    {
      "type": "contains-json",
      "value": {
        "required": ["latitude", "longitude"],
        "type": "object",
        "properties": {
          "latitude": {
            "minimum": -90,
            "type": "number",
            "maximum": 90
          },
          "longitude": {
            "minimum": -180,
            "type": "number",
            "maximum": 180
          }
        }
      }
    }
  ]
}

See also: is-json
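
To make "contains a valid JSON structure" concrete, the sketch below scans the output for a span that parses as JSON. The extractJson helper is hypothetical and uses a brute-force search; the real assertion may locate JSON differently.

```javascript
// Hypothetical helper: find the first substring of `output` that parses as a
// JSON object or array. Brute force, intended only as an illustration.
function extractJson(output) {
  for (let start = 0; start < output.length; start++) {
    if (output[start] !== "{" && output[start] !== "[") continue;
    // Try the longest candidate first, then progressively shorter ones.
    for (let end = output.length; end > start; end--) {
      const last = output[end - 1];
      if (last !== "}" && last !== "]") continue;
      try {
        return { found: true, value: JSON.parse(output.slice(start, end)) };
      } catch {
        // Not valid JSON; try a shorter candidate.
      }
    }
  }
  return { found: false, value: undefined };
}

const reply = 'The coordinates are {"latitude": 48.85, "longitude": 2.35}. Done.';
console.log(extractJson(reply).found);               // true
console.log(extractJson("No JSON in here.").found);  // false
```

If a schema is supplied, the extracted value would then be validated against it in the same way as for is-json.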


Regex

The regex assertion checks if the LLM output matches the provided regular expression.

Example:

{
  "assert": [
    {
      "type": "regex",
      "value": "\\d{4}"
    }
  ]
}
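
The following sketch shows how such a pattern behaves, assuming the pattern is tested against the output without implicit anchoring (an assumption; add ^ and $ yourself if the whole output must match). The matchesRegex helper is hypothetical. Note that the doubled backslash is JSON string escaping, so the pattern the regex engine sees is \d{4}.

```javascript
// Hypothetical helper: build a RegExp from the string pattern and search the output.
function matchesRegex(output, pattern) {
  return new RegExp(pattern).test(output);
}

console.log(matchesRegex("Founded in 1998.", "\\d{4}"));   // true: four digits appear somewhere
console.log(matchesRegex("Founded in 98.", "\\d{4}"));     // false
console.log(matchesRegex("Founded in 1998.", "^\\d{4}$")); // false: anchors require the whole output to match
```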

Cost

The cost assertion checks if the cost of the LLM call is below a specified threshold. This requires LLM providers to return cost information. Currently, this is only supported by OpenAI GPT models and custom providers.

Example:

{
  "providers": [
    "openai:gpt-4o-mini",
    "openai:gpt-4"
  ],
  "assert": [
    {
      "type": "cost",
      "threshold": 0.001
    }
  ]
}


Equality

The equals assertion checks if the LLM output is equal to the expected value.

Example:

{
  "assert": [
    {
      "type": "equals",
      "value": "The expected output"
    }
  ]
}

You can also set value to a JSON object to check whether the output equals that JSON:

{
  "assert": [
    {
      "type": "equals",
      "value": { "key": "value" }
    }
  ]
}
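
One plausible reading of the JSON form is a structural comparison after parsing the output. The sketch below works under that assumption; equals and deepEqual are hypothetical helpers, and the actual assertion may compare differently (for example, by serialized string).

```javascript
// Hypothetical structural comparison of two JSON-compatible values.
function deepEqual(a, b) {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (k) => Object.prototype.hasOwnProperty.call(b, k) && deepEqual(a[k], b[k])
  );
}

// String expectations compare exactly; object expectations compare parsed JSON.
function equals(output, expected) {
  if (typeof expected === "string") return output === expected;
  try {
    return deepEqual(JSON.parse(output), expected);
  } catch {
    return false; // output was not valid JSON
  }
}

console.log(equals("The expected output", "The expected output")); // true
console.log(equals('{ "key": "value" }', { key: "value" }));       // true
console.log(equals('{ "key": "other" }', { key: "value" }));       // false
```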


Is-JSON

The is-json assertion checks if the LLM output is a valid JSON string.

Example:

{
  "assert": [
    {
      "type": "is-json"
    }
  ]
}

You may optionally set a value as a JSON schema. If set, the output will be validated against this schema:

{
  "assert": [
    {
      "type": "is-json",
      "value": {
        "required": ["latitude", "longitude"],
        "type": "object",
        "properties": {
          "latitude": {
            "minimum": -90,
            "type": "number",
            "maximum": 90
          },
          "longitude": {
            "minimum": -180,
            "type": "number",
            "maximum": 180
          }
        }
      }
    }
  ]
}
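
The sketch below illustrates the two stages: parse the whole output as JSON, then optionally validate it. The validate function is a hypothetical, hand-rolled check covering only the schema keywords used in the example above (type, required, properties, minimum, maximum); a real setup would rely on a full JSON Schema validator.

```javascript
// Hypothetical mini-validator: supports only the keywords used in the example schema.
function validate(value, schema) {
  if (schema.type === "object") {
    if (typeof value !== "object" || value === null || Array.isArray(value)) return false;
    for (const key of schema.required ?? []) {
      if (!(key in value)) return false;
    }
    for (const [key, sub] of Object.entries(schema.properties ?? {})) {
      if (key in value && !validate(value[key], sub)) return false;
    }
    return true;
  }
  if (schema.type === "number") {
    if (typeof value !== "number") return false;
    if (schema.minimum !== undefined && value < schema.minimum) return false;
    if (schema.maximum !== undefined && value > schema.maximum) return false;
    return true;
  }
  return true; // any other keyword is ignored in this sketch
}

// is-json: the entire output must parse; the schema check is optional.
function isJson(output, schema) {
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch {
    return false;
  }
  return schema ? validate(parsed, schema) : true;
}

const schema = {
  required: ["latitude", "longitude"],
  type: "object",
  properties: {
    latitude: { minimum: -90, type: "number", maximum: 90 },
    longitude: { minimum: -180, type: "number", maximum: 180 },
  },
};

console.log(isJson('{"latitude": 48.85, "longitude": 2.35}', schema)); // true
console.log(isJson('{"latitude": 123, "longitude": 2.35}', schema));   // false: latitude out of range
console.log(isJson('Sure! {"latitude": 48.85}'));                      // false: extra text around the JSON
```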


Javascript

The javascript assertion allows you to provide a custom JavaScript function to validate the LLM output.

A variable named output is injected into the context. The function should return true if the output passes the assertion, and false otherwise. If the function returns a number, it will be treated as a score.

You can use any valid JavaScript expression. For example, the following checks the name of the first function call in the output:

{
  "assert": [
    {
      "type": "javascript",
      "value": "output[0].function.name === 'get_current_weather'"
    }
  ]
}
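
To show the boolean and numeric return styles side by side, here is a minimal sketch of how an expression string could be evaluated with output in scope. The runAssertion helper and the shape of its result are assumptions made for illustration; the evaluator's actual mechanism is not documented here.

```javascript
// Hypothetical runner: evaluates the expression with `output` in scope.
function runAssertion(value, output) {
  const fn = new Function("output", `return (${value});`);
  const result = fn(output);
  // A number is treated as a score; anything else is coerced to pass/fail.
  if (typeof result === "number") return { score: result };
  return { pass: Boolean(result) };
}

// Boolean style: the expression from the example above.
const toolCalls = [{ function: { name: "get_current_weather" } }];
console.log(runAssertion("output[0].function.name === 'get_current_weather'", toolCalls));
// { pass: true }

// Score style: a made-up expression that rewards longer answers, capped at 1.
console.log(runAssertion("Math.min(output.length / 100, 1)", "a".repeat(50)));
// { score: 0.5 }
```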

Latency

The latency assertion passes if the LLM call completes within the specified threshold and fails if it takes longer. The threshold is specified in milliseconds.

Example:

{
  "assert": [
    {
      "type": "latency",
      "threshold": 5000 // Fail if the LLM call takes longer than 5 seconds
    }
  ]
}


Levenshtein Distance

The levenshtein assertion checks if the LLM output is within a given edit distance from an expected value.

Example:

{
  "assert": [
    {
      "type": "levenshtein",
      "threshold": 5,
      "value": "hello world" // Ensure Levenshtein distance from "hello world" is <= 5
    }
  ]
}

value can reference other variables using template syntax. For example:

{
  "tests": [
    {
      "vars": {
        "expected": "foobar"
      },
      "assert": [
        {
          "type": "levenshtein",
          "threshold": 2,
          "value": "{{expected}}"
        }
      ]
    }
  ]
}
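
For reference, the Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a standard dynamic-programming implementation; the passesLevenshtein wrapper is a hypothetical helper that applies the "distance <= threshold" rule from the first example.

```javascript
// Standard two-row dynamic-programming edit distance.
function levenshtein(a, b) {
  // prev[j] = distance between the first i-1 characters of a and the first j of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // turning i characters into the empty string takes i deletions
    for (let j = 1; j <= b.length; j++) {
      const substitution = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, substitution);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical wrapper applying the threshold rule.
function passesLevenshtein(output, expected, threshold) {
  return levenshtein(output, expected) <= threshold;
}

console.log(levenshtein("kitten", "sitting"));                   // 3
console.log(passesLevenshtein("hello wurld", "hello world", 5)); // true (distance 1)
console.log(passesLevenshtein("foo", "foobar", 2));              // false (distance 3)
```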


Perplexity

Perplexity is a measurement used in natural language processing to quantify how well a language model predicts a sample of text. It's essentially a measure of the model's uncertainty. High perplexity suggests it is less certain about its predictions, often because the text is very diverse or the model is not well-tuned to the task at hand. Low perplexity means the model predicts the text with greater confidence, implying it's better at understanding and generating text similar to its training data.

To specify a perplexity threshold, use the perplexity assertion type:

{
  "assert": [
    {
      "type": "perplexity",
      "threshold": 1.5 // Fail if perplexity is above the threshold
    }
  ]
}

Warning: Perplexity requires the LLM API to output logprobs. Currently, only more recent versions of the OpenAI GPT APIs support this.
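
Perplexity is conventionally defined as the exponential of the negative mean token log-probability. The sketch below computes it from an array of natural-log token probabilities; whether the evaluator uses exactly this formula is an assumption, and the perplexity and passesPerplexity helpers are hypothetical.

```javascript
// Conventional definition: exp(-(1/N) * sum of token log-probabilities).
// `logprobs` is assumed to hold natural-log probabilities, one per output token.
function perplexity(logprobs) {
  if (logprobs.length === 0) throw new Error("need at least one token logprob");
  const mean = logprobs.reduce((sum, lp) => sum + lp, 0) / logprobs.length;
  return Math.exp(-mean);
}

// Hypothetical wrapper: pass when perplexity stays below the threshold.
function passesPerplexity(logprobs, threshold) {
  return perplexity(logprobs) < threshold;
}

// A model that assigns probability 0.9 to every token is fairly confident.
const confident = [Math.log(0.9), Math.log(0.9), Math.log(0.9)];
// A model that assigns probability 0.5 to every token is much less so.
const uncertain = [Math.log(0.5), Math.log(0.5), Math.log(0.5)];

console.log(passesPerplexity(confident, 1.5)); // true  (perplexity is about 1.11)
console.log(passesPerplexity(uncertain, 1.5)); // false (perplexity is about 2)
```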