Agent evaluations

Ensure the reliability of LLM agents

Agents are LLM apps designed to autonomously perform tasks, make decisions, and interact with users. An agent typically chains a sequence of prompts: each LLM call executes, and its result feeds into the next prompt. Problems can arise from unreliable reasoning, inconsistent data flow, and frequent hallucinations.


Issues with agents

  • Unreliable Reasoning: LLMs often struggle to make accurate and consistent decisions. For example, an AI customer support agent might misunderstand a user's request for a refund, offering a discount instead.
  • Inconsistent Data Flow: Managing data flow in complex applications can be challenging. For example, a medical diagnosis AI might return a recommendation on some runs and nothing at all on others.
  • Frequent Hallucinations: LLMs sometimes generate false or irrelevant information, known as hallucinations. For example, an AI writing assistant might invent fictitious historical events that mislead users.

Evaluate agents

  • Unit Testing: Break the AI agent into its individual components and test each one in isolation, so hallucinations and false outputs are caught at the step where they occur.
  • End-to-End Testing: Assess the entire system's performance with A/B tests and API evaluations.
  • Structured Data Flow Testing: Implement structured evaluations to ensure LLMs produce consistent, correctly formatted outputs. See JSON evals, and the sketch after this list, for examples.
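
For structured data flow testing, here is a minimal sketch in the same config format as the examples below. The prompt, provider settings, and test data are illustrative assumptions, and it assumes your evaluation runner supports a promptfoo-style is-json assertion, which checks that the output parses as valid JSON matching a schema.

{
  "name": "Structured Output Evaluation",
  "prompts": [
    "Extract the person's name and employer from this text. Respond with JSON containing the keys \"name\" and \"employer\": {{body}}"
  ],
  "providers": [
    {
      "id": "openai:gpt-4o-mini",
      "config": {
        "temperature": 0.1,
        "max_tokens": 256
      }
    }
  ],
  "tests": [
    {
      "vars": {
        "body": "Jane Doe is a staff engineer at ExampleCorp in Austin, Texas."
      },
      "assert": [
        {
          "type": "is-json",
          "value": {
            "type": "object",
            "required": ["name", "employer"]
          }
        }
      ]
    }
  ]
}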

See this [Agent experiment example](https://www.usevelvet.com/dashboard/velvet-demo/sandbox-logs/evaluations/exp_01jef0n8t27tdj6akrevtas710) in the Velvet demo space.


Example

Evaluate AI agents using two main approaches: unit testing and end-to-end testing.


Unit Testing AI Agents

Unit testing involves breaking down the AI agent into its smallest functional components and testing each one individually.

To test one unit at a time, break the AI agent into its individual steps and use a basic configuration to evaluate each step. The configuration below tests a single step that infers criteria from LinkedIn profile text.

{
  "name": "AI Agent Single Step Evaluation",
  "description": "This experiment tests prompt responses on a single step of an AI agent that infers a list of criteria from LinkedIn profile texts.",
  "prompts": [
    [
      {
        "role": "system",
        "content": "You are a helpful assistant"
      },
      {
        "role": "user",
        "content": "Infer a list of criteria for this person such as works at a startup, is a founder, HQ location, etc: {{body}}"
      }
    ]
  ],
  "providers": [
    {
      "id": "openai:gpt-3.5-turbo-0125",
      "config": {
        "max_tokens": 1024,
        "temperature": 0.1
      }
    },
    {
      "id": "openai:gpt-4o-mini",
      "config": {
        "organization": "",
        "temperature": 0.1,
        "max_tokens": 1024,
        "top_p": 1,
        "frequency_penalty": 0,
        "presence_penalty": 0
      }
    }
  ],
  "tests": [
    {
      "vars": {
        "body": "Andrew Ng. Founder of DeepLearning.AI; Managing General Partner of AI Fund; Exec Chairman of Landing AI. HQ in New York Area."
      },
      "assert": [
        {
          "type": "factuality",
          "value": "Is a founder"
        },
        {
          "type": "model-graded-closedqa",
          "value": "Mentions work location"
        }
      ]
    },
    {
      "vars": {
        "body": "Mark Hanover. Executive Chef at Canva. Located at The Rocks, New South Wales, Australia."
      },
      "assert": [
        {
          "type": "factuality",
          "value": "Is not a founder"
        },
        {
          "type": "model-graded-closedqa",
          "value": "Mentions work location"
        }
      ]
    }
  ]
}
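
This configuration runs the same tests against two providers (gpt-3.5-turbo-0125 and gpt-4o-mini), so you can compare models on an identical step. Both assertions are model-graded: factuality and model-graded-closedqa ask a grader LLM to check the response against the expected claim, so results depend on the grading model as well as the model under test. The fields appear to follow the open-source promptfoo schema; if that holds and you keep a copy locally (agent-unit-eval.json is a hypothetical filename), you can also run it with `npx promptfoo eval -c agent-unit-eval.json`.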

See API configurations for additional evaluation examples.


End-to-End Testing for AI Agents

End-to-end testing assesses the AI agent's overall performance from start to finish. This approach is crucial for understanding how well the entire system functions when all components are integrated.

To test the entire AI agent end to end, expose an API endpoint that accepts an input and returns your AI agent's final output. Below is an example configuration that tests the agent by sending questions to that endpoint.

{  
  "name": "AI Agent End to End Evaluation",
  "prompts": [  
    "{{question}}"  
  ],  
  "providers": [  
    {  
      "id": "https", 
      "config": {  
        "url": "https://example-ai-agent.com/process",  
        "method": "POST",
        "headers": {  
          "Content-Type": "application/json"  
        },
        "body": {  
          "question": "{{question}}"
        }
      }  
    }  
  ], 
  "tests": [
    {  
      "vars": {  
        "question": "Who are the top AI startup founders in United States with an in-house chef in the startup?"
      },  
      "assert": [
        {  
          "type": "contains-all",  
          "value": [  
            "Melanie Perkins",  
            "Sam Altman"
          ]  
        }  
      ]  
    }  
  ]  
}  
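
The end-to-end config assumes your agent is reachable over HTTP at the configured url. Below is a minimal sketch of such an endpoint, assuming a Node.js/TypeScript runtime: the /process route, the port, the runAgent function, and the { "output": ... } response shape are illustrative assumptions, and your evaluation runner may expect a different response format.

// Minimal HTTP wrapper around an agent, for end-to-end evaluation.
// runAgent is a placeholder for your full pipeline (prompt chain, tools, etc.).
import { createServer } from "node:http";

async function runAgent(question: string): Promise<string> {
  // Replace with your agent's real prompt chain and tool calls.
  return `Stub answer for: ${question}`;
}

createServer((req, res) => {
  if (req.method === "POST" && req.url === "/process") {
    let body = "";
    req.on("data", (chunk) => (body += chunk));
    req.on("end", async () => {
      // The eval config above posts { "question": "..." } as JSON.
      const { question } = JSON.parse(body);
      const answer = await runAgent(question);
      res.writeHead(200, { "Content-Type": "application/json" });
      res.end(JSON.stringify({ output: answer }));
    });
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(3000); // serve locally and point the eval config at this host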

For more on evaluation metric configurations, see our reference guides.


Summary

Use evaluations to reliably test AI agents before and after they reach production. Guard against unreliable reasoning, inconsistent data flow, and hallucinations. Implement testing to increase confidence in your AI agents and improve them over time.