JSON evaluations

Trusting LLMs to output valid JSON can be tricky. Models often return inconsistent schemas, hallucinate values, and struggle with dynamic input data. This doc outlines an approach for evaluating JSON from LLMs so that the outputs are both accurate and reliable.

Issues with JSON outputs

Inconsistent Schema: JSON may be malformed or incomplete.
Hallucination: LLMs often hallucinate numbers and values.
Dynamic Data: LLMs struggle with adapting to dynamic or variable input data.

Evaluate JSON outputs

Validate JSON Structure: Ensure the generated JSON adheres to a predefined schema.
Compare Values: Assess key and value accuracy against expected criteria.
Measure Output Relevancy: Evaluate the relevancy of JSON responses with an LLM judge.

Example

Imagine our language model outputs this JSON object:

{
  "gender": "Male",
  "industries": ["AI", "Software", "Big Data"]
}

To ensure fields like gender and industries are correct, we need to create assertions.

Basic JSON Validation

Use the is-json assertion to check whether the output is valid JSON:

{
  "type": "is-json"
}
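As a rough mental model, a bare is-json check behaves like a parse attempt that either succeeds or throws. The sketch below illustrates that idea in plain JavaScript; the `isJson` helper is a hypothetical name used for illustration, not Velvet's implementation.

```javascript
// Hypothetical helper: returns true when `output` parses as JSON, false otherwise.
function isJson(output) {
  try {
    JSON.parse(output);
    return true;
  } catch (err) {
    return false;
  }
}

const good = '{"gender": "Male", "industries": ["AI", "Software", "Big Data"]}';
const bad = '{"gender": "Male", "industries": ["AI", "Software"'; // truncated output

console.log(isJson(good)); // true
console.log(isJson(bad)); // false
```

Note that a parse-only check also accepts bare values such as `42` or `"text"`, which is one reason to pair it with a schema.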

Schema Validation

To validate the JSON structure, define a schema:

{
  "type": "is-json",
  "value": {
    "required": ["gender", "industries"],
    "type": "object",
    "properties": {
      "gender": {
        "type": "string"
      },
      "industries": {
        "type": "array",
        "items": {
          "type": "string"
        }
      }
    }
  }
}

This schema validates that the output is valid JSON and includes all required fields with the right data types.
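To make the schema's rules concrete, here is a minimal hand-written check that mirrors them: the root is an object, both keys are present, gender is a string, and industries is an array of strings. The `matchesPersonSchema` function is a hypothetical illustration and covers only the keywords used above; a full JSON Schema validator handles many more.

```javascript
// Hypothetical, hand-written check mirroring the schema above.
// It is a sketch of the rules, not a general JSON Schema validator.
function matchesPersonSchema(output) {
  let data;
  try {
    data = JSON.parse(output);
  } catch (err) {
    return { ok: false, errors: ["output is not valid JSON"] };
  }

  // "type": "object" (null and arrays do not count as objects here)
  if (data === null || typeof data !== "object" || Array.isArray(data)) {
    return { ok: false, errors: ["root must be an object"] };
  }

  const errors = [];
  const has = (key) => Object.prototype.hasOwnProperty.call(data, key);

  // "required": ["gender", "industries"]
  for (const key of ["gender", "industries"]) {
    if (!has(key)) errors.push(`missing required key: ${key}`);
  }

  // "gender": { "type": "string" }
  if (has("gender") && typeof data.gender !== "string") {
    errors.push("gender must be a string");
  }

  // "industries": { "type": "array", "items": { "type": "string" } }
  if (has("industries")) {
    if (!Array.isArray(data.industries)) {
      errors.push("industries must be an array");
    } else if (!data.industries.every((item) => typeof item === "string")) {
      errors.push("every item in industries must be a string");
    }
  }

  return { ok: errors.length === 0, errors };
}

const valid = matchesPersonSchema('{"gender": "Male", "industries": ["AI"]}');
const wrongType = matchesPersonSchema('{"gender": "Male", "industries": "AI"}');
console.log(valid.ok, wrongType.errors);
```

Collecting every error instead of stopping at the first makes failed evaluations easier to debug.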

Compare Values

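Use javascript assertions to write custom code for more advanced checks against the output values. The comparison below lowercases the value first, so an output of "Male" (as in the example object) still passes:

{
  "type": "javascript",
  "value": "JSON.parse(output).gender.toLowerCase() === 'male'"
}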
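Model outputs often vary in casing and whitespace ("Male", "male", " MALE "), so value comparisons tend to be more robust when both sides are normalized first. The sketch below shows one way to write such a check as a standalone function. The name `genderMatches` is hypothetical, and `output` stands for the raw model response string that the assertion expression refers to.

```javascript
// Hypothetical helper; `output` is the raw string returned by the model.
function genderMatches(output, expected) {
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch (err) {
    return false; // unparseable output cannot match
  }
  if (parsed === null || typeof parsed !== "object") return false;

  const gender = parsed.gender;
  if (typeof gender !== "string") return false;

  // Normalize casing and surrounding whitespace on both sides before comparing.
  return gender.trim().toLowerCase() === expected.trim().toLowerCase();
}

console.log(genderMatches('{"gender": "Male"}', "male")); // true
console.log(genderMatches('{"gender": "Female"}', "male")); // false
```

Returning false for missing or non-string values keeps the check from throwing and lets the evaluation report a clean failure.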

Measure Output Relevancy with an LLM

Velvet supports LLM-based assertions such as model-graded-closedqa and llm-rubric. To use one, add the transform directive to preprocess the output, then write a prompt that grades the relevancy of the value:

{
  "transform": "JSON.parse(output).industries",
  "type": "model-graded-closedqa",
  "value": "contains an industry similar to AI"
}
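The effect of the transform expression can be pictured as narrowing the raw output to just the field being graded, so the judge sees the industries list rather than the whole object. The sketch below evaluates the same expression by hand; `applyTransform` is a hypothetical stand-in, and how Velvet passes the result to the grading model is not shown here.

```javascript
// Hypothetical stand-in for the transform step: evaluates the expression
// JSON.parse(output).industries against the raw model output.
function applyTransform(output) {
  return JSON.parse(output).industries;
}

const output = '{"gender": "Male", "industries": ["AI", "Software", "Big Data"]}';
const graded = applyTransform(output);

// Only this narrowed value would be judged against
// "contains an industry similar to AI".
console.log(graded); // logs the three-element industries array
```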

Evaluation Configuration

Here's the complete example configuration for evaluating JSON outputs from LLMs:

{
  "name": "JSON evaluation",
  "description": "Evaluate JSON outputs for schema validity and accuracy.",
  "prompts": [
    "Output a JSON object that contains the keys `gender` and `industries`, describing the following person: {{input}}"
  ],
  "providers": [
    {
      "id": "openai:gpt-3.5-turbo-0125",
      "config": {
        "response_format": { "type": "json_object" }
      }
    },
    {
      "id": "openai:gpt-4o-mini",
      "config": {
        "response_format": { "type": "json_object" }
      }
    }
  ],
  "tests": [
    {
      "vars": {
        "input": [
          "Andrew Ng. Founder of DeepLearning.AI; Managing General Partner of AI Fund; Exec Chairman of Landing AI.",
          "Ian Goodfellow. Research Scientist. I'm an industry leader in machine learning.",
          "Ilya Sutskever. Co-Founder and Chief Scientist at Safe Superintelligence Inc."
        ]
      },
      "assert": [
        {
          "type": "is-json",
          "value": {
            "required": ["gender", "industries"],
            "type": "object",
            "properties": {
              "gender": {
                "type": "string"
              },
              "industries": {
                "type": "array",
                "items": {
                  "type": "string"
                }
              }
            }
          }
        },
        {
          "type": "javascript",
          "value": "JSON.parse(output).gender.toLowerCase() === 'male'"
        },
        {
          "transform": "JSON.parse(output).industries",
          "type": "model-graded-closedqa",
          "value": "contains an industry similar to AI"
        }
      ]
    }
  ]
}
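The configuration relies on the `{{input}}` placeholder in the prompt being filled from the test variables. The sketch below shows that substitution under one assumption that the configuration itself does not confirm: each entry in the `input` array is rendered into its own prompt. Both `renderPrompt` and `expandInputs` are hypothetical helpers written for illustration.

```javascript
// Hypothetical renderer: replaces {{name}} placeholders with values from `vars`.
// Unknown placeholders are left untouched.
function renderPrompt(template, vars) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? String(vars[name]) : match
  );
}

// Assumption: each entry in the `input` array becomes its own rendered prompt.
function expandInputs(template, inputs) {
  return inputs.map((input) => renderPrompt(template, { input }));
}

const template =
  "Output a JSON object that contains the keys `gender` and `industries`, describing the following person: {{input}}";

const prompts = expandInputs(template, [
  "Ian Goodfellow. Research Scientist. I'm an industry leader in machine learning.",
  "Ilya Sutskever. Co-Founder and Chief Scientist at Safe Superintelligence Inc.",
]);

for (const prompt of prompts) {
  console.log(prompt);
}
```

Under this assumption, every rendered prompt would be sent to each provider and checked against the same list of assertions.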

Summary

By using structured JSON-based evaluations, including schema validation and custom assertions, you can verify that LLM outputs meet your required standards. This process detects potential issues and improves the reliability of your LLM app.