Reference guides
Key components for configuring experiments
This page contains a complete library of configuration references to set up an evaluation with Velvet.
- Parameters: Customize every detail of your experiment, from simple parameters to complex configurations.
- Metrics: Measure your model's performance with pre-defined metrics.
- Examples: Ready-to-use configuration code that can be copied into your own experiments.
Use Cases
- Factual Accuracy: Assess the accuracy and truthfulness of model outputs to ensure they provide reliable information.
- JSON Outputs: Assess the structure of JSON outputs generated by models to ensure they meet the required format and standards.
- RAG Pipelines: Test retrieval-augmented generation (RAG) pipelines for effectiveness in integrating external knowledge into model outputs.
- OpenAI Assistants: Test performance and effectiveness of OpenAI's assistant models.
- Prevent Hallucinations: Test strategies to reduce or eliminate hallucinations in model outputs.
- Safety in LLM Apps: Conduct sandboxed evaluations for LLM applications to identify vulnerabilities.
- Benchmark Models: Benchmark language models on dimensions such as latency and cost to determine their strengths, weaknesses, and optimal use cases.
- Compare Model Configurations: Determine the best model for each feature, and optimize model output quality by selecting the appropriate settings.
Configuration
A configuration represents a single experiment run.
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Name of your experiment |
| description | string | No | Description of your experiment |
| providers | string[] \| Provider[] | Yes | One or more LLMs to use |
| prompts | string[] | Yes | One or more prompts to load |
| tests | string \| Evaluation[] | Yes | List of LLM inputs and evaluation metrics, or the path to a Google Sheet share link |
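For example, a complete configuration might look like the sketch below. The experiment name, model names, prompt path, and Google Sheet link are all placeholders, and the prompt entry is shown as a file path purely for illustration:

```typescript
// A minimal sketch of a configuration object; every value is a placeholder.
const config = {
  name: 'factual-accuracy-eval',
  description: 'Compare two OpenAI models on a factual Q&A prompt',
  providers: ['openai:gpt-4o-mini', 'openai:gpt-4o'],
  prompts: ['prompts/qa_prompt.txt'],
  // tests may also be an inline list of Evaluation objects (see below);
  // here a Google Sheet share link is used instead.
  tests: 'https://docs.google.com/spreadsheets/d/EXAMPLE_SHEET_ID/edit',
};
```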
Provider
Provider is an object that includes the `id` of the provider and an optional `config` object that can be used to pass provider-specific configurations.
```typescript
interface Provider {
  id?: ProviderId; // e.g. "openai:gpt-4o-mini"
  config?: ProviderConfig;
}
```
Velvet supports the following models:
- `openai:<model name>` - uses a specific model name (mapped automatically to the chat or completion endpoint)
- `openai:embeddings:<model name>` - uses any model name against the `/v1/embeddings` endpoint
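For example, both forms can be listed directly as provider ID strings; the specific model names below are placeholders:

```typescript
// Provider IDs given as plain strings (model names are placeholders).
const providers: string[] = [
  'openai:gpt-4o-mini',                       // chat/completion model
  'openai:embeddings:text-embedding-3-small', // embeddings endpoint
];
```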
Here are the optional `config` parameters:
```typescript
interface ProviderConfig {
  // Completion parameters
  temperature?: number;
  max_tokens?: number;
  top_p?: number;
  frequency_penalty?: number;
  presence_penalty?: number;
  best_of?: number;

  // Function calling and tool use
  functions?: OpenAiFunction[];
  function_call?: 'none' | 'auto' | { name: string };
  tools?: OpenAiTool[];
  tool_choice?: 'none' | 'auto' | 'required' | { type: 'function'; function?: { name: string } };

  // Output format, stop sequences, and random seed
  response_format?: { type: 'json_object' | 'json_schema'; json_schema?: object };
  stop?: string[];
  seed?: number;

  // Extra parameters passed through to the provider request
  passthrough?: object;

  // Callbacks keyed by function name that return the function result as a string
  functionToolCallbacks?: Record<
    OpenAI.FunctionDefinition['name'],
    (arg: string) => Promise<string>
  >;

  // Authentication and endpoint settings
  apiKey?: string;
  apiKeyEnvar?: string;
  apiHost?: string;
  apiBaseUrl?: string;
  organization?: string;
  headers?: { [key: string]: string };
}
```
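As an illustration, a provider entry with provider-specific settings might look like the following sketch, following the Provider interface above. The model name and parameter values are arbitrary, and the environment variable name is an assumption:

```typescript
// A sketch of a Provider with a config block; values are illustrative only.
const provider = {
  id: 'openai:gpt-4o-mini',
  config: {
    temperature: 0.2,                         // favor more deterministic output
    max_tokens: 512,
    response_format: { type: 'json_object' }, // request JSON output
    apiKeyEnvar: 'OPENAI_API_KEY',            // assumed: read the key from this env var
  },
};
```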
Evaluation
An evaluation represents a single set of inputs and evaluation metrics that is fed into all prompts and providers.
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| vars | Record<string, string \| string[] \| any> \| string | No | Key-value pairs to substitute in the prompt. If vars is a plain string, it can be used to load vars from a SQL query against your Velvet DB |
| assert | Assertion[] | No | List of evaluation checks to run on the LLM output |
| threshold | number | No | The test will fail if the combined score of the assertions is less than this number |
| options.transform | string | No | A JavaScript snippet that runs on the LLM output before any assertions |
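Putting these properties together, a single evaluation might look like the sketch below, following the table above. The variable, expected answer, and scores are invented for the example, and the transform snippet assumes the LLM output is exposed as an `output` variable:

```typescript
// A sketch of one Evaluation (test case); values are illustrative only.
const evaluation = {
  vars: {
    question: 'What is the capital of France?', // substituted into the prompt
  },
  assert: [
    { type: 'similar', value: 'Paris', threshold: 0.8 },
  ],
  threshold: 0.75, // fail if the combined assertion score is below this
  options: {
    // Assumed: the snippet can reference the raw LLM output as `output`
    transform: 'output.trim()',
  },
};
```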
Assertion
An assertion is an evaluation that compares the LLM output against expected values or conditions. Different types of assertions can be used to validate the output in various ways, such as checking for equality, similarity, or custom functions.
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Type of assertion |
| value | string | No | The expected value, if applicable |
| threshold | number | No | The threshold value, applicable only to certain types such as similar, cost, and javascript |
| metric | string | No | The label for this result. Assertions with the same metric will be aggregated together |
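For example, assertions using the types mentioned above might be written as follows; the expected value, thresholds, metric labels, and the cost unit (assumed to be dollars) are illustrative:

```typescript
// A sketch of assertions; values, thresholds, and metrics are illustrative.
const assertions = [
  // Semantic similarity against an expected answer, with a minimum score
  { type: 'similar', value: 'The capital of France is Paris.', threshold: 0.8, metric: 'accuracy' },
  // Fail if the call costs more than the threshold (unit assumed to be dollars)
  { type: 'cost', threshold: 0.01, metric: 'efficiency' },
];
```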
See examples of Deterministic evaluation assertions and LLM based evaluation assertions.