Reference guides
Key components for configuring experiments
This page contains a complete library of configuration references to set up an evaluation with Velvet.
- Parameters: Customize every detail of your experiment from parameters to complex configs.
- Metrics: Measure your model's performance with pre-defined metrics.
- Examples: Ready-to-use configuration code that can be copied into your own experiments.
Use Cases
- Factual Accuracy: Assess the accuracy and truthfulness of model outputs to ensure they provide reliable information.
- JSON Outputs: Assess the structure of JSON outputs generated by models to ensure they meet the required format and standards.
- RAG Pipelines: Test retrieval-augmented generation (RAG) pipelines for effectiveness in integrating external knowledge into model outputs.
- OpenAI Assistants: Test performance and effectiveness of OpenAI's assistant models.
- Prevent Hallucinations: Test strategies to reduce or eliminate hallucinations in model outputs.
- Safety in LLM Apps: Conduct sandboxed evaluations for LLM applications to identify vulnerabilities.
- Benchmark Models: Benchmark language models on latency, cost, and other performance measures to determine their strengths, weaknesses, and optimal use cases.
- Compare Model Configurations: Determine the best model for each feature, and optimize model output quality by selecting the appropriate settings.
Configuration
A configuration represents an experiment that is run.
Property | Type | Required | Description |
---|---|---|---|
name | string | Yes | Name of your experiment |
description | string | No | Description of your experiment |
providers | string[] \| Provider[] | Yes | One or more LLMs to use |
prompts | string[] | Yes | One or more prompts to load |
tests | string \| Evaluation[] | Yes | List of LLM inputs and evaluation metrics, or the path to a Google Sheet share link |
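As an illustration of how these properties fit together, here is a minimal sketch of a configuration object. The `ConfigurationSketch`, `ProviderSketch`, and `EvaluationSketch` types and the `missingRequiredFields` helper are hypothetical names written for this example to mirror the table; they are not Velvet's own types. The `{{country}}` placeholder syntax and all values are likewise assumptions.

```typescript
// Hypothetical types mirroring the Configuration table; Velvet's real types may differ.
interface ProviderSketch {
  id?: string; // e.g. "openai:gpt-4o-mini"
  config?: Record<string, unknown>;
}

interface EvaluationSketch {
  vars?: Record<string, unknown> | string;
  assert?: { type: string; value?: string; threshold?: number; metric?: string }[];
  threshold?: number;
}

interface ConfigurationSketch {
  name: string;
  description?: string;
  providers: string[] | ProviderSketch[];
  prompts: string[];
  tests: string | EvaluationSketch[];
}

// Illustrative experiment: one provider, one prompt, one evaluation.
// The {{country}} placeholder syntax is an assumption, not confirmed by this page.
const exampleConfig: ConfigurationSketch = {
  name: "capital-cities",
  description: "Checks that the model names the right capital",
  providers: ["openai:gpt-4o-mini"],
  prompts: ["What is the capital of {{country}}?"],
  tests: [
    {
      vars: { country: "France" },
      assert: [{ type: "similar", value: "The capital of France is Paris.", threshold: 0.8 }],
    },
  ],
};

// Returns the required properties that are absent or empty arrays.
function missingRequiredFields(config: Partial<ConfigurationSketch>): string[] {
  const required: (keyof ConfigurationSketch)[] = ["name", "providers", "prompts", "tests"];
  return required.filter((key) => {
    const value = config[key];
    return value === undefined || (Array.isArray(value) && value.length === 0);
  });
}
```

Calling `missingRequiredFields(exampleConfig)` returns an empty array, since all four required properties are present.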
Provider
Provider is an object that includes the id of the provider and an optional config object that can be used to pass provider-specific configurations.
interface Provider {
  id?: ProviderId; // e.g. "openai:gpt-4o-mini"
  config?: ProviderConfig;
}
Velvet supports the following models:
- openai:<model name> - uses a specific model name (mapped automatically to the chat or completion endpoint)
- openai:embeddings:<model name> - uses any model name against the /v1/embeddings endpoint
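The two identifier formats can be told apart with a small parser. The sketch below is illustrative only: `parseProviderId` and `ParsedProviderId` are hypothetical names, and the actual way Velvet resolves identifiers is not described on this page.

```typescript
// Hypothetical result shape for a parsed provider id.
type ParsedProviderId = {
  vendor: "openai";
  kind: "default" | "embeddings";
  model: string;
};

// Splits "openai:<model name>" or "openai:embeddings:<model name>" into parts.
// Everything after the prefix is kept as the model name, so model names that
// themselves contain colons are preserved.
function parseProviderId(id: string): ParsedProviderId {
  const parts = id.split(":");
  if (parts.length < 2 || parts[0] !== "openai") {
    throw new Error(`Unsupported provider id: ${id}`);
  }
  const isEmbeddings = parts[1] === "embeddings";
  const model = parts.slice(isEmbeddings ? 2 : 1).join(":");
  if (model === "") {
    throw new Error(`Missing model name in provider id: ${id}`);
  }
  return { vendor: "openai", kind: isEmbeddings ? "embeddings" : "default", model };
}
```

For example, `parseProviderId("openai:gpt-4o-mini")` yields kind `"default"` with model `"gpt-4o-mini"`, while an id without the `openai:` prefix throws.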
Here are the optional config parameters:
interface ProviderConfig {
  // Completion parameters
  temperature?: number;
  max_tokens?: number;
  top_p?: number;
  frequency_penalty?: number;
  presence_penalty?: number;
  best_of?: number;
  functions?: OpenAiFunction[];
  function_call?: 'none' | 'auto' | { name: string };
  tools?: OpenAiTool[];
  tool_choice?: 'none' | 'auto' | 'required' | { type: 'function'; function?: { name: string } };
  response_format?: { type: 'json_object' | 'json_schema'; json_schema?: object };
  stop?: string[];
  seed?: number;
  passthrough?: object;
  functionToolCallbacks?: Record<
    OpenAI.FunctionDefinition['name'],
    (arg: string) => Promise<string>
  >;
  apiKey?: string;
  apiKeyEnvar?: string;
  apiHost?: string;
  apiBaseUrl?: string;
  organization?: string;
  headers?: { [key: string]: string };
}
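To show how these parameters might be used when comparing model configurations, the sketch below declares the same model twice with different settings. `ProviderConfigSubset` and `ProviderEntry` are hypothetical types that copy only a few fields from the interface above, and the specific values (temperature, seed, token limit) are illustrative assumptions rather than recommendations.

```typescript
// Subset of the ProviderConfig fields listed above; other fields are omitted for brevity.
interface ProviderConfigSubset {
  temperature?: number;
  max_tokens?: number;
  seed?: number;
  stop?: string[];
  response_format?: { type: "json_object" | "json_schema"; json_schema?: object };
}

// Hypothetical stand-in for the Provider interface.
interface ProviderEntry {
  id?: string;
  config?: ProviderConfigSubset;
}

// Same model, two settings: a low-temperature JSON variant and a higher-temperature one.
const comparedProviders: ProviderEntry[] = [
  {
    id: "openai:gpt-4o-mini",
    config: {
      temperature: 0,
      max_tokens: 512,
      seed: 42, // fixed seed, intended to make runs more repeatable
      response_format: { type: "json_object" },
    },
  },
  {
    id: "openai:gpt-4o-mini",
    config: { temperature: 0.9, max_tokens: 512 },
  },
];
```

Because both entries share an id and differ only in `config`, any difference in evaluation results can be attributed to the settings rather than the model.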
Evaluation
An evaluation represents a single set of inputs and evaluation metrics that is fed into all prompts and providers.
Property | Type | Required | Description |
---|---|---|---|
vars | Record<string, string \| string[] \| any> \| string | No | Key-value pairs to substitute in the prompt. If vars is a plain string, it can be used to load vars from a SQL query to your Velvet DB. |
assert | Assertion[] | No | List of evaluation checks to run on the LLM output |
threshold | number | No | Test will fail if the combined score of assertions is less than this number |
options.transform | string | No | A JavaScript snippet that runs on LLM output before any assertions |
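The order of operations implied by this table (substitute vars, transform the output, run assertions, compare against the threshold) can be sketched as follows. Several details are assumptions made for illustration: the `{{name}}` placeholder syntax, modelling the transform as a function rather than a JavaScript string, and combining scores with a plain mean. `renderPrompt` and `evaluateOutput` are hypothetical helpers, not Velvet functions.

```typescript
// Replaces {{key}} placeholders with values from vars (placeholder syntax is assumed).
// Unknown placeholders are left untouched.
function renderPrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, key: string) => {
    const value = Object.prototype.hasOwnProperty.call(vars, key) ? vars[key] : undefined;
    return value ?? match;
  });
}

// An assertion check maps the (transformed) output to a score between 0 and 1.
type AssertionCheck = (output: string) => number;

// Applies the optional transform first, then scores every check and compares
// the mean score (an assumed combination rule) against the threshold.
function evaluateOutput(
  rawOutput: string,
  checks: AssertionCheck[],
  threshold: number,
  transform?: (output: string) => string
): { output: string; combined: number; pass: boolean } {
  const output = transform ? transform(rawOutput) : rawOutput;
  const scores = checks.map((check) => check(output));
  const combined =
    scores.length === 0 ? 1 : scores.reduce((sum, score) => sum + score, 0) / scores.length;
  // The test fails only when the combined score is less than the threshold.
  return { output, combined, pass: combined >= threshold };
}
```

With two checks where one scores 1 and the other 0, the combined score under this rule is 0.5, so the evaluation passes at a threshold of 0.5 and fails at 0.75.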
Assertion
An assertion is an evaluation that compares the LLM output against expected values or conditions. Different types of assertions can be used to validate the output in various ways, such as checking for equality, similarity, or custom functions.
Property | Type | Required | Description |
---|---|---|---|
type | string | Yes | Type of assertion |
value | string | No | The expected value, if applicable |
threshold | number | No | The threshold value, applicable only to certain types such as similar, cost, javascript |
metric | string | No | The label for this result. Assertions with the same metric will be aggregated together |
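To illustrate what aggregating assertions by metric could look like, the sketch below groups results by their metric label and averages the scores in each group. The `AssertionResultSketch` shape, the `aggregateByMetric` helper, and the choice of a mean are all assumptions made for this example; this page does not specify how Velvet combines results that share a metric.

```typescript
// Hypothetical result of running one assertion.
interface AssertionResultSketch {
  type: string;
  metric?: string;
  score: number; // assumed to be between 0 and 1
}

// Groups results by metric label and returns the mean score per label.
// Results without a metric are skipped.
function aggregateByMetric(results: AssertionResultSketch[]): Record<string, number> {
  const grouped: Record<string, number[]> = {};
  for (const result of results) {
    if (result.metric === undefined) continue;
    const bucket = grouped[result.metric] ?? [];
    bucket.push(result.score);
    grouped[result.metric] = bucket;
  }
  const averages: Record<string, number> = {};
  for (const metric of Object.keys(grouped)) {
    const scores = grouped[metric] ?? [];
    averages[metric] = scores.reduce((sum, score) => sum + score, 0) / scores.length;
  }
  return averages;
}

const sampleResults: AssertionResultSketch[] = [
  { type: "similar", metric: "accuracy", score: 1 },
  { type: "javascript", metric: "accuracy", score: 0.5 },
  { type: "cost", metric: "efficiency", score: 1 },
  { type: "similar", score: 0 }, // no metric label, so it is not aggregated
];
const metricAverages = aggregateByMetric(sampleResults);
```

Here the two assertions labelled `accuracy` are reported together as a single averaged score, while `efficiency` stays separate.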
See examples of Deterministic evaluation assertions and LLM-based evaluation assertions.