Reference guides

Key components for configuring experiments

This page is a complete reference for the configuration options you can use to set up an evaluation with Velvet.

  • Parameters: Customize every detail of your experiment, from simple parameters to complex configs.
  • Metrics: Measure your model's performance with pre-defined metrics.
  • Examples: Ready-to-use configuration code that can be copied into your own experiments.

Use Cases

  • Factual Accuracy: Assess the accuracy and truthfulness of model outputs to ensure they provide reliable information.
  • JSON Outputs: Assess the structure of JSON outputs generated by models to ensure they meet the required format and standards.
  • RAG Pipelines: Test retrieval-augmented generation (RAG) pipelines for effectiveness in integrating external knowledge into model outputs.
  • OpenAI Assistants: Test performance and effectiveness of OpenAI's assistant models.
  • Prevent Hallucinations: Test strategies to reduce or eliminate hallucinations in model outputs.
  • Safety in LLM Apps: Conduct sandboxed evaluations for LLM applications to identify vulnerabilities.
  • Benchmark Models: Benchmark language models on latency, cost, and other metrics to determine their strengths, weaknesses, and optimal use cases.
  • Compare Model Configurations: Determine the best model for each feature, and optimize model output quality by selecting the appropriate settings.

Configuration

A configuration defines a single experiment run.

  • name (string, required): Name of your experiment
  • description (string, optional): Description of your experiment
  • providers (string[] | Provider[], required): One or more LLMs to use
  • prompts (string[], required): One or more prompts to load
  • tests (string | Evaluation[], required): List of LLM inputs and evaluation metrics, or a path to a Google Sheet share link
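
For example, a minimal configuration object might look like the sketch below. The experiment name, prompt text, {{question}} placeholder syntax, variable values, and assertion are illustrative, not required values.

const config = {
  name: 'moon-landing-factual-check',
  description: 'Compare two models on a small factual QA prompt',
  providers: ['openai:gpt-4o-mini', 'openai:gpt-4o'],
  prompts: ['Answer concisely: {{question}}'],
  tests: [
    {
      vars: { question: 'In what year did the first crewed moon landing take place?' },
      assert: [
        { type: 'similar', value: 'The first crewed moon landing was in 1969.', threshold: 0.8 },
      ],
    },
  ],
};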

Provider

A Provider is an object with the id of the provider and an optional config object used to pass provider-specific settings.

interface Provider {
  id?: ProviderId; // e.g. "openai:gpt-4o-mini"
  config?: ProviderConfig; 
}

Velvet supports the following provider formats:

  • openai:<model name> - uses a specific model name (mapped automatically to chat or completion endpoint)
  • openai:embeddings:<model name> - uses any model name against the /v1/embeddings endpoint
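
As a sketch, a provider can therefore be passed either as a bare id string or as an object with an id plus config overrides; the model name and settings below are illustrative.

// Shorthand: just the provider id
const providerById = 'openai:gpt-4o-mini';

// Object form with provider-specific config overrides
const providerWithConfig = {
  id: 'openai:gpt-4o-mini',
  config: {
    temperature: 0,   // deterministic outputs make eval runs easier to reproduce
    max_tokens: 256,
  },
};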

Here are the optional config parameters:

interface ProviderConfig {
  // Completion parameters
  temperature?: number;
  max_tokens?: number;
  top_p?: number;
  frequency_penalty?: number;
  presence_penalty?: number;
  best_of?: number;
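
  // Function and tool calling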
  functions?: OpenAiFunction[];
  function_call?: 'none' | 'auto' | { name: string };
  tools?: OpenAiTool[];
  tool_choice?: 'none' | 'auto' | 'required' | { type: 'function'; function?: { name: string } };
  response_format?: { type: 'json_object' | 'json_schema'; json_schema?: object };
  stop?: string[];
  seed?: number;
  passthrough?: object;
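
  // Optional callbacks to run when the model calls a matching function or tool,
  // keyed by function name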
  functionToolCallbacks?: Record<
    OpenAI.FunctionDefinition['name'],
    (arg: string) => Promise<string>
  >;
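
  // Authentication and endpoint settings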
  apiKey?: string;
  apiKeyEnvar?: string;
  apiHost?: string;
  apiBaseUrl?: string;
  organization?: string;
  headers?: { [key: string]: string };
}
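
As an example, a config that requests JSON output and points at a custom endpoint might look like the sketch below; the environment variable name, base URL, and header are illustrative.

const providerConfig = {
  temperature: 0.2,
  max_tokens: 512,
  response_format: { type: 'json_object' },        // ask the model to return a JSON object
  apiKeyEnvar: 'OPENAI_API_KEY',                   // environment variable to read the API key from
  apiBaseUrl: 'https://example-proxy.internal/v1', // illustrative custom endpoint
  headers: { 'X-Request-Source': 'velvet-eval' },  // illustrative extra request header
};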

Evaluation

An evaluation represents a single set of inputs and evaluation metrics that is fed into all prompts and providers.

  • vars (Record<string, string | string[] | any> | string, optional): Key-value pairs to substitute in the prompt. If vars is a plain string, it can be used to load vars from a SQL query to your Velvet DB.
  • assert (Assertion[], optional): List of evaluation checks to run on the LLM output
  • threshold (number, optional): The test will fail if the combined score of the assertions is less than this number
  • options.transform (string, optional): A JavaScript snippet that runs on the LLM output before any assertions
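
A single evaluation entry might look like the following sketch. The variable names and threshold are illustrative, and the transform snippet assumes the raw completion is exposed to it as output.

const evaluation = {
  vars: {
    question: 'Summarize the refund policy in one sentence.',
    context: 'Refunds are available within 30 days of purchase.',
  },
  assert: [
    { type: 'similar', value: 'Refunds can be requested within 30 days.', threshold: 0.75 },
  ],
  threshold: 0.75,                // fail if the combined assertion score is below this value
  options: {
    transform: 'output.trim()',   // runs on the LLM output before assertions (assumes an `output` variable)
  },
};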

Assertion

An assertion is an evaluation that compares the LLM output against expected values or conditions. Different types of assertions can be used to validate the output in various ways, such as checking for equality, similarity, or custom functions.

  • type (string, required): Type of assertion
  • value (string, optional): The expected value, if applicable
  • threshold (number, optional): The threshold value, applicable only to certain types such as similar, cost, and javascript
  • metric (string, optional): The label for this result. Assertions with the same metric will be aggregated together
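
For example, the sketch below combines a similarity check with a custom javascript check; the metric labels are arbitrary, and the javascript snippet assumes the LLM output is exposed to it as output.

const assertions = [
  // Semantic similarity against a reference answer, scored with a threshold
  { type: 'similar', value: 'Paris is the capital of France.', threshold: 0.85, metric: 'accuracy' },
  // Custom JavaScript check; both assertions share the "accuracy" metric, so their scores are aggregated
  { type: 'javascript', value: "output.toLowerCase().includes('paris')", metric: 'accuracy' },
];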

See examples of Deterministic evaluation assertions and LLM based evaluation assertions.