UI configuration

Set up a standard experiment in the Velvet app

Set up experiments to test models, settings, and metrics. Replay experiments against historical logs to understand the performance of each variant.

UI-based configuration includes common models, provider configuration, and metrics. For more unique or complex use cases, refer to our API-based configuration docs.

How it works

  1. Select the logs for your experiment
  2. Configure evaluation
  3. Review experiment results

(1) Create a new experiment

Navigate to the evaluations tab inside your workspace to get started.

  1. Click the 'new evaluation' button and select 'experiment'.
  2. Select the dataset (logs) you want to run tests against.
  3. Select which model(s) you want to test.
  4. Define the metric(s) you want to test.
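
Conceptually, these four steps define an experiment configuration: a set of logs to replay, one or more model variants, and the metrics to grade each variant against. The sketch below is a hypothetical illustration of that shape only; the field names are not Velvet's actual schema, and the UI (or the API-based configuration docs) remains the source of truth.

```python
# Hypothetical sketch of what an experiment configuration captures.
# Field names are illustrative only, not Velvet's actual schema.
experiment = {
    "name": "gpt-4o-mini vs gpt-4o",
    "dataset": "logs:last-7-days",        # the historical logs to replay
    "models": ["gpt-4o-mini", "gpt-4o"],  # the model variant(s) to test
    "metrics": [
        {"type": "latency", "threshold_ms": 2000},
        {"type": "is-json"},
        {"type": "icontains", "value": "refund"},
    ],
}
```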


The gateway supports these providers and models out-of-the-box. We can support additional providers and models on our paid plans.

| Provider | Supported models |
| --- | --- |
| OpenAI | gpt-4o-mini, gpt-4o, gpt-4-turbo, gpt-4, gpt-3.5-turbo, and more |
| Anthropic | claude-3-5-sonnet-20241022, claude-3-5-sonnet-20240620, claude-3-5-haiku-20241022, claude-3-haiku-20240307, claude-3-sonnet-20240229, claude-3-opus-20240229 |
| Other | Email [email protected] for additional provider support |
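
For intuition, the sketch below shows what replaying a single logged prompt against two of the models above looks like when done by hand with the OpenAI Python SDK. It is purely illustrative: Velvet performs this replay for you when an experiment runs, and the logged request shown here is a hypothetical shape rather than Velvet's log schema.

```python
# Illustrative only: replay one logged prompt against two candidate models.
# Velvet runs this replay automatically when an experiment executes.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A single historical log entry (hypothetical shape, not Velvet's schema).
logged_request = {
    "messages": [
        {"role": "user", "content": "Summarize our refund policy in one sentence."}
    ]
}

# The model variants selected for the experiment.
variants = ["gpt-4o-mini", "gpt-4o"]

for model in variants:
    response = client.chat.completions.create(
        model=model,
        messages=logged_request["messages"],
    )
    print(model, "->", response.choices[0].message.content)
```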

You can use any of these metrics when configuring an evaluation in the Velvet app. For additional flexibility, see our API configuration docs.

| Metric | Description |
| --- | --- |
| latency | Latency is below a threshold (milliseconds) |
| cost | Inference cost is below a threshold |
| llm-rubric | LLM output matches a given rubric, using a language model to grade the output |
| equals | Output matches exactly |
| is-json | Output is valid JSON (optional JSON schema validation) |
| perplexity | Perplexity is below a threshold |
| similar | Embeddings and cosine similarity are above a threshold |
| answer-relevance | Ensure that LLM output is related to the original query |
| context-faithfulness | Ensure that LLM output uses the context |
| context-recall | Ensure that ground truth appears in the context |
| context-relevance | Ensure that the context is relevant to the original query |
| factuality | LLM output adheres to the given facts, using the Factuality method from OpenAI evals |
| model-graded-close-qa | LLM output adheres to given criteria, using the Closed QA method from OpenAI evals |
| select-best | Compare multiple outputs for a test case and pick the best one |
| contains | Output contains a substring |
| contains-all | Output contains all of the listed substrings |
| contains-any | Output contains any of the listed substrings |
| contain-json | Output contains valid JSON (optional JSON schema validation) |
| icontains | Output contains a substring, case-insensitive |
| icontains-all | Output contains all of the listed substrings, case-insensitive |
| icontains-any | Output contains any of the listed substrings, case-insensitive |
| javascript | Provided JavaScript function validates the output |
| levenshtein | Levenshtein distance is below a threshold |
| regex | Output matches the regex |
| other | Email [email protected] for additional metric support |
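
To make the deterministic checks concrete, here is a minimal sketch re-implementing three of the metrics above (is-json, icontains-any, levenshtein) in plain Python. It exists only to show what each check asserts; Velvet evaluates these metrics for you, and thresholds are set when you configure the evaluation.

```python
# Minimal re-implementations of a few deterministic metrics from the table,
# shown only to illustrate what each check asserts. None of this code is
# needed to run an experiment in Velvet.
import json


def is_json(output: str) -> bool:
    """is-json: output parses as valid JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def icontains_any(output: str, substrings: list[str]) -> bool:
    """icontains-any: output contains any listed substring, case-insensitive."""
    lowered = output.lower()
    return any(s.lower() in lowered for s in substrings)


def levenshtein(a: str, b: str) -> int:
    """levenshtein: edit distance between the output and an expected string."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


output = '{"status": "refunded"}'
print(is_json(output))                                    # True
print(icontains_any(output, ["Refund", "credit"]))        # True
print(levenshtein(output, '{"status":"refunded"}') <= 3)  # True (distance 1)
```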

(2) Review experiment results

Once your experiment is configured, navigate to the experiments tab and click into the evaluation you want to review.

Experiments run once and include as many variants as you define. You'll receive a weekly email summary.


Watch a video overview


Email [email protected] with any questions.