Monitoring

Continuously test models, settings, and metrics in production

Sample logs from your LLM-powered features in production, and get weekly alerts on performance.

How it works

  1. Select the dataset and test frequency
  2. Configure evaluation
  3. Review results and get weekly updates

(1) Create a new test

Navigate to the evaluations tab inside your workspace to get started.

  1. Click the 'new evaluation' button and select 'monitoring'.
  2. Select the dataset (logs) you want to run tests against.
  3. Select which model you want to test. The default is pulled from the selected log.
  4. Select the metric you want to test. The default is pulled from the selected log.


The gateway supports these providers and models out of the box, including all model versions, such as gpt-4o-2024-11-20 and claude-3-5-sonnet-20241022. We can support additional providers and models on our paid plans.

Provider support | Model
OpenAI | o1-preview, o1-mini, gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4, gpt-3.5-turbo, and more
Anthropic | claude-3-5-sonnet, claude-3-5-haiku, claude-3-opus, claude-3-haiku, claude-3-sonnet, and more
Other | Email [email protected] for additional provider support
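
Requests routed through the gateway become the production logs you sample for monitoring. Below is a minimal sketch of a proxy-style setup using the OpenAI Node SDK; the gateway base URL and header name are placeholders for illustration, not documented values, so substitute the endpoint and credentials shown in your Velvet workspace.

```typescript
import OpenAI from "openai";

// Minimal sketch of a proxy-style gateway setup. The base URL and header
// name below are PLACEHOLDERS, not documented values -- copy the real
// endpoint and auth header from your Velvet workspace settings.
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://gateway.example.com/openai/v1", // placeholder gateway endpoint
  defaultHeaders: {
    "x-gateway-api-key": process.env.VELVET_API_KEY ?? "", // placeholder header name
  },
});

async function main() {
  // Any supported model string works here, including pinned versions
  // such as "gpt-4o-2024-11-20".
  const completion = await client.chat.completions.create({
    model: "gpt-4o-2024-11-20",
    messages: [{ role: "user", content: "Summarize this support ticket." }],
  });
  console.log(completion.choices[0].message.content);
}

main();
```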

You can use these metrics when configuring an evaluation in the Velvet app. For additional flexibility, see our API configuration docs.

Metric support | Description
latency | Latency is below a threshold (milliseconds)
cost | Inference cost is below a threshold
llm-rubric | LLM output matches a given rubric, using a language model to grade the output
equals | Output matches exactly
is-json | Output is valid JSON (optional JSON schema validation)
perplexity | Perplexity is below a threshold
similar | Embedding cosine similarity is above a threshold
answer-relevance | Ensures the LLM output is related to the original query
context-faithfulness | Ensures the LLM output uses the context
context-recall | Ensures the ground truth appears in the context
context-relevance | Ensures the context is relevant to the original query
factuality | LLM output adheres to the given facts, using the Factuality method from OpenAI evals
model-graded-close-qa | LLM output adheres to given criteria, using the Closed QA method from OpenAI evals
select-best | Compares multiple outputs for a test case and picks the best one
contains | Output contains the substring
contains-all | Output contains all of the listed substrings
contains-any | Output contains any of the listed substrings
contains-json | Output contains valid JSON (optional JSON schema validation)
icontains | Output contains the substring, case-insensitive
icontains-all | Output contains all of the listed substrings, case-insensitive
icontains-any | Output contains any of the listed substrings, case-insensitive
javascript | Provided JavaScript function validates the output
levenshtein | Levenshtein distance is below a threshold
regex | Output matches the regex
other | Email [email protected] for additional metric support
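
Most metrics take a small amount of configuration: a threshold for latency, cost, perplexity, or levenshtein; a rubric string for llm-rubric; a list of substrings for the contains family. The sketch below is illustrative only, and every field name in it is an assumption rather than Velvet's documented API schema; see the API configuration docs for the exact shape.

```typescript
// Illustrative only: the field names here (dataset, schedule, metrics, etc.)
// are assumptions for the sake of example, not a documented API schema --
// see the API configuration docs for the exact shape.
const monitoringEvaluation = {
  dataset: "production-chat-logs",        // the sampled logs to test against
  schedule: "daily",                      // how often the test runs
  model: "gpt-4o",                        // defaults to the model in the selected log
  metrics: [
    { type: "latency", threshold: 2000 },          // milliseconds
    { type: "is-json" },                           // optionally validate against a JSON schema
    { type: "icontains-any", values: ["refund", "credit"] },
    { type: "llm-rubric", rubric: "The reply is polite and directly answers the user's question." },
  ],
};
```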

(2) Review ongoing test results

Once configured, navigate to the tests tab and click into the evaluation you want to review.

Tests run automatically at the defined interval, and you'll get a weekly email summary of ongoing tests.


Watch a video overview

Email [email protected] with any questions.