Request caching

Return faster results and reduce costs on LLM calls.

Round-trip times to LLM providers can be lengthy, often 2-3 seconds or more per request. With caching, we return results for identical queries in milliseconds. You also won't pay the LLM provider for the generated response.

Configure caching

To enable caching, first make sure the Velvet proxy is configured correctly when initializing your chosen provider. Then add velvet-cache-enabled as a header when sending a request to the provider’s endpoint.

See example code snippets for each provider:
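
For illustration, here's a minimal sketch using the OpenAI Node SDK. The gateway base URL and the header value shown are placeholders, so substitute the values from your own Velvet configuration.

import OpenAI from "openai";

// Placeholder values: point baseURL at your Velvet gateway and use your own keys.
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: process.env.VELVET_GATEWAY_URL, // your Velvet proxy endpoint
  defaultHeaders: {
    "velvet-cache-enabled": "true", // assumed value; enables response caching
  },
});

// Requests go through the proxy as usual; identical payloads are served from cache.
const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "system", content: "You are a helpful assistant." }],
  temperature: 0.3,
});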


How it works

Cache key

By default, the cache key is a hash of the options in the request body.

For example, suppose you send the following payload with caching enabled:

{
  messages: [{ role: "system", content: "You are a helpful assistant." }],
  model: "gpt-4o",
  temperature: 0.3,
}

All subsequent requests with an identical payload will return the cached response.

If you change or add a key to the payload, you'll create a new cached request. For example, if you send this same payload with temperature: 0.3 and then send another with temperature: 0, you'll have two different cached requests.
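
As a conceptual illustration, hashing the serialized request body shows why the two payloads above produce separate cache entries. This sketch assumes a SHA-256 hash over the JSON-serialized body; Velvet's actual canonicalization may differ.

import { createHash } from "node:crypto";

// Illustration only: the real cache-key derivation may canonicalize the body differently.
const cacheKey = (body: object) =>
  createHash("sha256").update(JSON.stringify(body)).digest("hex");

const base = {
  messages: [{ role: "system", content: "You are a helpful assistant." }],
  model: "gpt-4o",
};

console.log(cacheKey({ ...base, temperature: 0.3 })); // one cache entry
console.log(cacheKey({ ...base, temperature: 0 }));   // a different cache entry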

If you want to use a different cache key strategy, you can send a velvet-cache-key header to override the default behavior.
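
For example, with the OpenAI Node SDK you can pass the header on a single request. The key value here is just an arbitrary example string.

// Reusing the `openai` client from the earlier sketch.
const completion = await openai.chat.completions.create(
  {
    model: "gpt-4o",
    messages: [{ role: "user", content: "Hello" }],
  },
  { headers: { "velvet-cache-key": "greeting-v1" } } // custom cache key (example value)
);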


Response headers

If the velvet-cache-enabled header is set, the gateway will respond with a velvet-cache-status header.

velvet-cache-status will be one of HIT, MISS, or NONE/UNKNOWN.
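
As a sketch, you can read the header off the raw HTTP response. The gateway URL below is a placeholder, and the velvet-cache-enabled value of "true" is an assumption.

// Placeholder: set VELVET_GATEWAY_URL to your Velvet proxy endpoint.
const res = await fetch(`${process.env.VELVET_GATEWAY_URL}/chat/completions`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "velvet-cache-enabled": "true",
  },
  body: JSON.stringify({
    model: "gpt-4o",
    messages: [{ role: "system", content: "You are a helpful assistant." }],
  }),
});

console.log(res.headers.get("velvet-cache-status")); // "HIT", "MISS", or "NONE"/"UNKNOWN"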


Response body

The response body is returned as closely as possible to how the LLM provider would respond to the request. That means you shouldn't need to change your code to handle cached vs. non-cached responses.

For example, a request to /chat/completions with streaming enabled and caching enabled will still return an SSE (server-sent events) formatted response.
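
For instance, continuing the client sketch above, the same streaming code works whether the response comes from the cache or from the provider.

const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Explain SSE in one sentence." }],
  stream: true,
});

// Chunks arrive as SSE deltas either way; no cache-specific handling is needed.
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}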


Cache TTL and revalidation

Velvet allows you to create cache keys with automatic expiration. By default, the cache doesn't expire. You can set a time-to-live (TTL) expiration, measured in seconds, using max-age={TTL} in the velvet-cache-ttl header. For instance, max-age=300 sets a 5-minute expiration on the cache key.

The velvet-cache-ttl header also supports cache invalidation. To invalidate a cached item, set max-age=0; an invalidation request refreshes the cache with new data.

See example code snippets for each provider:
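
As a sketch with the OpenAI Node SDK, reusing the client configured earlier (the header values follow the max-age format described above):

// Cache this request's response for 5 minutes (TTL in seconds).
await openai.chat.completions.create(
  { model: "gpt-4o", messages: [{ role: "user", content: "Hello" }] },
  { headers: { "velvet-cache-enabled": "true", "velvet-cache-ttl": "max-age=300" } }
);

// Invalidate the cached entry for the same payload and refresh it with new data.
await openai.chat.completions.create(
  { model: "gpt-4o", messages: [{ role: "user", content: "Hello" }] },
  { headers: { "velvet-cache-enabled": "true", "velvet-cache-ttl": "max-age=0" } }
);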


Log metadata

Caching unlocks additional metadata stored with each log. Refer to the example below when querying cached requests.

{
  "cache": {
    "key": "4b2af868add63c97308b3133062aed384afb1be7fd81f225da3b8d113d8af086",
    "value": "log_gz42yh5ecgd2e22q",
    "status": "HIT",
    "enabled": true
  },
  "model": "gpt-4o-2024-05-13",
  "stream": false,
  "cost": {
    "input_cost": 0,
    "total_cost": 0,
    "output_cost": 0,
    "input_cost_cents": 0,
    "total_cost_cents": 0,
    "output_cost_cents": 0
  },
  "usage": {
    "model": "gpt-4o-2024-05-13",
    "total_tokens": 0,
    "calculated_by": "js-tiktoken",
    "prompt_tokens": 0,
    "completion_tokens": 0
  },
  "expected_cost": {
    "input_cost": 0.00585,
    "total_cost": 0.00669,
    "output_cost": 0.00084,
    "input_cost_cents": 0.585,
    "total_cost_cents": 0.669,
    "output_cost_cents": 0.084
  },
  "expected_usage": {
    "model": "gpt-4o-2024-05-13",
    "total_tokens": 1226,
    "calculated_by": "openai",
    "prompt_tokens": 1170,
    "completion_tokens": 56
  }
}
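
For example, this hypothetical helper sums the inference cost avoided by cache hits. The field names follow the example log above; how you retrieve the log entries depends on your own setup.

// Hypothetical helper: the field names mirror the example log entry above.
type VelvetLog = {
  cache: { status: "HIT" | "MISS" | "NONE" | "UNKNOWN"; enabled: boolean };
  cost: { total_cost: number };
  expected_cost: { total_cost: number };
};

function cacheSavings(logs: VelvetLog[]): number {
  return logs
    .filter((log) => log.cache.enabled && log.cache.status === "HIT")
    .reduce((sum, log) => sum + log.expected_cost.total_cost - log.cost.total_cost, 0);
}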

Additional caching considerations

Some providers offer prompt caching, which saves you money on input tokens by letting you reuse the exact same prompt at a discount. See Anthropic's announcement here. Consider experimenting with this feature alongside Velvet's response caching.

Velvet supports response caching (see the docs above) to reproduce a near-identical response without additional inference cost.