Request caching
Return faster results and reduce costs on LLM calls.
Round trips to LLM providers can be slow, often taking 2-3 seconds or more per request. With caching enabled, Velvet returns results for identical requests in milliseconds, and you don't pay the LLM provider to generate the response again.
Configure caching
To enable caching, first make sure the Velvet proxy is configured correctly when initializing your chosen provider. Then add velvet-cache-enabled as a header when sending a request to the provider’s endpoint.
See example code snippets from each provider:
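As a rough, provider-agnostic sketch, the caching header can be merged into whatever headers your client already sends. The helper name withVelvetCache and the header value "true" are assumptions made for illustration; only the header name velvet-cache-enabled comes from this page.

```javascript
// Hypothetical helper: adds the caching header to an existing header map.
// The header name is from the docs; the "true" value is an assumption.
function withVelvetCache(headers = {}) {
  return { ...headers, "velvet-cache-enabled": "true" };
}

// Example: headers for a JSON request sent through the proxy.
const requestHeaders = withVelvetCache({
  "Content-Type": "application/json",
});
```

Most provider SDKs accept extra default headers at client initialization, so a map like requestHeaders can usually be passed there rather than on every call.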
How it works
Cache key
By default, the cache key is a hash of the options in the request body.
For example, you send the following with cache enabled:
{
  messages: [{ role: "system", content: "You are a helpful assistant." }],
  model: "gpt-4o",
  temperature: 0.3,
}
All subsequent requests with an identical payload will return the cached response.
If you change or add a key in the payload, you'll create a new cache entry. For example, if you send this same payload with temperature: 0.3 and then send another with temperature: 0, you'll have two different cached requests.
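To make the idea of a body-derived key concrete, here is a minimal sketch. Velvet's actual hashing scheme is not described on this page, so the choice of SHA-256, the sorted-key serialization, and the names canonicalize and cacheKey are all assumptions for illustration.

```javascript
const { createHash } = require("node:crypto");

// Serialize with object keys sorted so property order does not affect the key.
// (A design choice of this sketch, not a documented Velvet behavior.)
function canonicalize(value) {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.keys(value)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(value[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Illustrative key: SHA-256 hex digest of the canonical request body (assumed).
function cacheKey(body) {
  return createHash("sha256").update(canonicalize(body)).digest("hex");
}

const base = {
  messages: [{ role: "system", content: "You are a helpful assistant." }],
  model: "gpt-4o",
  temperature: 0.3,
};

// An identical payload maps to the same key; a changed value maps to a new one.
const sameKey = cacheKey(base) === cacheKey({ ...base });
const differentKey = cacheKey(base) !== cacheKey({ ...base, temperature: 0 });
```

Under these assumptions, sameKey and differentKey are both true, which mirrors the temperature example above.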
If you want to use a different cache key strategy, you can send a velvet-cache-key header to override the default behavior.
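One possible custom strategy, sketched below, keys on the model and messages only, so requests that differ in sampling options share a cache entry. The function names, the SHA-256 digest, and the "true" header value are assumptions; this page does not specify what format velvet-cache-key values must take.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical strategy: ignore sampling options such as temperature.
function promptOnlyKey(body) {
  const relevant = { model: body.model, messages: body.messages };
  return createHash("sha256").update(JSON.stringify(relevant)).digest("hex");
}

// Builds the two caching headers for a given request body.
function customKeyHeaders(body) {
  return {
    "velvet-cache-enabled": "true", // value assumed
    "velvet-cache-key": promptOnlyKey(body),
  };
}
```

Whether sharing an entry across temperatures is desirable depends on your application; it trades response variety for a higher hit rate.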
Response headers
If the velvet-cache-enabled header is set, the gateway will respond with a velvet-cache-status header. Its value will be one of HIT, MISS, or NONE/UNKNOWN.
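A small sketch of checking this header on the client side follows. The function name wasServedFromCache is hypothetical; it accepts any object with a Headers-style get(name) method, such as the headers property of a fetch response.

```javascript
// Returns true only when the gateway reports a cache hit.
// A missing header or any other status (MISS, NONE/UNKNOWN) yields false.
function wasServedFromCache(headers) {
  return headers.get("velvet-cache-status") === "HIT";
}

// Usage with fetch (not executed here):
//   const res = await fetch(url, { method: "POST", headers, body });
//   if (wasServedFromCache(res.headers)) { /* served from the cache */ }
```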
Response body
The response body matches the format the LLM provider would return as closely as possible. That means you shouldn't need to change your code to handle cached vs non-cached responses.
For example, a request to /chat/completions with both streaming and caching enabled will still return an SSE (server-sent events) formatted response.
Cache TTL and revalidation
Velvet allows you to create cache keys with automatic expiration. By default, the cache doesn't expire. You can set a time-to-live (TTL) expiration using max-age={TTL} in the velvet-cache-ttl header. For instance, max-age=300 sets a 5-minute expiration on the cache key. The TTL is measured in seconds.
The velvet-cache-ttl header also supports cache invalidation. To invalidate a cached item, set max-age=0 in the velvet-cache-ttl header. An invalidation request will refresh the cache with new data.
See example code snippets for each provider:
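As a provider-agnostic sketch, the header value can be built from a number of seconds. The helper name cacheTtlHeader is hypothetical, and rejecting non-integer or negative values is a precaution of this sketch rather than a documented requirement.

```javascript
// Builds the velvet-cache-ttl header using the max-age={TTL} form described above.
function cacheTtlHeader(seconds) {
  if (!Number.isInteger(seconds) || seconds < 0) {
    throw new RangeError("TTL must be a non-negative whole number of seconds");
  }
  return { "velvet-cache-ttl": `max-age=${seconds}` };
}

const fiveMinutes = cacheTtlHeader(300); // entry expires after 5 minutes
const invalidate = cacheTtlHeader(0); // max-age=0 invalidates the cached item
```

Either object can be spread into the request headers alongside velvet-cache-enabled.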
Log metadata
Caching unlocks additional metadata stored with each log. Refer to this example when querying cached requests.
{
  "cache": {
    "key": "4b2af868add63c97308b3133062aed384afb1be7fd81f225da3b8d113d8af086",
    "value": "log_gz42yh5ecgd2e22q",
    "status": "HIT",
    "enabled": true
  },
  "model": "gpt-4o-2024-05-13",
  "stream": false,
  "cost": {
    "input_cost": 0,
    "total_cost": 0,
    "output_cost": 0,
    "input_cost_cents": 0,
    "total_cost_cents": 0,
    "output_cost_cents": 0
  },
  "usage": {
    "model": "gpt-4o-2024-05-13",
    "total_tokens": 0,
    "calculated_by": "js-tiktoken",
    "prompt_tokens": 0,
    "completion_tokens": 0
  },
  "expected_cost": {
    "input_cost": 0.00585,
    "total_cost": 0.00669,
    "output_cost": 0.00084,
    "input_cost_cents": 0.585,
    "total_cost_cents": 0.669,
    "output_cost_cents": 0.084
  },
  "expected_usage": {
    "model": "gpt-4o-2024-05-13",
    "total_tokens": 1226,
    "calculated_by": "openai",
    "prompt_tokens": 1170,
    "completion_tokens": 56
  }
}
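The metadata above can be used to estimate what a cache hit saved. The sketch below assumes that expected_cost represents what the request would have cost without the cache, which is an interpretation of the sample rather than something this page states; the function name cacheSavings is hypothetical.

```javascript
// Estimated savings for one log entry: expected cost minus actual cost.
// Returns 0 for entries that were not served from the cache.
function cacheSavings(log) {
  if (!log.cache || log.cache.status !== "HIT") return 0;
  return log.expected_cost.total_cost - log.cost.total_cost;
}

// Trimmed copy of the sample log, keeping only the fields used here.
const sampleLog = {
  cache: { status: "HIT", enabled: true },
  cost: { total_cost: 0 },
  expected_cost: { total_cost: 0.00669 },
};

const saved = cacheSavings(sampleLog); // 0.00669
```

Summing cacheSavings over a set of logs gives a rough total for a time period, under the same assumption.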
Additional caching considerations
Some providers offer prompt caching, which saves money on input tokens by letting you reuse the exact same prompt at a discount. See Anthropic's announcement for details.
Velvet's response caching (described above) is a separate feature: it returns a near-identical response without additional inference cost. You may want to experiment with provider prompt caching alongside it.