Prompt caching in elvex

Understand how elvex automatically uses provider-specific caches to save money and speed up responses.

What is prompt caching?

Model prompts often contain repetitive content, like the common instructions in your assistant, that is sent along with every request. Conversations also often include historical messages whose content does not change between message turns (unless you fork a message).

The longer a prompt is, the slower the model is to respond (the process of generating a response is called "inference") and the more costly the request. This is because every prompt goes through a pre-processing phase in which, among other things, the text is converted into tokens.
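To make tokenization concrete, here is a minimal sketch using OpenAI's tiktoken library (the library choice and model name are illustrative; other providers tokenize differently):

```python
import tiktoken

# Fetch the tokenizer used by a given OpenAI model
# (model name is illustrative).
enc = tiktoken.encoding_for_model("gpt-4o")

prompt = "You are a helpful assistant. Always answer concisely."
tokens = enc.encode(prompt)

# The model operates on these integer token IDs, not raw text.
print(len(tokens), "tokens:", tokens)
```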

As a prompt approaches a model's context window limit, these cost and latency penalties become more pronounced.

Some providers support the ability to cache (or save) prompts that have been sent in previous requests. Effectively, this means saving the results of the pre-processing phase so it doesn't need to be repeated on every request. Doing so can reduce latency by up to 80%, and cached input tokens can cost as little as 0.1x the normal input-token price (prompt caching has no effect on output tokens).
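For a sense of how this works under the hood, here is a minimal sketch of explicit cache markup using Anthropic's Messages API (elvex handles this for you automatically; the model name and instruction text are illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Long, repeated assistant instructions go here...",
            # Ask the provider to cache everything up to this point so
            # later requests with the same prefix skip pre-processing.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Hello!"}],
)

# usage reports how many input tokens were written to / read from the cache
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```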

How much can prompt caching save?

Many factors play a role here, including your assistant's instructions, the length of the conversation, any actions you may be using, and the provider and model, which makes exact savings hard to estimate. The general rule of thumb: the more information you use in an assistant or conversation, the more important prompt caching becomes.

To get a rough sense, consult model pages like OpenAI's or Anthropic's to understand what savings will be applied.

Does elvex use prompt caching? If so, which providers?

Yes. Wherever possible, elvex uses prompt caching to save you money and reduce latency in responses from model providers.

The following providers support prompt caching in elvex:

  • OpenAI (including Azure)

  • Anthropic

  • Google

  • Bedrock

Is there a catch to using this?

Writing to most provider caches costs more than the standard price for input tokens. For example, at the time of writing, Anthropic's Sonnet 4 model costs $3 per million input tokens normally but $3.75 per million input tokens when writing to the cache.

Cache writes, however, are infrequent compared to cache reads, so this trade-off usually works out in your favor, especially when a conversation uses a good portion of a model's context window.
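To see why, here is a back-of-the-envelope sketch using the Sonnet 4 prices above and a 0.1x cache-read rate (the prompt size and request count are illustrative assumptions):

```python
# Illustrative Sonnet 4 prices, per million input tokens (see above).
BASE = 3.00         # normal input-token price
CACHE_WRITE = 3.75  # price when writing the prompt to the cache
CACHE_READ = 0.30   # 0.1x the base price when reading a cache hit

PROMPT_TOKENS = 10_000  # assumed size of the shared, cacheable prefix
REQUESTS = 10           # assumed number of requests reusing that prefix

without_cache = REQUESTS * PROMPT_TOKENS * BASE / 1_000_000
with_cache = (PROMPT_TOKENS * CACHE_WRITE          # one cache write
              + (REQUESTS - 1) * PROMPT_TOKENS * CACHE_READ) / 1_000_000

print(f"without caching: ${without_cache:.4f}")  # $0.3000
print(f"with caching:    ${with_cache:.4f}")     # $0.0645 (~78% cheaper)
```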

Can I see the cost savings of prompt caching in elvex?

Not at this time, but this is functionality we may add in the future.
