
Prompt caching in elvex

Understand how elvex automatically uses provider-specific caches to save money and make things faster.

Updated over a week ago

What is prompt caching?

Model prompts often contain repetitive content like common instructions in your agent that are sent along with every request. Conversations also often include historical messages whose content does not change between message turns (unless you fork a message).

The longer the prompt, the slower the model is to respond and the more costly the request. This is because, before inference (the process of generating a response), prompts go through a pre-processing phase in which, among other things, the prompt text is converted into tokens.

As the prompt approaches a model's context window limit, these cost and latency penalties become more pronounced.

Some providers support the ability to cache (or save) prompts that have been sent in previous requests. Effectively this means saving the results of the pre-processing so it doesn't need to be done per-request. Doing so can reduce latency by up to 80% and cost a fraction (as much as 0.1x) of the normal price for input tokens (prompt caching has no effect on output tokens).

How much can prompt caching save?

Many factors come into play here, including your agent's instructions, the length of the conversation, the actions you use, and the provider and model, which makes exact savings hard to estimate. The general rule of thumb: the more information an agent or conversation uses, the more important prompt caching becomes.

To get a rough sense, consult model pages like OpenAI's or Anthropic's to understand what savings will be applied.
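As a rough, illustrative sketch of the arithmetic (the per-token prices below are hypothetical placeholders, not real provider rates; the 0.1x cached-read multiplier is the best case mentioned above):

```python
# Illustrative cost comparison for cached vs uncached input tokens.
# All prices are hypothetical placeholders; check your provider's
# model page for real rates.
input_price = 3.00        # $ per 1M input tokens (hypothetical)
cached_read_price = 0.30  # $ per 1M cached input tokens (0.1x best case)

prompt_tokens = 50_000    # e.g. long instructions + conversation history
uncached_cost = prompt_tokens / 1_000_000 * input_price
cached_cost = prompt_tokens / 1_000_000 * cached_read_price

print(f"uncached: ${uncached_cost:.4f}")  # uncached: $0.1500
print(f"cached:   ${cached_cost:.4f}")    # cached:   $0.0150
```

The larger the repeated portion of the prompt, the wider this gap grows.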

Does elvex use prompt caching? If so, which providers?

Yes. Wherever possible, elvex uses prompt caching to save you money and reduce response latency from model providers.

The following providers support prompt caching in elvex:

  • OpenAI (including Azure)

  • Anthropic

  • Google

  • Bedrock

How does elvex optimize prompt caching?

Prompt caching works on a "longest matching prefix" approach, which means the AI provider caches the beginning portion of a prompt that matches exactly with previous requests.
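To make the "longest matching prefix" idea concrete, here is a simplified sketch. Real providers match on token sequences and cache in larger blocks, so this is only an illustration of the principle, not any provider's actual algorithm:

```python
def longest_matching_prefix(cached: list[str], incoming: list[str]) -> int:
    """Count how many leading tokens of `incoming` match the cached prompt.

    Simplified illustration: real providers operate on token IDs and
    cache in coarser chunks, but the prefix-matching idea is the same.
    """
    count = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        count += 1
    return count

previous = ["You", "are", "a", "helpful", "agent", ".", "User:", "Alice"]
current  = ["You", "are", "a", "helpful", "agent", ".", "User:", "Bob"]

hits = longest_matching_prefix(previous, current)
print(hits)  # 7: everything before the first differing token is reusable
```

Note that a single differing token invalidates everything after it, which is why the ordering of prompt content matters so much.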

elvex optimizes for this by strategically ordering information in the system message:

  • Static content first: Agent instructions, rules, and other unchanging information are placed at the beginning of the prompt

  • Personalized content last: User-specific details (like your name, email, and the current date) are placed at the very end of the system message

This means that for each agent (including the home agent), a large portion of the first message benefits from caching. However, once elvex adds personalized information, it invalidates the remainder of the cache.
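A hypothetical sketch of this ordering (not elvex's actual implementation; the instruction strings and function are made up for illustration):

```python
from datetime import date

# Hypothetical example of assembling a system message so the stable
# content forms a cacheable prefix and the personalized content sits
# at the end, where it invalidates as little of the cache as possible.
AGENT_INSTRUCTIONS = "You are a research assistant. Follow the rules below."
AGENT_RULES = "1. Cite sources. 2. Be concise."

def build_system_message(user_name: str, user_email: str) -> str:
    static_part = f"{AGENT_INSTRUCTIONS}\n{AGENT_RULES}"  # cacheable prefix
    personalized = (                                      # cache-busting tail
        f"\nUser: {user_name} <{user_email}>"
        f"\nDate: {date.today().isoformat()}"
    )
    return static_part + personalized

msg_a = build_system_message("Alice", "alice@example.com")
msg_b = build_system_message("Bob", "bob@example.com")

# Different users share the same static prefix, so the provider can
# reuse the cached pre-processing for that portion of the prompt.
print(msg_a.startswith(AGENT_INSTRUCTIONS) and msg_b.startswith(AGENT_INSTRUCTIONS))
```

If the personalized details came first instead, the very first differing token would invalidate the entire cached prefix.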

When caching works best

  • The same user making requests on the same day will benefit from cache reads (only a few new tokens need to be written)

  • Within a single conversation, all previous messages in the conversation history remain cached, so the model doesn't need to re-calculate token vectors for those messages

When cache writes occur

  • Different users making requests (even to the same agent) will trigger cache writes due to different personalized information

  • The same user making requests on different days will trigger partial cache writes due to the date change

elvex also optimizes by sending only the current date (in your timezone) rather than timestamps down to the second, which would invalidate the cache on every single request.
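A small illustration of why date granularity matters for the cache (the message strings here are hypothetical, not elvex's actual format):

```python
from datetime import datetime, timezone

# Two requests from the same user on the same day, a few hours apart.
first = datetime(2024, 6, 1, 9, 15, 0, tzinfo=timezone.utc)
second = datetime(2024, 6, 1, 14, 30, 0, tzinfo=timezone.utc)

# Full timestamps differ on every request, so any cached prefix that
# includes them is invalidated from that point onward each time.
with_timestamp_a = f"Current time: {first.isoformat()}"
with_timestamp_b = f"Current time: {second.isoformat()}"

# Dates stay identical all day, so the cached prefix survives.
with_date_a = f"Current date: {first.date().isoformat()}"
with_date_b = f"Current date: {second.date().isoformat()}"

print(with_timestamp_a == with_timestamp_b)  # False: busts the cache
print(with_date_a == with_date_b)            # True: cache stays valid all day
```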

Is there a catch to using this?

Writing to most provider caches costs more than the standard price for input tokens. However, cache writes are infrequent compared to cache reads, so this trade-off usually balances out, especially when a conversation uses a good portion of a model's context window.

Can I see the cost savings of prompt caching in elvex?

Not at this time, but this is functionality we may add in the future.
