Control LLM costs with context caching



Introduction

Some large language models (LLMs), such as Gemini 1.5 Flash or Gemini 1.5 Pro, have a very large context window. This is very useful if you want to analyze a big chunk of data, such as a whole book or a long video. On the other hand, it can get quite expensive if you keep sending the same large data in your prompts. Context caching can help.

Context caching helps reduce costs when a substantial context is referenced repeatedly by shorter requests, such as:

  • Chatbots with extensive system instructions
  • Repetitive analysis of lengthy video files
  • Recurring queries against large document sets
  • Frequent code repository analysis or bug fixing

Let’s walk through an example main.py.
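
The full main.py isn’t reproduced here, but a minimal sketch of how it might be wired together could look like the following. The argument and command names are taken from the invocations shown below; the stub functions stand in for the snippets covered in the rest of the post, and the us-central1 region is an assumption that matches the cache resource names later on.

import argparse

import vertexai


def generate_content(cache_id=None):
    ...  # see the "Without cached content" and "Generate with cached content" sections


def create_cached_content():
    ...  # see the "Create cached content" section


def delete_cached_content(cache_id):
    ...  # see the "Delete cached content" section


def main():
    parser = argparse.ArgumentParser(description="Context caching example")
    parser.add_argument("--project_id", required=True)
    parser.add_argument("--cache_id", help="Resource name of an existing cached content")
    parser.add_argument("command", choices=[
        "generate_content", "create_cached_content", "delete_cached_content"])
    args = parser.parse_args()

    # Initialize the Vertex AI SDK for the given project and region
    vertexai.init(project=args.project_id, location="us-central1")

    if args.command == "generate_content":
        generate_content(args.cache_id)
    elif args.command == "create_cached_content":
        create_cached_content()
    else:
        delete_cached_content(args.cache_id)


if __name__ == "__main__":
    main()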

Without cached content

First, let’s ask the LLM a question without context caching:

from vertexai.generative_models import GenerativeModel

# Ask the model a question without providing any context
model = GenerativeModel('gemini-1.5-flash-001')
prompt = "What are the papers about?"
response = model.generate_content(prompt)

Run it:

python main.py --project_id your-project-id generate_content

Not surprisingly, the LLM does not know what papers you’re referring to:

Prompt: What are the papers about?
Response: Please provide me with the papers you are referring to.
I need the titles, authors, and any other relevant information about the papers
for me to be able to tell you what they are about.

Create cached content

Let’s create cached content that can be used as context. The content has a system instruction and refers to a couple of PDFs:

import datetime

from vertexai import caching  # in older SDK versions: from vertexai.preview import caching
from vertexai.generative_models import Part

system_instruction = """
You are an expert researcher. You always stick to the facts in the sources provided, and never make up new facts.
Now look at these research papers, and answer the following questions.
"""

contents = [
    Part.from_uri(
        "gs://cloud-samples-data/generative-ai/pdf/2312.11805v3.pdf",
        mime_type="application/pdf",
    ),
    Part.from_uri(
        "gs://cloud-samples-data/generative-ai/pdf/2403.05530.pdf",
        mime_type="application/pdf",
    ),
]

cached_content = caching.CachedContent.create(
    model_name="gemini-1.5-pro-001",
    system_instruction=system_instruction,
    contents=contents,
    ttl=datetime.timedelta(minutes=60),
)

Run it:

python main.py --project_id your-project-id create_cached_content

You get back information about the cached content:

Cached content: <vertexai.caching._caching.CachedContent object at 0x12b652e50>: {
  "name": "projects/207195257545/locations/us-central1/cachedContents/3207266622129569792",
  "model": "projects/genai-atamel/locations/us-central1/publishers/google/models/gemini-1.5-pro-001",
  "createTime": "2024-06-28T09:59:37.340430Z",
  "updateTime": "2024-06-28T09:59:37.340430Z",
  "expireTime": "2024-06-28T10:59:37.329458Z"
}

The cache expires at the end of its TTL. We set it to 60 minutes above, which is also the default expiration time.
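
If a cache is about to expire but you still need it, you can extend its lifetime instead of recreating it. Here’s a minimal sketch, assuming your version of the Vertex AI SDK exposes CachedContent.list() and CachedContent.update(); the cache_id value is a placeholder:

import datetime

from vertexai import caching  # in older SDK versions: from vertexai.preview import caching

# List the cached contents in the project to find the one you want
for cc in caching.CachedContent.list():
    print(cc.name, cc.expire_time)

# Extend the expiration of a specific cache by another two hours
cache_id = "projects/your-project-number/locations/us-central1/cachedContents/your-cache-id"
cached_content = caching.CachedContent(cached_content_name=cache_id)
cached_content.update(ttl=datetime.timedelta(hours=2))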

Generate with cached content

Now, let’s try generating with the cached content we just created:

# Look up the cached content by its resource name (the cache_id)
cached_content = caching.CachedContent(cached_content_name=cache_id)

# Create a model that uses the cached content as its context
model = GenerativeModel.from_cached_content(cached_content=cached_content)
prompt = "What are the papers about?"
response = model.generate_content(prompt)

Run it:

python main.py --project_id your-project-id generate_content \
  --cache_id projects/207195257545/locations/us-central1/cachedContents/3207266622129569792

The model now knows what the papers are about from the cached content:

Prompt: What are the papers about?
Response: The provided text describes the Gemini family of models, versions 1.0 and 1.5, focusing on 1.5 Pro. 

**Gemini 1.0** is a family of multimodal models trained on text, ...

That’s great.

The bigger the cached content and the more prompts that reference it, the more money you save in the long run. Cached input tokens are billed at a reduced rate when they are included in subsequent prompts.
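
You can check this in the response’s usage metadata. Continuing from the cached-content example above, here’s a minimal sketch, assuming your SDK version exposes the cached_content_token_count field:

# Continuing from the generate_content call above
usage = response.usage_metadata
print("Prompt tokens:", usage.prompt_token_count)
print("Cached tokens:", usage.cached_content_token_count)  # billed at the reduced cached rate
print("Output tokens:", usage.candidates_token_count)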

Delete cached content

You can either let the cache expire or delete it manually before it expires:

# Look up the cached content by its resource name and delete it
cached_content = caching.CachedContent(cached_content_name=cache_id)
cached_content.delete()

Run it:

python main.py --project_id your-project-id delete_cached_content \
  --cache_id projects/207195257545/locations/us-central1/cachedContents/3207266622129569792

The cache is deleted:

Cached content deleted: 3207266622129569792

Conclusion

In this blog post, I explored how you can control LLM costs with context caching. If you want to learn more, check out the resources listed below.


See also