Prompt caching is a powerful feature that optimizes API usage by enabling resumption from specific prefixes in your prompts.
This method greatly reduces processing time and costs for repetitive tasks or prompts with consistent components.
Prompt Caching is especially useful in situations like these:
Prompts with many examples
Large amounts of context or background information
The easiest way to see prompt caching in action is to include a large amount of context or background information in the prompt.
To demonstrate this, the examples below use a long document retrieved from Wikipedia.
import urllib.parse
import urllib.request
import json


def fetch_wikipedia_page(title: str, lang: str = "en"):
    """
    Fetch the content of a Wikipedia page using the Wikipedia API.

    Args:
        title (str): The title of the Wikipedia page to fetch.
        lang (str): The language code for the Wikipedia (default: "en").

    Returns:
        str: The plain text content of the Wikipedia page.
    """
    # Wikipedia API endpoint
    endpoint = f"https://{lang}.wikipedia.org/w/api.php"

    # Query parameters
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "titles": title,
        "explaintext": True,
    }

    # Encode the parameters and create the URL
    url = f"{endpoint}?{urllib.parse.urlencode(params)}"

    # Send the request and read the response
    with urllib.request.urlopen(url) as response:
        data = json.load(response)

    # Extract page content
    pages = data.get("query", {}).get("pages", {})
    for page_id, page in pages.items():
        if "extract" in page:
            return page["extract"]
    return "No content found for the given title."


# Fetch data from Wikipedia
title = "World War II"
content = fetch_wikipedia_page(title)
OpenAI
OpenAI Prompt Caching works automatically on all your API requests (no code changes required) and has no additional fees associated with it.
This can reduce latency by up to 80% and costs by 50% for long prompts. Caching is available for prompts containing 1024 tokens or more.
Models Supporting Prompt Caching
gpt-4o (excludes gpt-4o-2024-05-13 and chatgpt-4o-latest)
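Since OpenAI applies caching automatically, you can confirm it straight from the raw API response. Below is a minimal sketch using the official openai SDK; it assumes a recent openai package that exposes prompt_tokens_details and reuses the content variable fetched above. Send the same long prefix twice and compare cached_tokens. The rest of this section performs the same check through LangChain.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    # The long, repeated prefix (the Wikipedia article) is what OpenAI can cache automatically.
    {"role": "system", "content": f"You are an assistant who answers questions based on the provided document.\n<document>{content}</document>"},
    {"role": "user", "content": "Summarize the document in one sentence."},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)

# cached_tokens is non-zero only when an earlier request already wrote the same prefix to the cache.
details = response.usage.prompt_tokens_details
print(details.cached_tokens if details else 0)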
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini")

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            # The {content} placeholder is filled with the Wikipedia article fetched above.
            "You are an assistant who answers questions based on the provided document.\n<document>{content}</document>",
        ),
        (
            "human",
            "{question}",
        ),
    ]
)

chain = prompt | llm
first_response = chain.invoke({"content": content, "question": "When did Australia and New Zealand join the war?"})
second_response = chain.invoke({"content": content, "question": "Where did the first battle between Australia, New Zealand, and Japan take place?"})

# With OpenAI, cache hits show up under 'prompt_tokens_details' -> 'cached_tokens' (observed here with LangChain 0.3.29).
print(f"Answer: {first_response.content}")
print(f"Token Usage: {first_response.response_metadata}")
print()
print(f"Caching Answer: {second_response.content}")
print(f"Token Usage: {second_response.response_metadata}")
Answer: Australia and New Zealand joined World War II shortly after the outbreak of the war in Europe. Both countries declared war on Germany on 3 September 1939, following the United Kingdom's declaration of war on Germany after the invasion of Poland.
Token Usage: {'token_usage': {'completion_tokens': 49, 'prompt_tokens': 17389, 'total_tokens': 17438, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_72ed7ab54c', 'finish_reason': 'stop', 'logprobs': None}
Caching Answer: The first battle between Australia, New Zealand, and Japan took place at the Battle of Rabaul, which occurred in January 1942. This battle was part of the broader conflict in the Pacific during World War II.
Token Usage: {'token_usage': {'completion_tokens': 46, 'prompt_tokens': 17395, 'total_tokens': 17441, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 17152}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_72ed7ab54c', 'finish_reason': 'stop', 'logprobs': None}
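For convenience, the cached token count can be pulled out of response_metadata and compared between the two calls. This is a small sketch based on the metadata layout shown above; get_cached_tokens is a helper written for this tutorial, not part of LangChain.

def get_cached_tokens(response) -> int:
    """Return the number of cached prompt tokens reported in an OpenAI response."""
    usage = response.response_metadata.get("token_usage", {})
    return usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)

print(get_cached_tokens(first_response))   # 0     -> the first call writes the cache
print(get_cached_tokens(second_response))  # 17152 -> the second call reads from it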
Anthropic
Anthropic Prompt Caching requires the following minimum prompt lengths for caching:
1024 tokens for Claude 3.5 Sonnet and Claude 3 Opus
2048 tokens for Claude 3.5 Haiku and Claude 3 Haiku
[Note]
Shorter prompts cannot be cached, even if marked with cache_control.
The cache has a 5-minute time to live (TTL). Currently, ephemeral is the only supported cache type, corresponding to this 5-minute lifetime.
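Because prompts below these minimums are silently not cached, it can help to count tokens before relying on the cache. Here is a minimal sketch, assuming the token counting endpoint exposed by the anthropic SDK (client.messages.count_tokens) and reusing the content variable fetched above.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Count the tokens of the prompt we intend to cache.
count = client.messages.count_tokens(
    model="claude-3-5-haiku-latest",
    system=f"You are an assistant who answers questions based on the provided document.\n<document>{content}</document>",
    messages=[{"role": "user", "content": "Why was Yugoslavia invaded?"}],
)

# Claude 3.5 Haiku requires at least 2048 prompt tokens for caching.
print(count.input_tokens)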
Models Supporting Prompt Caching
Claude 3.5 Sonnet
Claude 3.5 Haiku
Claude 3 Haiku
Claude 3 Opus
While it has the drawback of requiring adherence to the Anthropic Message Style, a key advantage of Anthropic Prompt Caching is that you mark exactly which prefix to cache, and the minimum prompt length is far smaller than Google's Context Caching (covered below) requires.
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-3-5-haiku-latest")

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                # The {content} below is the Wikipedia article fetched above.
                "text": f"You are an assistant who answers questions based on the provided document.\n<document>{content}</document>",
                # Mark this block for caching; "ephemeral" is currently the only supported type.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Why was Yugoslavia invaded?"}],
    },
]
first_response = llm.invoke(messages)
print(f"Answer: {first_response.content}")
# Cache usage appears in response_metadata['usage'] as 'cache_creation_input_tokens' and 'cache_read_input_tokens'.
print(f"Token Usage: {first_response.response_metadata}")
Answer: According to the document, Yugoslavia was invaded by Germany and Italy as part of their broader operations in the Balkans. The specific details are mentioned in this passage:
"By late March 1941, Bulgaria and Yugoslavia signed the Tripartite Pact; however, the Yugoslav government was overthrown two days later by pro-British nationalists. Germany and Italy responded with simultaneous invasions of both Yugoslavia and Greece, commencing on 6 April 1941; both nations were forced to surrender within the month."
The invasion appears to have been a response to the overthrow of the government that had previously signed the Tripartite Pact. Germany and Italy saw this as a threat to their strategic interests in the region and quickly moved to occupy Yugoslavia. After the invasion, partisan warfare broke out against the Axis occupation, which continued until the end of the war.
Token Usage: {'id': 'msg_01N6edkmZ6NGT5RmZs85uFya', 'model': 'claude-3-5-haiku-20241022', 'stop_reason': 'end_turn', 'stop_sequence': None, 'usage': {'cache_creation_input_tokens': 18837, 'cache_read_input_tokens': 0, 'input_tokens': 12, 'output_tokens': 186}}
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                # The {content} below is the same Wikipedia article, so the cached prefix is identical.
                "text": f"You are an assistant who answers questions based on the provided document.\n<document>{content}</document>",
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Where was invaded after Yugoslavia?"}],
    },
]
second_response = llm.invoke(messages)
print(f"Answer: {second_response.content}")
# This time 'cache_read_input_tokens' is non-zero because the first call already cached the same document prefix.
print(f"Token Usage: {second_response.response_metadata}")
Answer: According to the document, after Yugoslavia was invaded by Germany and Italy, Greece was also invaded. Specifically, the text states: "Germany and Italy responded with simultaneous invasions of both Yugoslavia and Greece, commencing on 6 April 1941; both nations were forced to surrender within the month."
Token Usage: {'id': 'msg_019t8wXVpXpYbasNRb7WBrsv', 'model': 'claude-3-5-haiku-20241022', 'stop_reason': 'end_turn', 'stop_sequence': None, 'usage': {'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 18837, 'input_tokens': 13, 'output_tokens': 66}}
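As with OpenAI, the cache behavior can be read straight out of the response metadata. Below is a small sketch based on the usage layout shown above; anthropic_cache_usage is a helper written for this tutorial.

def anthropic_cache_usage(response) -> tuple[int, int]:
    """Return (cache_creation_input_tokens, cache_read_input_tokens) from a ChatAnthropic response."""
    usage = response.response_metadata.get("usage", {})
    return (
        usage.get("cache_creation_input_tokens", 0),
        usage.get("cache_read_input_tokens", 0),
    )

print(anthropic_cache_usage(first_response))   # (18837, 0) -> the first call writes the cache
print(anthropic_cache_usage(second_response))  # (0, 18837) -> the second call reads from it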
GoogleAI
Google refers to this feature as Context Caching rather than Prompt Caching, and it is aimed primarily at analyzing large inputs such as codebases, large document collections, long videos, and multiple audio files.
Here, we demonstrate how a cache created with google.generativeai can be used through ChatGoogleGenerativeAI from langchain_google_genai.
Google's Context Caching requires at least 32,768 input tokens.
To clear that minimum, we keep the example simple and cache three lengthy Wikipedia articles together.
longest_featured_list_in_wikipedia = "List of Falcon 9 and Falcon Heavy launches"
falcon_wiki = fetch_wikipedia_page(longest_featured_list_in_wikipedia)
longest_biography_in_wikipedia = "Vladimir Putin"
putin_wiki = fetch_wikipedia_page(longest_biography_in_wikipedia)
python_wiki_page = "Python (programming language)"
python_wiki = fetch_wikipedia_page(python_wiki_page)
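Before creating the cache, it is worth checking that the combined documents clear the 32,768-token minimum. A minimal sketch using count_tokens from google.generativeai follows; the exact total will vary as the articles are edited.

import google.generativeai as genai

# Count tokens for the combined documents with the same model we will cache against.
model = genai.GenerativeModel("models/gemini-1.5-flash-001")
token_count = model.count_tokens([falcon_wiki, putin_wiki, python_wiki])

print(token_count.total_tokens)  # must be at least 32,768 for Context Caching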
from langchain_google_genai import ChatGoogleGenerativeAI
from google.generativeai import caching
import datetime

cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    display_name="wikipedia-document-pages",  # used to identify the cache
    system_instruction=(
        "You are an expert in analyzing very long documents, and your job is to answer "
        "the user's query based on the documents you have access to."
    ),  # a long or complex system instruction can be supplied here
    contents=[falcon_wiki, putin_wiki, python_wiki],  # each document is passed as an item in the list
    ttl=datetime.timedelta(minutes=5),
)
print(cache)  # The model name used for caching must match the one passed to ChatGoogleGenerativeAI below.
llm = ChatGoogleGenerativeAI(model="models/gemini-1.5-flash-001", cached_content=cache.name)  # pass the cache's name so requests reuse the cached content
response = llm.invoke("How many Falcon Rockets launch conducted in 2025?")
print(response.content)
print(response.usage_metadata)  # 'input_token_details' -> 'cache_read' shows the cached token count
The text states that as of February 4th, 2025, SpaceX has conducted **15** Falcon family launches in 2025. All of these launches were conducted using the Falcon 9 rocket, with no Falcon Heavy launches.
{'input_tokens': 43408, 'output_tokens': 53, 'total_tokens': 43461, 'input_token_details': {'cache_read': 43394}}
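The cache expires on its own once the 5-minute TTL passes, but you can also remove it explicitly when you are done. A short sketch, assuming the delete() and list() methods on google.generativeai's CachedContent:

# Explicitly remove the cached content instead of waiting for the TTL to expire.
cache.delete()

# Listing the remaining caches confirms it is gone.
for c in caching.CachedContent.list():
    print(c.name)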