Prompt Caching
Author: PangPangGod
Peer Review: byoon, Wonyoung Lee
Proofread: BokyungisaGod
This is a part of the LangChain Open Tutorial
Overview
Prompt caching is a powerful feature that optimizes API usage by enabling resumption from specific prefixes in your prompts. This method greatly reduces processing time and costs for repetitive tasks or prompts with consistent components.
Prompt caching is especially useful in situations such as:
Prompts with many examples
Large amounts of context or background information
Repetitive tasks with consistent instructions
Long multi-turn conversations
Table of Contents
Overview
Environment Setup
Fetch Data
OpenAI
Anthropic
GoogleAI
References
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
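If you prefer not to use the helper package, a minimal sketch of the same setup is shown below: install the provider integrations and export the API keys as environment variables. The package names reflect the integrations used later in this tutorial and are assumptions about your environment.

```python
# Install the provider integrations used later in this tutorial, e.g.:
# pip install -U langchain-openai langchain-anthropic langchain-google-genai langchain-community wikipedia

import os
from getpass import getpass

# Prompt for any API key that is not already set in the environment.
for key in ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]:
    if not os.environ.get(key):
        os.environ[key] = getpass(f"{key}: ")
```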
Fetch Data
The easiest way to verify prompt caching is by including large amounts of context or background information. To demonstrate this, I have provided a simple example using a long document retrieved from Wikipedia.
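A minimal sketch of this step is shown below, using LangChain's WikipediaLoader. The article title and character limit are arbitrary illustrative choices, and the resulting context string is reused in the provider examples that follow.

```python
from langchain_community.document_loaders import WikipediaLoader

# Load one long article to use as shared context (example query; adjust freely).
loader = WikipediaLoader(
    query="World War II",
    load_max_docs=1,
    doc_content_chars_max=100_000,  # keep enough text to exceed the caching thresholds
)
docs = loader.load()
context = docs[0].page_content

print(f"Fetched {len(context):,} characters of context.")
```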
OpenAI
OpenAI Prompt Caching works automatically on all your API requests (no code changes required) and has no additional fees associated with it. This can reduce latency by up to 80% and costs by 50% for long prompts. Caching is available for prompts containing 1024 tokens or more.
Models Supporting Prompt Caching
| Model | Text input cost (cached) | Audio input cost (cached) |
| --- | --- | --- |
| gpt-4o (excludes gpt-4o-2024-05-13 and chatgpt-4o-latest) | 50% less | n/a |
| gpt-4o-mini | 50% less | n/a |
| gpt-4o-realtime-preview | 50% less | 80% less |
| o1-preview | 50% less | n/a |
| o1-mini | 50% less | n/a |
For a detailed reference, please check the link below: OpenAI Prompt Caching
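Because caching is automatic, no special parameters are needed: send a request whose prefix exceeds 1,024 tokens, then repeat it, and the second call should report cached prompt tokens. The sketch below reuses the context string from the Fetch Data step; the exact field names under usage_metadata are assumptions and can vary across langchain-openai versions.

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

messages = [
    ("system", f"Answer questions using only this context:\n{context}"),
    ("human", "Summarize the document in three sentences."),
]

# The first call writes the long prefix to the cache; the repeated call should read from it.
for attempt in range(2):
    response = llm.invoke(messages)
    usage = response.usage_metadata or {}
    cached = usage.get("input_token_details", {}).get("cache_read", 0)
    print(f"Call {attempt + 1}: input_tokens={usage.get('input_tokens')}, cached={cached}")
```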
Anthropic
Anthropic Prompt Caching requires the following minimum prompt lengths for caching:
1024 tokens for Claude 3.5 Sonnet and Claude 3 Opus
2048 tokens for Claude 3.5 Haiku and Claude 3 Haiku
[Note]
Shorter prompts cannot be cached, even if marked with cache_control. The cache has a 5-minute time to live (TTL). Currently, ephemeral is the only supported cache type, corresponding to this 5-minute lifetime.
Models Supporting Prompt Caching
Claude 3.5 Sonnet
Claude 3.5 Haiku
Claude 3 Haiku
Claude 3 Opus
While it has the drawback of requiring adherence to the Anthropic Message Style, a key advantage of Anthropic Prompt Caching is that it enables caching with fewer tokens.
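A minimal sketch of this message style is shown below: the long system block is marked with cache_control, and the cache-related token counts are read back from the response. It reuses the context string from the Fetch Data step; the model name and the exact usage field names are assumptions that may differ across langchain-anthropic versions.

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatAnthropic(model="claude-3-5-sonnet-latest")

messages = [
    SystemMessage(
        content=[
            {
                "type": "text",
                "text": f"Answer questions using only this context:\n{context}",
                # Mark the long block as cacheable; 'ephemeral' (5-minute TTL) is the only type.
                "cache_control": {"type": "ephemeral"},
            }
        ]
    ),
    HumanMessage(content="Summarize the document in three sentences."),
]

response = llm.invoke(messages)

# Cache-related token counts (field names assumed; older versions may need a beta header).
usage = response.response_metadata.get("usage", {})
print("cache_creation_input_tokens:", usage.get("cache_creation_input_tokens"))
print("cache_read_input_tokens:", usage.get("cache_read_input_tokens"))
```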
For a detailed reference, please check the link below: Anthropic Prompt Caching Documentation
GoogleAI
Google calls this feature Context Caching rather than Prompt Caching; it is primarily used for working with large inputs such as codebases, large document collections, long videos, and multiple audio files.
In this tutorial, we will demonstrate how to use caching in google.generativeai through ChatGoogleGenerativeAI from langchain_google_genai.
For more information, please refer to the following links:
Fetching Data For GoogleAI Context Caching
At least 32,768 tokens are required for Prompt Caching (which Google calls Context Caching). To keep things simple, we demonstrate its usage by including three lengthy Wikipedia documents.
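A minimal sketch under these constraints is shown below. It fetches three long Wikipedia articles, creates a CachedContent object through google.generativeai, and passes the cache name to ChatGoogleGenerativeAI. The article titles and model name are illustrative, and the cached_content parameter is assumed to be available in your installed version of langchain-google-genai.

```python
import datetime
import os

import google.generativeai as genai
from google.generativeai import caching
from langchain_community.document_loaders import WikipediaLoader
from langchain_google_genai import ChatGoogleGenerativeAI

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Fetch three lengthy articles so the combined context exceeds the 32,768-token minimum.
queries = ["World War II", "History of the Internet", "Artificial intelligence"]
texts = [
    WikipediaLoader(query=q, load_max_docs=1, doc_content_chars_max=200_000).load()[0].page_content
    for q in queries
]

# Create the cached context with a 5-minute TTL (a versioned model name is required for caching).
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    system_instruction="Answer questions using only the cached documents.",
    contents=["\n\n".join(texts)],
    ttl=datetime.timedelta(minutes=5),
)

# Point the LangChain chat model at the cache by name (cached_content support is assumed here).
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash-001", cached_content=cache.name)
response = llm.invoke("Summarize the cached documents in five sentences.")
print(response.content)
print(response.usage_metadata)
```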