Prompt Caching

Overview

Prompt caching is a powerful feature that optimizes API usage by enabling resumption from specific prefixes in your prompts. This method greatly reduces processing time and costs for repetitive tasks or prompts with consistent components.

Prompt caching is especially useful in the following situations:

  • Prompts with many examples

  • Large amounts of context or background information

  • Repetitive tasks with consistent instructions

  • Long multi-turn conversations

Table of Contents

  • Overview

  • Environment Setup

  • Fetch Data

  • OpenAI

  • Anthropic

  • GoogleAI

References

  • OpenAI Prompt Caching

  • Anthropic Prompt Caching

  • Gemini API Context Caching


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, along with useful functions and utilities for tutorials.

  • You can check out langchain-opentutorial for more details.
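
If you are not using the langchain-opentutorial helpers, a minimal sketch of the setup looks like this (assuming your API keys are stored in a local .env file):

```python
# Install the packages used in this tutorial (adjust to your environment):
# pip install -qU langchain-openai langchain-anthropic langchain-google-genai \
#   langchain-community python-dotenv wikipedia

from dotenv import load_dotenv

# Loads OPENAI_API_KEY, ANTHROPIC_API_KEY, and GOOGLE_API_KEY from a local .env file.
load_dotenv(override=True)
```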

Fetch Data

The easiest way to verify prompt caching is to include a large amount of context or background information in the prompt. To demonstrate this, we use a long document retrieved from Wikipedia, as sketched below.
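
A minimal sketch of this step is shown below. It uses WikipediaLoader from langchain_community (which requires the wikipedia package); the article title is only an example, and any sufficiently long text works just as well.

```python
from langchain_community.document_loaders import WikipediaLoader

# Load one long article; doc_content_chars_max is raised so the text is not truncated.
docs = WikipediaLoader(
    query="World War II",  # example title; use any sufficiently long article
    lang="en",
    load_max_docs=1,
    doc_content_chars_max=100_000,
).load()

long_context = docs[0].page_content
print(f"Loaded {len(long_context)} characters of context.")
```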

OpenAI

OpenAI Prompt Caching works automatically on all your API requests (no code changes required) and has no additional fees associated with it. This can reduce latency by up to 80% and costs by 50% for long prompts. Caching is available for prompts containing 1024 tokens or more.

Models Supporting Prompt Caching

| Model | Text Input Cost | Audio Input Cost |
| --- | --- | --- |
| gpt-4o (excludes gpt-4o-2024-05-13 and chatgpt-4o-latest) | 50% less | n/a |
| gpt-4o-mini | 50% less | n/a |
| gpt-4o-realtime-preview | 50% less | 80% less |
| o1-preview | 50% less | n/a |
| o1-mini | 50% less | n/a |

For detailed reference, please check the link below: OpenAI Prompt Caching
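
Because OpenAI caching is automatic, you can verify it by sending the same long prompt twice and inspecting the usage details of the second response. The sketch below assumes long_context holds the Wikipedia text fetched above and uses ChatOpenAI from langchain_openai; recent versions report cache hits in the response's usage metadata.

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

messages = [
    ("system", f"Answer questions using this document:\n{long_context}"),
    ("user", "Summarize the document in one paragraph."),
]

# The first call primes the cache; a second call shortly afterwards
# should reuse the cached prompt prefix automatically.
first = llm.invoke(messages)
second = llm.invoke(messages)

# Cache hits appear under input_token_details (e.g. {"cache_read": ...}).
print(second.usage_metadata)
```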

Anthropic

Anthropic Prompt Caching requires the following minimum prompt lengths for caching:

  • 1024 tokens for Claude 3.5 Sonnet and Claude 3 Opus

  • 2048 tokens for Claude 3.5 Haiku and Claude 3 Haiku

[Note]

  • Shorter prompts cannot be cached, even if marked with cache_control.

  • The cache has a 5-minute time to live (TTL). Currently, ephemeral is the only supported cache type, corresponding to this 5-minute lifetime.

Models Supporting Prompt Caching

  • Claude 3.5 Sonnet

  • Claude 3.5 Haiku

  • Claude 3 Haiku

  • Claude 3 Opus

While it has the drawback of requiring adherence to the Anthropic Message Style, a key advantage of Anthropic Prompt Caching is that it enables caching with fewer tokens.

For detailed reference, please check the link below: Anthropic Prompt Caching Documentation
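
The sketch below illustrates this message style with ChatAnthropic from langchain_anthropic. It assumes long_context holds the Wikipedia text fetched above; the cache_control entry marks the long system block as cacheable.

```python
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": f"Answer questions using this document:\n{long_context}",
                # Marks this block as cacheable; "ephemeral" is currently the only type.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": "Summarize the document in one paragraph."},
]

# The first call writes the cache; repeated calls within the 5-minute TTL read from it.
response = llm.invoke(messages)

# Cache activity appears under input_token_details (e.g. cache_creation / cache_read).
print(response.usage_metadata)
```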

GoogleAI

Google refers to this feature as Context Caching rather than Prompt Caching, and it is primarily used for analyzing large inputs such as codebases, large document collections, long videos, and multiple audio files.

In this tutorial, we demonstrate how to use caching in google.generativeai through ChatGoogleGenerativeAI from langchain_google_genai.

For more information, please refer to the Gemini API Context Caching documentation.

Fetching Data For GoogleAI Context Caching

At least 32,768 input tokens are required for Prompt Caching (which Google refers to as Context Caching). To meet this minimum in a simple way, we concatenate three lengthy Wikipedia documents and use them as the cached context, as sketched below.
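
The sketch below assumes long_context is the concatenation of those three documents (at least 32,768 tokens) and that GOOGLE_API_KEY is set in the environment. The cached_content parameter of ChatGoogleGenerativeAI is available in recent langchain_google_genai releases; treat this as an outline rather than a definitive implementation.

```python
import datetime
import os

import google.generativeai as genai
from google.generativeai import caching
from langchain_google_genai import ChatGoogleGenerativeAI

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Create the cached content once; it can be reused across requests until the TTL expires.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    display_name="wikipedia-context",  # example name
    system_instruction="Answer questions using the cached documents.",
    contents=[long_context],
    ttl=datetime.timedelta(minutes=10),
)

# Recent langchain_google_genai versions accept the cached content name directly.
# If yours does not, fall back to genai.GenerativeModel.from_cached_content(cached_content=cache).
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash-001",
    cached_content=cache.name,
)

response = llm.invoke("Summarize the cached documents in three sentences.")
print(response.usage_metadata)
```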
