LangChain OpenTutorial
  • 🦜️🔗 The LangChain Open Tutorial for Everyone
  • 01-Basic
    • Getting Started on Windows
    • 02-Getting-Started-Mac
    • OpenAI API Key Generation and Testing Guide
    • LangSmith Tracking Setup
    • Using the OpenAI API (GPT-4o Multimodal)
    • Basic Example: Prompt+Model+OutputParser
    • LCEL Interface
    • Runnable
  • 02-Prompt
    • Prompt Template
    • Few-Shot Templates
    • LangChain Hub
    • Personal Prompts for LangChain
    • Prompt Caching
  • 03-OutputParser
    • PydanticOutputParser
    • PydanticOutputParser
    • CommaSeparatedListOutputParser
    • Structured Output Parser
    • JsonOutputParser
    • PandasDataFrameOutputParser
    • DatetimeOutputParser
    • EnumOutputParser
    • Output Fixing Parser
  • 04-Model
    • Using Various LLM Models
    • Chat Models
    • Caching
    • Caching VLLM
    • Model Serialization
    • Check Token Usage
    • Google Generative AI
    • Huggingface Endpoints
    • HuggingFace Local
    • HuggingFace Pipeline
    • ChatOllama
    • GPT4ALL
    • Video Q&A LLM (Gemini)
  • 05-Memory
    • ConversationBufferMemory
    • ConversationBufferWindowMemory
    • ConversationTokenBufferMemory
    • ConversationEntityMemory
    • ConversationKGMemory
    • ConversationSummaryMemory
    • VectorStoreRetrieverMemory
    • LCEL (Remembering Conversation History): Adding Memory
    • Memory Using SQLite
    • Conversation With History
  • 06-DocumentLoader
    • Document & Document Loader
    • PDF Loader
    • WebBaseLoader
    • CSV Loader
    • Excel File Loading in LangChain
    • Microsoft Word(doc, docx) With Langchain
    • Microsoft PowerPoint
    • TXT Loader
    • JSON
    • Arxiv Loader
    • UpstageDocumentParseLoader
    • LlamaParse
    • HWP (Hangeul) Loader
  • 07-TextSplitter
    • Character Text Splitter
    • 02. RecursiveCharacterTextSplitter
    • Text Splitting Methods in NLP
    • TokenTextSplitter
    • SemanticChunker
    • Split code with Langchain
    • MarkdownHeaderTextSplitter
    • RecursiveJsonSplitter
  • 08-Embedding
    • OpenAI Embeddings
    • CacheBackedEmbeddings
    • HuggingFace Embeddings
    • Upstage
    • Ollama Embeddings With Langchain
    • LlamaCpp Embeddings With Langchain
    • GPT4ALL
    • Multimodal Embeddings With Langchain
  • 09-VectorStore
    • Pinecone
    • Pinecone
    • Vector Stores
    • Chroma-Multimodal
    • Chroma
    • Chroma With Langchain
    • Pinecone-Multimodal.ipynb
    • Pinecone
    • Qdrant
    • Elasticsearch
    • MongoDB Atlas
    • PGVector
    • Neo4j Vector Index
    • Weaviate
  • 10-Retriever
    • VectorStore-backed Retriever
    • Contextual Compression Retriever
    • Ensemble Retriever
    • Long Context Reorder
    • Parent Document Retriever
    • MultiQueryRetriever
    • MultiVectorRetriever
    • Self-querying
    • TimeWeightedVectorStoreRetriever
    • TimeWeightedVectorStoreRetriever
    • Kiwi BM25 Retriever
    • Ensemble Retriever with Convex Combination (CC)
  • 11-Reranker
    • Cross Encoder Reranker
    • JinaReranker
    • FlashRank Reranker
  • 12-RAG
    • Understanding the basic structure of RAG
    • RAG Basic WebBaseLoader
    • Exploring RAG in LangChain
    • Exploring RAG in LangChain
    • RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
    • Conversation-With-History
    • Translation
    • Multi Modal RAG
  • 13-LangChain-Expression-Language
    • RunnablePassthrough
    • Inspect Runnables
    • RunnableLambda
    • Routing
    • Runnable Parallel
    • Configure-Runtime-Chain-Components
    • Configure
    • Creating Runnable objects with chain decorator
    • RunnableWithMessageHistory
    • Generator
    • Binding
    • Fallbacks
    • RunnableRetry
    • WithListeners
    • How to stream runnables
  • 14-Chains
    • Summarization
    • SQL
    • Structured Output Chain
    • StructuredDataChat
  • 15-Agent
    • Tools
    • Bind Tools
    • Tool Calling Agent
    • Tool Calling Agent with More LLM Models
    • Iteration-human-in-the-loop
    • Agentic RAG
    • CSV/Excel Analysis Agent
    • Agent-with-Toolkits-File-Management
    • Make Report Using RAG, Web searching, Image generation Agent
    • TwoAgentDebateWithTools
    • React Agent
  • 16-Evaluations
    • Generate synthetic test dataset (with RAGAS)
    • Evaluation using RAGAS
    • HF-Upload
    • LangSmith-Dataset
    • LLM-as-Judge
    • Embedding-based Evaluator(embedding_distance)
    • LangSmith Custom LLM Evaluation
    • Heuristic Evaluation
    • Compare experiment evaluations
    • Summary Evaluators
    • Groundedness Evaluation
    • Pairwise Evaluation
    • LangSmith Repeat Evaluation
    • LangSmith Online Evaluation
    • LangFuse Online Evaluation
  • 17-LangGraph
    • 01-Core-Features
      • Understanding Common Python Syntax Used in LangGraph
      • Building a Basic Chatbot with LangGraph
      • Building an Agent with LangGraph
      • Agent with Memory
      • LangGraph Streaming Outputs
      • Human-in-the-loop
      • LangGraph Manual State Update
      • Asking Humans for Help: Customizing State in LangGraph
      • DeleteMessages
      • DeleteMessages
      • LangGraph ToolNode
      • LangGraph ToolNode
      • Branch Creation for Parallel Node Execution
      • Conversation Summaries with LangGraph
      • Conversation Summaries with LangGraph
      • LangGrpah Subgraph
      • How to transform the input and output of a subgraph
      • LangGraph Streaming Mode
      • Errors
      • A Long-Term Memory Agent
    • 02-Structures
      • LangGraph-Building-Graphs
      • Naive RAG
      • Add Groundedness Check
      • Adding a Web Search Module
      • LangGraph-Add-Query-Rewrite
      • Agentic RAG
      • Adaptive RAG
      • Multi-Agent Structures (1)
      • Multi Agent Structures (2)
    • 03-Use-Cases
      • LangGraph Agent Simulation
      • Meta Prompt Generator based on User Requirements
      • CRAG: Corrective RAG
      • Plan-and-Execute
      • Plan-and-Execute
      • Multi Agent Collaboration Network
      • Multi Agent Collaboration Network
      • Multi-Agent Supervisor
      • 08-LangGraph-Hierarchical-Multi-Agent-Teams
      • 08-Hierarchical-Multi-Agent-Teams
      • SQL-Agent
      • 10-LangGraph-Research-Assistant
      • LangGraph Code Assistant
      • Deploy on LangGraph Cloud
      • Tree of Thoughts (ToT)
      • Ollama Deep Researcher (Deepseek-R1)
      • Functional API
      • Reflection in LangGraph
  • 19-Cookbook
    • 01-SQL
      • TextToSQL
      • SpeechToSQL
    • 02-RecommendationSystem
      • ResumeRecommendationReview
    • 03-GraphDB
      • Movie QA System with Graph Database
      • 05-TitanicQASystem
      • Real-Time GraphRAG QA
    • 04-GraphRAG
      • Academic Search System
      • Academic QA System with GraphRAG
    • 05-AIMemoryManagementSystem
      • ConversationMemoryManagementSystem
    • 06-Multimodal
      • Multimodal RAG
      • Shopping QnA
    • 07-Agent
      • 14-MoARAG
      • 15-CoT-basedSmartWebSearch
      • 16-MultiAgentShoppingMallSystem
      • Agent-Based Dynamic Slot Filling
      • Code Debugging System
      • New Employee Onboarding Chatbot
      • 20-LangGraphStudio-MultiAgent
      • Multi-Agent Scheduler System
    • 08-Serving
      • FastAPI Serving
      • Sending Requests to Remote Graph Server
      • Building a Agent API with LangServe: Integrating Currency Exchange and Trip Planning
    • 08-SyntheticDataset
      • Synthetic Dataset Generation using RAG
    • 09-Monitoring
      • Langfuse Selfhosting
Powered by GitBook
On this page
  • Overview
  • Table of Contents
  • References
  • Environment Setup
  • InMemoryCache
  • SQLiteCache
  • Setup Local LLM with VLLM
  • Device & Serving information - Windows
  • InMemoryCache + Local VLLM
  • SQLite Cache + Local VLLM
  1. 04-Model

Caching VLLM

PreviousCachingNextModel Serialization

Last updated 1 month ago

  • Author:

  • Peer Review : ,

  • This is a part of

Overview

LangChain provides optional caching layer for LLMs.

This is useful for two reasons:

  • When requesting the same completions multiple times, it can reduce the number of API calls to the LLM provider and thus save costs.

  • By reduing the number of API calls to the LLM provider, it can improve the running time of the application.

But sometimes you need to deploy your own LLM service, like on-premise system where you cannot reach cloud services. In this tutorial, we will use vllm OpenAI compatible API and utilize two kinds of cache, InMemoryCache and SQLiteCache. At end of each section we will compare wall times between before and after caching.

Even though this is a tutorial for local LLM service case, we will remind you about how to use cache with OpenAI API service first.

Table of Contents

References


Environment Setup

[Note]

  • langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.

%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",
        "langchain_openai",
        # "vllm", # this is for optional section
    ],
    verbose=False,
    upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "You OpenAI API KEY",
        "LANGCHAIN_API_KEY": "LangChain API KEY",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Caching",
    }
)
Environment variables have been set successfully.
# Alternatively, one can set environmental variables with load_dotenv
from dotenv import load_dotenv


load_dotenv(override=True)
False
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

# Create model
llm = ChatOpenAI(model_name="gpt-4o-mini")

# Generate prompt
prompt = PromptTemplate.from_template(
    "Sumarize about the {country} in about 200 characters"
)

# Create chain
chain = prompt | llm
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response.content)
South Korea, located on the Korean Peninsula, is known for its rich culture, technological advancements, and vibrant economy. It features a mix of traditional heritage and modern innovation, highlighted by K-pop and cuisine.
    CPU times: total: 93.8 ms
    Wall time: 1.09 s

InMemoryCache

First, cache the answer to the same question using InMemoryCache.

from langchain_core.globals import set_llm_cache
from langchain_core.caches import InMemoryCache

# Set InMemoryCache
set_llm_cache(InMemoryCache())
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response.content)
South Korea is a technologically advanced country known for its fast-paced lifestyle, vibrant culture, and delicious cuisine. It is a leader in industries such as electronics, automotive, and entertainment. The country also has a rich history and beautiful landscapes, making it a popular destination for tourists.
    CPU times: total: 0 ns
    Wall time: 996 ms

Now we invoke the chain with the same question.

%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response.content)
South Korea is a technologically advanced country known for its fast-paced lifestyle, vibrant culture, and delicious cuisine. It is a leader in industries such as electronics, automotive, and entertainment. The country also has a rich history and beautiful landscapes, making it a popular destination for tourists.
    CPU times: total: 0 ns
    Wall time: 3 ms

Note that if we set InMemoryCache again, the cache will be lost and the wall time will increase.

set_llm_cache(InMemoryCache())
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response.content)
South Korea is a tech-savvy, modern country known for its vibrant culture, delicious cuisine, and booming economy. It is a highly developed nation with advanced infrastructure, high standards of living, and a strong emphasis on education. The country also has a rich history and is famous for its K-pop music and entertainment industry.
    CPU times: total: 0 ns
    Wall time: 972 ms

SQLiteCache

Now, we cache the answer to the same question by using SQLiteCache.

from langchain_community.cache import SQLiteCache
from langchain_core.globals import set_llm_cache
import os

# Create cache directory
if not os.path.exists("cache"):
    os.makedirs("cache")

# Set SQLiteCache
set_llm_cache(SQLiteCache(database_path="cache/llm_cache.db"))
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response.content)
South Korea is a technologically advanced country in East Asia, known for its booming economy, vibrant pop culture, and rich history. It is home to K-pop, Samsung, and delicious cuisine like kimchi. The country also faces tensions with North Korea and strives for reunification.
    CPU times: total: 31.2 ms
    Wall time: 953 ms

Now we invoke the chain with the same question.

%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response.content)
South Korea is a technologically advanced country in East Asia, known for its booming economy, vibrant pop culture, and rich history. It is home to K-pop, Samsung, and delicious cuisine like kimchi. The country also faces tensions with North Korea and strives for reunification.
    CPU times: total: 375 ms
    Wall time: 375 ms

Note that if we use SQLiteCache, setting caching again does not delete stored cache.

set_llm_cache(SQLiteCache(database_path="cache/llm_cache.db"))
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response.content)
South Korea is a technologically advanced country in East Asia, known for its booming economy, vibrant pop culture, and rich history. It is home to K-pop, Samsung, and delicious cuisine like kimchi. The country also faces tensions with North Korea and strives for reunification.
    CPU times: total: 0 ns
    Wall time: 4.01 ms

Setup Local LLM with VLLM

vLLM supports various cases, but for the most stable setup we utilize docker to serve local LLM model with vLLM.

Device & Serving information - Windows

  • CPU : AMD 5600X

  • OS : Windows 10 Pro

  • RAM : 32 Gb

  • GPU : Nividia 3080Ti, 12GB VRAM

  • CUDA : 12.6

  • Driver Version : 560.94

  • Docker Image : nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04

  • model : Qwen/Qwen2.5-0.5B-Instruct

  • Python version : 3.10

  • docker run script :

    docker run -itd --name vllm --gpus all --entrypoint /bin/bash -p 6001:8888 nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04
  • vllm serving script :

    python3 -m vllm.entrypoints.openai.api_server --model='Qwen/Qwen2.5-0.5B-Instruct' --served-model-name 'qwen-2.5' --port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.80 --max-model-len 4096 --swap-space 1 --dtype bfloat16 --tensor-parallel-size 1 
from langchain_community.llms import VLLMOpenAI

# create model using OpenAI compatible class VLLMOpenAI
llm = VLLMOpenAI(
    model="qwen-2.5", openai_api_key="EMPTY", openai_api_base="http://localhost:6001/v1"
)

# Generate prompt
prompt = PromptTemplate.from_template(
    "Sumarize about the {country} in about 200 characters"
)

# Create chain
chain = prompt | llm

InMemoryCache + Local VLLM

Same InMemoryCache section above, we set InMemoryCache.

from langchain_core.globals import set_llm_cache
from langchain_core.caches import InMemoryCache

# Set InMemoryCache
set_llm_cache(InMemoryCache())

Invoke chain with local LLM, do note that we print response not response.content

%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response)
.
    South Korea is a country in East Asia, with a population of approximately 55.2 million as of 2023. It borders North Korea to the east, Japan to the northeast, and China to the southeast. The country is known for its advanced technology, leading industries, and significant contributions to South Korean culture. It is often referred to as the "Globe and a Couple" due to its diverse landscapes, rich history, and frontiers with neighboring countries. South Korea's economy is growing, with a strong technological sector and a strong economy, making it a significant player on the global stage. Overall, South Korea is a significant global player, with a rich history, advanced technology, and a cultural influence. With its advanced technology and unique culture, South Korea is a fascinating country to explore. Its diverse landscapes, rich history, and remarkable economic performance have made it a popular destination for travelers. South Korea's contribution to the global economy and its strong technological sector have made it a significant player on the world stage. Its cultural influence and trade partnerships have created a unique culture that is hard to replicate elsewhere. South Korea's diverse landscapes, rich history, and technological advancements have made it a popular destination for travelers. Its cultural influence, trade partnerships, and
    CPU times: total: 15.6 ms
    Wall time: 1.03 s

Now we invoke chain again, with the same question.

%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response)
.
    South Korea is a country in East Asia, with a population of approximately 55.2 million as of 2023. It borders North Korea to the east, Japan to the northeast, and China to the southeast. The country is known for its advanced technology, leading industries, and significant contributions to South Korean culture. It is often referred to as the "Globe and a Couple" due to its diverse landscapes, rich history, and frontiers with neighboring countries. South Korea's economy is growing, with a strong technological sector and a strong economy, making it a significant player on the global stage. Overall, South Korea is a significant global player, with a rich history, advanced technology, and a cultural influence. With its advanced technology and unique culture, South Korea is a fascinating country to explore. Its diverse landscapes, rich history, and remarkable economic performance have made it a popular destination for travelers. South Korea's contribution to the global economy and its strong technological sector have made it a significant player on the world stage. Its cultural influence and trade partnerships have created a unique culture that is hard to replicate elsewhere. South Korea's diverse landscapes, rich history, and technological advancements have made it a popular destination for travelers. Its cultural influence, trade partnerships, and
    CPU times: total: 0 ns
    Wall time: 2.61 ms

SQLite Cache + Local VLLM

Same as SQLiteCache section above, set SQLiteCache. Note that we set db name to be vllm_cache.db to distinguish from the cache used in SQLiteCache section.

from langchain_community.cache import SQLiteCache
from langchain_core.globals import set_llm_cache
import os

# Create cache directory
if not os.path.exists("cache"):
    os.makedirs("cache")

# Set SQLiteCache
set_llm_cache(SQLiteCache(database_path="cache/vllm_cache.db"))

Invoke chain with local LLM, again, note that we print response not response.content.

%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response)
.
    
    South Korea, a nation that prides itself on its history, culture, and natural beauty. Known for its bustling cityscapes, scenic valleys, and delicious cuisine. A major player in South East Asia and a global hub for technology, fashion, and entertainment. Home to industries like electronics, automotive, and media. With a strong economy, South Korea is among the top economies in the world, known for its efficient and inclusive societies. A country that has been a significant player in global politics for decades. The country is also home to many influential figures like Kim Jong-un and Kim Jong-un, who have led North Korea and the country’s military. Known for its national sports, including football (soccer), baseball, and gymnastics. South Korea is also home to many museums, art galleries, and historical sites, showcasing the country’s rich cultural heritage. The country is a leader in technology, with many leading companies based in the South Korean capital, Seoul. The South Korean economy, despite global challenges, continues to be resilient and strong, with an average annual growth rate of 2.5%. The country has a diverse population and is known for its high standard of living, which is a source of pride for many South Koreans. With a strong tradition of education
    CPU times: total: 0 ns
    Wall time: 920 ms

Now we invoke chain again, with the same question.

%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response)
.
    
    South Korea, a nation that prides itself on its history, culture, and natural beauty. Known for its bustling cityscapes, scenic valleys, and delicious cuisine. A major player in South East Asia and a global hub for technology, fashion, and entertainment. Home to industries like electronics, automotive, and media. With a strong economy, South Korea is among the top economies in the world, known for its efficient and inclusive societies. A country that has been a significant player in global politics for decades. The country is also home to many influential figures like Kim Jong-un and Kim Jong-un, who have led North Korea and the country’s military. Known for its national sports, including football (soccer), baseball, and gymnastics. South Korea is also home to many museums, art galleries, and historical sites, showcasing the country’s rich cultural heritage. The country is a leader in technology, with many leading companies based in the South Korean capital, Seoul. The South Korean economy, despite global challenges, continues to be resilient and strong, with an average annual growth rate of 2.5%. The country has a diverse population and is known for its high standard of living, which is a source of pride for many South Koreans. With a strong tradition of education
    CPU times: total: 0 ns
    Wall time: 3 ms

Set up the environment. You may refer to for more details.

You can checkout the for more details.

SQLIteCache
InMemoryCache
vLLM
Environment Setup
langchain-opentutorial
Joseph
Teddy Lee
BAEM1N
LangChain Open Tutorial
Overview
Environement Setup
InMemoryCache
SQlite Cache
Setup Local LLM with VLLM
InMemoryCache + Local VLLM
SQLite Cache + Local VLLM