Caching VLLM
Author: Joseph
Proofread: Two-Jay
This is a part of the LangChain Open Tutorial.
Overview
LangChain provides an optional caching layer for LLMs.
This is useful for two reasons:
When requesting the same completions multiple times, it can reduce the number of API calls to the LLM provider and thus save costs.
By reducing the number of API calls to the LLM provider, it can improve the running time of the application.
But sometimes you need to deploy your own LLM service, such as an on-premise system where you cannot reach cloud services.
In this tutorial, we will use the vLLM OpenAI-compatible API and utilize two kinds of cache, InMemoryCache and SQLiteCache.
At the end of each section, we will compare wall times before and after caching.
Even though this tutorial is for the local LLM service case, we will first remind you how to use a cache with the OpenAI API service.
Table of Contents
References
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
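A minimal setup sketch (the exact package list and the use of an OpenAI key here are assumptions, not taken from the original):

```python
# Install the packages used in this tutorial (package list is an assumption).
# %pip install -qU langchain-core langchain-openai langchain-community

import os
from getpass import getpass

# The OpenAI key is only needed for the cloud-based caching examples below;
# the local vLLM sections do not require it.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OPENAI_API_KEY: ")
```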
InMemoryCache
First, cache the answer to the same question using InMemoryCache.
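A minimal sketch of this step, assuming a Jupyter environment (for the %time magic) and an illustrative model name and prompt:

```python
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Register an in-memory cache for every LLM call in this process.
set_llm_cache(InMemoryCache())

# Model name and prompt are illustrative assumptions, not from the original.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = PromptTemplate.from_template("Describe {topic} in one short paragraph.")
chain = prompt | llm

# First call: goes to the OpenAI API and stores the answer in the cache.
%time response = chain.invoke({"topic": "caching in LangChain"})
print(response.content)
```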
Now we invoke the chain with the same question.
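Continuing from the cell above:

```python
# Same input as before: the answer comes from InMemoryCache,
# so no API call is made and the wall time drops sharply.
%time response = chain.invoke({"topic": "caching in LangChain"})
print(response.content)
```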
Note that if we set InMemoryCache again, the cache will be lost and the wall time will increase.
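A quick way to see this, continuing from the cells above:

```python
# Registering a fresh InMemoryCache discards the previous in-memory entries,
# so the identical call below is a cache miss and hits the API again.
set_llm_cache(InMemoryCache())
%time response = chain.invoke({"topic": "caching in LangChain"})
```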
SQLiteCache
Now, we cache the answer to the same question by using SQLiteCache.
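A minimal sketch, reusing the chain from the previous section; the database path is an assumption:

```python
from langchain_community.cache import SQLiteCache
from langchain_core.globals import set_llm_cache

# Persist cached responses in a SQLite file instead of process memory.
set_llm_cache(SQLiteCache(database_path="llm_cache.db"))

# First call with the SQLite cache: goes to the API and writes the answer to disk.
%time response = chain.invoke({"topic": "caching in LangChain"})
print(response.content)
```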
Now we invoke the chain with the same question.
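Continuing from the cell above:

```python
# Same input again: answered from the SQLite cache on disk.
%time response = chain.invoke({"topic": "caching in LangChain"})
print(response.content)
```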
Note that if we use SQLiteCache, setting up caching again does not delete the stored cache.
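Because the cache lives in a file, re-registering it keeps the stored entries (continuing from the cells above):

```python
# Re-registering SQLiteCache with the same database file keeps the stored
# entries, so the identical call below is still a cache hit.
set_llm_cache(SQLiteCache(database_path="llm_cache.db"))
%time response = chain.invoke({"topic": "caching in LangChain"})
```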
Setup Local LLM with VLLM
vLLM supports various setups, but for the most stable setup we utilize Docker to serve a local LLM model with vLLM.
Device & Serving information - Windows
CPU : AMD 5600X
OS : Windows 10 Pro
RAM : 32 GB
GPU : Nvidia 3080 Ti, 12GB VRAM
CUDA : 12.6
Driver Version : 560.94
Docker Image : nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04
model : Qwen/Qwen2.5-0.5B-Instruct
Python version : 3.10
docker run script :
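A minimal sketch of such a command; the exact flags, ports, and volume mounts are assumptions about this setup, not the original script:

```bash
# Start the CUDA development container with GPU access and expose port 8000
# for the vLLM OpenAI-compatible server. The Hugging Face cache mount is
# optional and only avoids re-downloading the model.
docker run --gpus all -it --rm \
  -p 8000:8000 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04 /bin/bash
```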
vLLM serving script :
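Inside the container, a sketch of serving the model through vLLM's OpenAI-compatible API; the installation steps and the listed flags are assumptions:

```bash
# Install vLLM (the CUDA image ships no Python packages; setting up a
# suitable Python 3.10 environment first is assumed).
pip install vllm

# Serve Qwen/Qwen2.5-0.5B-Instruct through the OpenAI-compatible API on port 8000.
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 4096
```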
InMemoryCache + Local VLLM
Same as in the InMemoryCache section above, we set InMemoryCache.
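A minimal sketch; the base URL and dummy API key are assumptions about the local server started above:

```python
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Cache responses from the local model in memory, just as before.
set_llm_cache(InMemoryCache())

# ChatOpenAI works with any OpenAI-compatible endpoint, so we simply point it
# at the local vLLM server. The API key is a dummy value; vLLM ignores it
# unless the server was started with --api-key.
llm = ChatOpenAI(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)
prompt = PromptTemplate.from_template("Describe {topic} in one short paragraph.")
chain = prompt | llm
```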
Invoke the chain with the local LLM. Note that we print response, not response.content.
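Continuing from the cell above:

```python
# First call to the local model: served by vLLM and stored in the cache.
# We print the full response object here, not response.content.
%time response = chain.invoke({"topic": "vLLM"})
print(response)
```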
Now we invoke the chain again with the same question.
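For example:

```python
# Same input again: answered from InMemoryCache without touching the vLLM server.
%time response = chain.invoke({"topic": "vLLM"})
print(response)
```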
SQLite Cache + Local VLLM
Same as in the SQLiteCache section above, we set SQLiteCache.
Note that we set the database name to vllm_cache.db to distinguish it from the cache used in the SQLiteCache section.
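A minimal sketch, reusing the local chain defined above:

```python
from langchain_community.cache import SQLiteCache
from langchain_core.globals import set_llm_cache

# Use a separate database file for the local vLLM cache.
set_llm_cache(SQLiteCache(database_path="vllm_cache.db"))
```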
Invoke the chain with the local LLM. Again, note that we print response, not response.content.
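Continuing from the cells above:

```python
# First call with the SQLite cache: served by vLLM and written to vllm_cache.db.
# Again, we print the full response object, not response.content.
%time response = chain.invoke({"topic": "vLLM"})
print(response)
```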
Now we invoke the chain again with the same question.
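For example:

```python
# Same input again: answered from vllm_cache.db, so the wall time drops
# compared to the first call above.
%time response = chain.invoke({"topic": "vLLM"})
print(response)
```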