Caching VLLM


Overview

LangChain provides an optional caching layer for LLMs.

This is useful for two reasons:

  • When you request the same completion multiple times, it can reduce the number of API calls to the LLM provider and thus save costs.

  • By reducing the number of API calls to the LLM provider, it can improve the running time of the application.

But sometimes you need to deploy your own LLM service, for example an on-premise system that cannot reach cloud services. In this tutorial, we use vLLM's OpenAI-compatible API and two kinds of cache, InMemoryCache and SQLiteCache. At the end of each section we compare the wall times before and after caching.

Even though this tutorial focuses on the local LLM service case, we will first review how to use a cache with the OpenAI API service.

Table of Contents

  • Overview

  • Environment Setup

  • InMemoryCache

  • SQLiteCache

  • Setup Local LLM with VLLM

  • InMemoryCache + Local VLLM

  • SQLite Cache + Local VLLM


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, along with useful functions and utilities for these tutorials.

  • You can check out langchain-opentutorial for more details.
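
The original setup cells are not shown here, so below is a minimal sketch, assuming the packages are installed with pip and that the OpenAI API key is provided via the OPENAI_API_KEY environment variable (the package list is an assumption based on what this tutorial uses).

```python
# Minimal environment setup sketch (package list is an assumption):
# %pip install -qU langchain-opentutorial langchain-core langchain-openai langchain-community

import os
from getpass import getpass

# Provide the OpenAI API key for the cloud-API sections of this tutorial.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OPENAI_API_KEY: ")
```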

InMemoryCache

First, cache the answer to the same question using InMemoryCache.
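
The original code cells are omitted, so the following is a minimal sketch, assuming a ChatOpenAI model; the gpt-4o-mini model name and the example question are placeholders, not from the original notebook.

```python
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Register an in-memory cache globally; every LLM call is checked against it.
set_llm_cache(InMemoryCache())

llm = ChatOpenAI(model="gpt-4o-mini")
prompt = PromptTemplate.from_template(
    "Summarize the country {country} in about 200 characters."
)
chain = prompt | llm

# First call: nothing is cached yet, so the request goes to the OpenAI API.
response = chain.invoke({"country": "South Korea"})
print(response.content)
```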

Now we invoke the chain with the same question.
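
A rough timing sketch for the repeated call (in a notebook you could simply use the %%time cell magic instead):

```python
import time

start = time.perf_counter()
# Identical prompt and model parameters -> cache hit, no API round trip.
response = chain.invoke({"country": "South Korea"})
print(response.content)
print(f"wall time: {time.perf_counter() - start:.4f}s")
```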

Note that if we set InMemoryCache again, the cache will be lost and the wall time will increase.
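
For example, registering a fresh InMemoryCache discards everything stored so far:

```python
# A new InMemoryCache instance starts empty, so the next call misses the cache
# and goes back to the API, which is why the wall time increases again.
set_llm_cache(InMemoryCache())
response = chain.invoke({"country": "South Korea"})
```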

SQLiteCache

Now, we cache the answer to the same question by using SQLiteCache.
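
A sketch of the same pattern with SQLiteCache (the llm_cache.db file name is an assumption; any path works):

```python
from langchain_community.cache import SQLiteCache

# Persist cached responses in a local SQLite database file.
set_llm_cache(SQLiteCache(database_path="llm_cache.db"))

# First call with this cache: fetched from the API and written to disk.
response = chain.invoke({"country": "South Korea"})
print(response.content)
```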

Now we invoke the chain with the same question.
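
The repeated call should now be served from the database:

```python
# Same inputs -> the answer is read back from llm_cache.db instead of the API.
response = chain.invoke({"country": "South Korea"})
print(response.content)
```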

Note that with SQLiteCache, setting the cache again does not delete the stored cache.
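
Because the entries live in the database file rather than in process memory, re-registering the cache keeps them:

```python
# Pointing set_llm_cache at the same database file again does not wipe it,
# so this call is still a cache hit.
set_llm_cache(SQLiteCache(database_path="llm_cache.db"))
response = chain.invoke({"country": "South Korea"})
```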

Setup Local LLM with VLLM

vLLM supports various deployment options, but for the most stable setup we use Docker to serve a local LLM model with vLLM.

Device & Serving information - Windows

  • CPU : AMD 5600X

  • OS : Windows 10 Pro

  • RAM : 32 GB

  • GPU : Nvidia 3080Ti, 12GB VRAM

  • CUDA : 12.6

  • Driver Version : 560.94

  • Docker Image : nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04

  • Model : Qwen/Qwen2.5-0.5B-Instruct

  • Python version : 3.10

  • docker run script : see the sketch after this list

  • vllm serving script : see the sketch after this list
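
The author's exact scripts are not reproduced here; the commands below are only a sketch under the assumptions above (image tag, port, and vLLM flags are illustrative, not the original setup):

```bash
# Sketch: start a CUDA container and expose the OpenAI-compatible port.
docker run --gpus all -it --rm -p 8000:8000 \
    nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04 /bin/bash

# Inside the container: install vLLM and serve the model.
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --host 0.0.0.0 --port 8000 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096
```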

InMemoryCache + Local VLLM

Same as in the InMemoryCache section above, we set InMemoryCache.
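
Reset to a fresh in-memory cache so the earlier sections do not affect the timings below:

```python
# Start from an empty cache before timing the local model.
set_llm_cache(InMemoryCache())
```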

Invoke the chain with the local LLM. Do note that we print response, not response.content.
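
A sketch assuming the vLLM server from the previous section is reachable at http://localhost:8000/v1 and that the VLLMOpenAI completion wrapper is used; because it is a plain completion LLM, invoke() returns a string, which is why we print response itself:

```python
from langchain_community.llms import VLLMOpenAI

# Point the OpenAI-compatible client at the local vLLM server.
local_llm = VLLMOpenAI(
    openai_api_key="EMPTY",                      # vLLM does not check the key
    openai_api_base="http://localhost:8000/v1",  # server started in the docker setup
    model_name="Qwen/Qwen2.5-0.5B-Instruct",
)
local_chain = prompt | local_llm  # reuses the prompt defined earlier

# First call: not cached yet, so the local model generates the answer.
response = local_chain.invoke({"country": "South Korea"})
print(response)  # a plain string, not a message object
```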

Now we invoke chain again, with the same question.
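
Timing the repeated call:

```python
import time

start = time.perf_counter()
# Same inputs -> served from InMemoryCache without touching the vLLM server.
response = local_chain.invoke({"country": "South Korea"})
print(response)
print(f"wall time: {time.perf_counter() - start:.4f}s")
```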

SQLite Cache + Local VLLM

Same as in the SQLiteCache section above, we set SQLiteCache. Note that we name the database vllm_cache.db to distinguish it from the cache used in the SQLiteCache section.
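
A sketch of this step:

```python
# Use a separate database file for the local-model cache.
set_llm_cache(SQLiteCache(database_path="vllm_cache.db"))
```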

Invoke the chain with the local LLM. Again, note that we print response, not response.content.
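
First call with the new database:

```python
# Generated by the local model and written to vllm_cache.db.
response = local_chain.invoke({"country": "South Korea"})
print(response)
```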

Now we invoke chain again, with the same question.
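
The repeated call should be a cache hit:

```python
# Read back from vllm_cache.db, so the wall time drops sharply.
response = local_chain.invoke({"country": "South Korea"})
print(response)
```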
