This tutorial demonstrates how to backtest and compare model evaluations using LangSmith, focusing on comparing RAG system performance between a GPT-4o-mini model and an Ollama model.
Through practical examples, you'll learn how to create evaluation datasets from production data, implement evaluation metrics, and analyze results using LangSmith's comparison features.
Setting up your environment is the first step. See the Environment Setup guide for more details.
[Note]
The langchain-opentutorial package provides easy-to-use environment setup, useful functions, and utilities for these tutorials. Check out langchain-opentutorial for more details.
You can set API keys in a .env file or set them manually.
[Note] If you’re not using the .env file, no worries! Just enter the keys directly in the cell below, and you’re good to go.
from dotenv import load_dotenv
from langchain_opentutorial import set_env

# Attempt to load environment variables from a .env file; if unsuccessful, set them manually.
if not load_dotenv():
    set_env(
        {
            "OPENAI_API_KEY": "",
            "LANGCHAIN_API_KEY": "",
            "LANGCHAIN_TRACING_V2": "true",
            "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
            "LANGCHAIN_PROJECT": "09-CompareEvaluation",
        }
    )
Backtesting with LangSmith
Backtesting involves assessing new versions of your application using historical data and comparing the new outputs to the original ones.
Compared to evaluations using pre-production datasets, backtesting offers a clearer indication of whether the new version of your application is an improvement over the current deployment.
Here are the basic steps for backtesting:
Select sample runs from your production tracing project to test against.
Transform the run inputs into a dataset and record the run outputs as an initial experiment against that dataset (a code sketch for these first two steps follows this list).
Execute your new system on that dataset and compare the results of the two experiments.
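The first two steps can be scripted with the LangSmith client. Below is a minimal sketch that selects runs and builds a dataset from their inputs and outputs; the project name, dataset name, and run filters are assumptions to adapt to your own project.

from langsmith import Client

client = Client()

# Step 1: select sample runs from the production tracing project.
# (project name and filters are assumptions; adjust them to your own setup)
runs = list(
    client.list_runs(
        project_name="09-CompareEvaluation",
        is_root=True,
        error=False,
    )
)

# Step 2: transform the run inputs into a dataset,
# storing the original outputs as reference outputs.
dataset = client.create_dataset(dataset_name="RAG_EVAL_DATASET")
client.create_examples(
    inputs=[run.inputs for run in runs],
    outputs=[run.outputs for run in runs],
    dataset_id=dataset.id,
)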
You can easily compare the results of your experiments by utilizing the Compare feature provided by LangSmith.
Define Functions for RAG Performance Testing
Let's create a RAG system to utilize for testing.
from myrag import PDFRAG
from langchain_openai import ChatOpenAI


# Function to answer questions
def ask_question_with_llm(llm):
    # Create PDFRAG object
    rag = PDFRAG(
        "data/Newwhitepaper_Agents2.pdf",
        llm,
    )

    # Create retriever
    retriever = rag.create_retriever()

    # Create chain
    rag_chain = rag.create_chain(retriever)

    def _ask_question(inputs: dict):
        context = retriever.invoke(inputs["question"])
        context = "\n".join([doc.page_content for doc in context])
        return {
            "question": inputs["question"],
            "context": context,
            "answer": rag_chain.invoke(inputs["question"]),
        }

    return _ask_question
This tutorial uses the llama3.2:1b model. Please make sure you have Ollama installed and the model pulled.
For detailed information about Ollama, refer to the GitHub tutorial.
from langchain_ollama import ChatOllama

ollama = ChatOllama(model="llama3.2:1b")

# Call Ollama model
ollama.invoke("Hello?")
AIMessage(content='Hello. Is there something I can help you with or would you like to chat?', additional_kwargs={}, response_metadata={'model': 'llama3.2:1b', 'created_at': '2025-01-15T06:23:09.277041Z', 'done': True, 'done_reason': 'stop', 'total_duration': 1404527875, 'load_duration': 566634125, 'prompt_eval_count': 27, 'prompt_eval_duration': 707000000, 'eval_count': 18, 'eval_duration': 127000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-01a8d197-dc3a-4471-855a-36daac538e8b-0', usage_metadata={'input_tokens': 27, 'output_tokens': 18, 'total_tokens': 45})
Now create answer functions that use the GPT-4o-mini model and the Ollama model to generate answers to your questions, as sketched below.
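Both answer functions can be built from the ask_question_with_llm helper defined above. The snippet below is a sketch; the temperature setting for gpt-4o-mini is an assumption.

gpt_chain = ask_question_with_llm(ChatOpenAI(model="gpt-4o-mini", temperature=0))
ollama_chain = ask_question_with_llm(ChatOllama(model="llama3.2:1b"))

Each function can then be run as a separate experiment against the same dataset with LangSmith's evaluate. The following is a minimal sketch, assuming the RAG_EVAL_DATASET dataset from the backtesting sketch above and a built-in "qa" string evaluator; the dataset name, the evaluator choice, and the example field names are assumptions to adapt to your own data.

from langsmith.evaluation import evaluate, LangChainStringEvaluator


# Map run/example fields to what the "qa" evaluator expects.
# (assumption: dataset examples store the question under "question"
#  and the reference answer under "answer")
def prepare_data(run, example):
    return {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }


qa_evaluator = LangChainStringEvaluator("qa", prepare_data=prepare_data)

dataset_name = "RAG_EVAL_DATASET"  # assumption: replace with your dataset name

# Run one experiment per model against the same dataset.
for target, variant in [
    (gpt_chain, "gpt-4o-mini"),
    (ollama_chain, "llama3.2:1b"),
]:
    evaluate(
        target,
        data=dataset_name,
        evaluators=[qa_evaluator],
        experiment_prefix="MODEL_COMPARE_EVAL",
        metadata={"variant": variant},
    )

Each call prints a link to its experiment, like the two shown below.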
View the evaluation results for experiment: 'MODEL_COMPARE_EVAL-05b6496b' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=33fa8084-b82f-45ee-a3dd-c374caad16e0
0it [00:00, ?it/s]
View the evaluation results for experiment: 'MODEL_COMPARE_EVAL-c264adb7' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=f784a8c4-88ab-4a35-89a7-3aba5367f182
0it [00:00, ?it/s]
Comparing the results
Use the Compare view to inspect the results.
In the Experiments tab of the dataset, select the experiments you want to compare.