
Groundedness Evaluation


  • Author: Sungchul Kim

  • Peer Review: Park Jeong-Ki, BokyungisaGod

  • This is a part of LangChain-OpenTutorial

Overview

The Groundedness Evaluator is an evaluator that assesses whether an answer is grounded in (supported by) the given context. It can be used to detect hallucinations in a RAG system's responses. In this tutorial, we will look at how to evaluate groundedness using the Upstage Groundedness Checker (UpstageGroundednessCheck) and a custom-built groundedness checker.

Table of Contents

  • Overview
  • Environment Setup
  • Define a function for RAG performance testing
  • Set Groundedness Checkers
  • Evaluate Groundedness using Upstage's and Custom Groundedness Checker
  • Comprehensive evaluation of dataset using summary evaluators

References

  • LangSmith Evaluation
  • Upstage Groundedness Checker


Environment Setup

Setting up your environment is the first step. See the Environment Setup guide for more details.

[Note] The langchain-opentutorial is a package of easy-to-use environment setup guidance, useful functions, and utilities for tutorials. Check out the langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain_core",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
        "langchain_upstage",
        "pymupdf",
        "faiss-cpu",
    ],
    verbose=False,
    upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Groundedness-Evaluations",
    }
)
Environment variables have been set successfully.

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)
True

Define a function for RAG performance testing

Let's create a RAG system to use for testing.

from typing import Callable

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


def ask_question_with_llm(llm: ChatOpenAI, file_path: str) -> Callable[[dict], dict]:
    """Create a function that answers questions.
    
    Args:
        llm (ChatOpenAI): Language Model
        file_path (str): Path to the PDF file
    
    Returns:
        (Callable[[dict], dict]): A function that answers questions
    """
    
    # Load documents
    loader = PyMuPDFLoader(file_path)
    docs = loader.load()
    
    # Split given documents
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
    split_documents = text_splitter.split_documents(docs)

    # Create a retriever
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)
    retriever = vectorstore.as_retriever()

    # Create a chain
    prompt = PromptTemplate.from_template(
    """
    You are an assistant for question-answering tasks. 
    Use the following pieces of retrieved context to answer the question. 
    If you don't know the answer, just say that you don't know. 

    #Context: 
    {context}

    #Question:
    {question}

    #Answer:
    """
    )

    rag_chain = (
        {
            "context": retriever,
            "question": RunnablePassthrough(),
        }
        | prompt
        | llm
        | StrOutputParser()
    )

    def _ask_question(inputs: dict) -> dict:
        # Search for context related to the question
        context = retriever.invoke(inputs["question"])
        # Combine the retrieved documents into a single string
        context = "\n".join([doc.page_content for doc in context])
        # Return a dictionary containing the question, context, and answer
        return {
            "question": inputs["question"],
            "context": context,
            "answer": rag_chain.invoke(inputs["question"]),
        }

    return _ask_question

gpt_chain = ask_question_with_llm(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    file_path="data/Newwhitepaper_Agents2.pdf",
)
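
Before plugging the chain into an evaluator, you can sanity-check it directly. This is a minimal sketch; the question below is an arbitrary example and assumes the whitepaper PDF referenced above is available locally.

# Quick sanity check of the RAG function (illustrative question)
result = gpt_chain({"question": "What is an agent's orchestration layer?"})
print(result["answer"])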

Set Groundedness Checkers

To evaluate groundedness, we will use both UpstageGroundednessCheck and a custom Groundedness Checker.

Set UpstageGroundednessCheck

To use Upstage's Groundedness Checker (UpstageGroundednessCheck), you need to obtain an Upstage API key (see the Get API Key guide).
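
If the key is not already configured, you can set it as an environment variable before creating the checker. This is a minimal sketch; UPSTAGE_API_KEY is the variable the langchain-upstage integration reads, and the value shown is a placeholder.

import os

# Placeholder value; replace with your own Upstage API key or load it from .env
if "UPSTAGE_API_KEY" not in os.environ:
    os.environ["UPSTAGE_API_KEY"] = "YOUR_UPSTAGE_API_KEY"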

from langchain_upstage import UpstageGroundednessCheck

# Create Upstage Groundedness Checker
upstage_groundedness_check = UpstageGroundednessCheck()
# Run Groundedness Checker for evaluation
request_input = {
    "context": "Teddy's gender is male and he operates the Teddynote YouTube channel.",
    "answer": "Teddy is a male.",
}

response = upstage_groundedness_check.invoke(request_input)
print(response)
grounded
# Run Groundedness Checker for evaluation
request_input = {
    "context": "Teddy's gender is male and he operates the Teddynote YouTube channel.",
    "answer": "Teddy is a female.",
}

response = upstage_groundedness_check.invoke(request_input)
print(response)
notGrounded

Define the UpstageGroundednessCheck evaluator. It will be used in the evaluate function.

from langsmith.schemas import Run, Example
from langsmith.evaluation import evaluate


def upstage_groundedness_check_evaluator(run: Run, example: Example) -> dict:
    # Get generated answer and context
    answer = run.outputs.get("answer", "")
    context = run.outputs.get("context", "")

    # Check groundedness
    groundedness_score = upstage_groundedness_check.invoke(
        {"answer": answer, "context": context}
    )
    groundedness_score = groundedness_score == "grounded"

    return {"key": "groundedness_score", "score": int(groundedness_score)}

Set Custom Groundedness Checker

Create a custom Groundedness Checker using an OpenAI model. For this tutorial, we'll use the retrieval-answer target. If you want to use other targets ('question-answer' or 'question-retrieval'), change the description in GroundednessScore and the prompt template in GroundednessChecker.create() accordingly; a sketch of the question-answer variant appears after the evaluator definition below.

from langchain_core.runnables import RunnableSequence
from langchain_openai import ChatOpenAI
from langsmith.schemas import Example, Run
from pydantic import BaseModel, Field


class GroundednessScore(BaseModel):
    """Binary scores for relevance checks"""

    score: str = Field(
        description="relevant or not relevant. Answer 'yes' if the answer is relevant to the retrieved document else answer 'no'"
    )


class GroundednessChecker:
    """This class is used to evaluate the accuracy of a document.
    
    It returns 'yes' or 'no' as the evaluation result.

    Attributes:
        llm (ChatOpenAI): Language Model instance.
    """

    def __init__(self, llm: ChatOpenAI) -> None:
        self.llm = llm

    def create(self) -> RunnableSequence:
        """
        Create a chain for evaluating accuracy.

        Returns:
            Chain: A chain object that can evaluate accuracy
        """
        llm = self.llm.with_structured_output(GroundednessScore)

        # Select the prompt template
        template = """You are a grader assessing relevance of a retrieved document to a user question. \n 
            Here is the retrieved document: \n\n {context} \n\n
            Here is the answer: {answer} \n
            If the document contains keyword(s) or semantic meaning related to the user answer, grade it as relevant. \n
            
            Give a binary score 'yes' or 'no' to indicate whether the retrieved document is relevant to the answer.
        """
        input_vars = ["context", "answer"]

        # Create a prompt
        prompt = PromptTemplate(
            template=template,
            input_variables=input_vars,
        )

        # Create a chain
        chain = prompt | llm
        return chain

# Create a Groundedness Checker
custom_groundedness_check = GroundednessChecker(
    ChatOpenAI(model="gpt-4o-mini", temperature=0)
).create()
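
As a quick check, the custom checker can be invoked directly with the same toy context/answer pair used for the Upstage checker above; for the grounded pair the expected score is 'yes'. This sketch is added for illustration.

# Direct invocation of the custom checker (mirrors the Upstage example above)
request_input = {
    "context": "Teddy's gender is male and he operates the Teddynote YouTube channel.",
    "answer": "Teddy is a male.",
}
print(custom_groundedness_check.invoke(request_input).score)  # expected: 'yes'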

def custom_groundedness_check_evaluator(run: Run, example: Example) -> dict:
    # Get generated answer and context
    answer = run.outputs.get("answer", "")
    context = run.outputs.get("context", "")

    # Groundedness Check
    groundedness_score = custom_groundedness_check.invoke(
        {"answer": answer, "context": context}
    )
    groundedness_score = groundedness_score.score == "yes"

    return {"key": "groundedness_score", "score": int(groundedness_score)}

Evaluate Groundedness using Upstage's and Custom Groundedness Checker

Now evaluate groundedness using both the Upstage and the custom Groundedness Checker. Before running the evaluation, check whether the dataset created earlier exists. If you don't have the dataset, create one by referring to 04-LangSmith-Dataset. In this tutorial, we'll use a custom Q&A dataset based on the Google Whitepaper on AI Agents.

from langsmith.client import Client

client = Client()
datasets = client.list_datasets()

for dataset in datasets:
    print(dataset.name)
    for example in client.list_examples(dataset_name=dataset.name):
        print("inputs :", example.dict()["inputs"]["question"])
        print("outputs :", example.dict()["outputs"]["answer"])
        print("--------------------------------")
RAG_EVAL_DATASET
    inputs : What are the three targeted learnings to enhance model performance?
    outputs : The three targeted learning approaches to enhance model performance mentioned in the context are:
    
    1. In-context learning: This involves providing a generalized model with a prompt, tools, and few-shot examples at inference time, allowing it to learn "on the fly."
    
    2. Fine-tuning based learning: This method involves training a model using a larger dataset of specific examples prior to inference, helping the model understand when and how to apply certain tools.
    
    3. Using external memory: This includes examples like the 'Example Store' in Vertex AI extensions or data stores in RAG-based architecture, which help models access specific knowledge.
    --------------------------------
    inputs : What are the key functions of an agent's orchestration layer in achieving goals?
    outputs : The key functions of an agent's orchestration layer in achieving goals include structuring reasoning, planning, and decision-making, as well as guiding the agent's actions. It involves taking in information, performing internal reasoning, and generating informed decisions or responses. The orchestration layer can utilize various reasoning techniques such as ReAct, Chain-of-Thought, and Tree-of-Thoughts to facilitate these processes. Additionally, it can leverage language models and external tools to transition through states and execute complex tasks autonomously.
    --------------------------------
    inputs : List up the name of the authors
    outputs : The authors are Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic.
    --------------------------------
    inputs : What is Tree-of-thoughts?
    outputs : Tree-of-thoughts (ToT) is a prompt engineering framework that is well suited for exploration or strategic lookahead tasks. It generalizes over chain-of-thought prompting and allows the model to explore various thought chains that serve as intermediate steps for general problem solving with language models.
    --------------------------------
    inputs : What is the framework used for reasoning and planning in agent?
    outputs : The frameworks used for reasoning and planning in agents include ReAct, Chain-of-Thought, and Tree-of-Thoughts. These frameworks provide a structure for the orchestration layer to perform internal reasoning and generate informed decisions or responses.
    --------------------------------
    inputs : How do agents differ from standalone language models?
    outputs : Agents can use tools to access real-time data and perform actions, whereas models rely solely on their training data.
    Agents maintain session history for multi-turn reasoning, while models lack native session management.
    Agents incorporate cognitive architectures for advanced reasoning, such as ReAct or Chain-of-Thought.
    --------------------------------
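
If RAG_EVAL_DATASET does not appear in the listing above, the evaluation below will fail, so it can help to verify the dataset exists first. This is a small sketch using the client's has_dataset method.

# Fail fast if the evaluation dataset has not been created yet
if not client.has_dataset(dataset_name="RAG_EVAL_DATASET"):
    raise ValueError("RAG_EVAL_DATASET not found; create it first (see 04-LangSmith-Dataset).")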
from langsmith.evaluation import evaluate

dataset_name = "RAG_EVAL_DATASET"

# Run evaluation
experiment_results = evaluate(
    gpt_chain,
    data=dataset_name,
    evaluators=[upstage_groundedness_check_evaluator, custom_groundedness_check_evaluator],
    experiment_prefix="GROUNDEDNESS_EVAL",
    metadata={
        "variant": "Hallucination evaluation using Upstage's and Custom Groundedness Checker",
    },
)
View the evaluation results for experiment: 'GROUNDEDNESS_EVAL-929f91f7' at:
    https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=fd7bec12-fc0d-41d9-bf21-3db571a54f15
    
    
0it [00:00, ?it/s]
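
You can also pull the per-example results into a DataFrame for local inspection. This is a hedged sketch: to_pandas is available on the object returned by evaluate in recent langsmith releases, and the exact feedback column names may vary by version.

# Inspect the results locally; column names may differ across langsmith versions
df = experiment_results.to_pandas()
print(df.columns.tolist())
print(df.head())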

Comprehensive evaluation of dataset using summary evaluators

Summary evaluators are useful when you want to score the dataset as a whole. (The previous step scored each example individually.)

from typing import List
from langsmith.schemas import Example, Run


def upstage_groundedness_check_summary_evaluator(
    runs: List[Run], examples: List[Example]
) -> dict:
    def is_grounded(run: Run) -> bool:
        context = run.outputs["context"]
        answer = run.outputs["answer"]
        return (
            upstage_groundedness_check.invoke({"context": context, "answer": answer})
            == "grounded"
        )

    groundedness_scores = sum(1 for run in runs if is_grounded(run))
    return {"key": "groundedness_score", "score": groundedness_scores / len(runs)}


def custom_groundedness_check_summary_evaluator(
    runs: List[Run], examples: List[Example]
) -> dict:
    def is_grounded(run: Run) -> bool:
        context = run.outputs["context"]
        answer = run.outputs["answer"]
        return (
            custom_groundedness_check.invoke({"context": context, "answer": answer}).score
            == "yes"
        )

    groundedness_scores = sum(1 for run in runs if is_grounded(run))
    return {"key": "groundedness_score", "score": groundedness_scores / len(runs)}
from langsmith.evaluation import evaluate

# Run evaluation
experiment_result1 = evaluate(
    gpt_chain,
    data=dataset_name,
    summary_evaluators=[
        upstage_groundedness_check_summary_evaluator,
    ],
    experiment_prefix="GROUNDEDNESS_UPSTAGE_SUMMARY_EVAL",
    # Set experiment metadata
    metadata={
        "variant": "Hallucination evaluation using Upstage Groundedness Checker",
    },
)

# Run evaluation
experiment_result2 = evaluate(
    gpt_chain,
    data=dataset_name,
    summary_evaluators=[
        custom_groundedness_check_summary_evaluator,
    ],
    experiment_prefix="GROUNDEDNESS_CUSTOM_SUMMARY_EVAL",
    # Set experiment metadata
    metadata={
        "variant": "Hallucination evaluation using Custom Groundedness Checker",
    },
)
View the evaluation results for experiment: 'GROUNDEDNESS_UPSTAGE_SUMMARY_EVAL-6d673c3a' at:
    https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=67a47286-2224-48d3-94d6-e2bb87a47aef
    
    
0it [00:00, ?it/s]


View the evaluation results for experiment: 'GROUNDEDNESS_CUSTOM_SUMMARY_EVAL-af449083' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=72ae6cbd-e53b-4311-9c2a-654585da9c5b





0it [00:00, ?it/s]
