
LangSmith Custom LLM Evaluation



  • Author: HeeWung Song(Dan)

  • Design:

  • Peer Review:

  • This is a part of the LangChain Open Tutorial

Overview

LangSmith Custom LLM Evaluation is a customizable evaluation framework in LangChain that enables users to assess LLM application outputs based on their specific requirements.

  1. Custom Evaluation Logic:

    • Define your own evaluation criteria

    • Create specific scoring mechanisms

  2. Easy Integration:

    • Works with LangChain's RAG systems

    • Compatible with LangSmith for evaluation tracking

  3. Evaluation Methods:

    • Simple metric-based evaluation

    • Advanced LLM-based assessment

Table of Contents

  • Overview

  • Environment Setup

  • RAG System Setup

  • Basic Custom Evaluator

  • Custom LLM-as-Judge

References

  • LangChain: Get started with LangSmith

  • LangChain: How to define a custom evaluator

Environment Setup

Setting up your environment is the first step. See the Environment Setup guide for more details.

[Note]

  • The langchain-opentutorial is a package that provides easy-to-use environment setup guidance, useful functions, and utilities for these tutorials.

  • Check out the langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial pandas
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",
        "langchain_openai",
        "pymupdf",
        "faiss-cpu",
    ],
    verbose=False,
    upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "LangSmith-Custom-LLM-Evaluation",
    }
)

Alternatively, you can set and load OPENAI_API_KEY from a .env file.

[Note] This is only necessary if you haven't already set OPENAI_API_KEY in previous steps.

from dotenv import load_dotenv

load_dotenv(override=True)
True

RAG System Setup

We will build a basic RAG (Retrieval-Augmented Generation) system to test Custom Evaluators. This implementation creates a question-answering system based on PDF documents, which will serve as our foundation for evaluation purposes.

This RAG system will be used to evaluate answer quality and accuracy through Custom Evaluators in later sections.

RAG System Preparation

  1. Document Processing

    • load_documents(): Loads PDF documents using PyMuPDFLoader

    • split_documents(): Splits documents into appropriate sizes using RecursiveCharacterTextSplitter

  2. Vector Store Creation

    • create_vectorstore(): Creates vector DB using OpenAIEmbeddings and FAISS

    • create_retriever(): Generates a retriever based on the vector store

  3. QA Chain Configuration

    • create_chain(): Creates a chain that answers questions based on retrieved context

    • Includes prompt template for question-answering tasks

from myrag import PDFRAG
from langchain_openai import ChatOpenAI

# Create PDFRAG object
rag = PDFRAG(
    "data/Newwhitepaper_Agents2.pdf",
    ChatOpenAI(model="gpt-4o-mini", temperature=0),
)

# Create Retriever
retriever = rag.create_retriever()

# Create Chain
chain = rag.create_chain(retriever)

# Generate answer for question
chain.invoke("List up the name of the authors")
'The authors are Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic.'
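
The PDFRAG class used above comes from the tutorial's bundled myrag module, which is not shown on this page. For illustration only (this is not the tutorial's actual implementation), a helper exposing the methods listed under "RAG System Preparation" might look like the sketch below; the class name SimplePDFRAG, the prompt wording, and the chunking parameters are assumptions.

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


class SimplePDFRAG:
    """Illustrative stand-in for the tutorial's PDFRAG helper."""

    def __init__(self, file_path: str, llm):
        self.file_path = file_path
        self.llm = llm

    def load_documents(self):
        # Load the PDF with PyMuPDFLoader
        return PyMuPDFLoader(self.file_path).load()

    def split_documents(self, docs):
        # Split documents into appropriately sized chunks (sizes are assumptions)
        splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
        return splitter.split_documents(docs)

    def create_vectorstore(self):
        # Build a FAISS vector store from the chunks using OpenAI embeddings
        chunks = self.split_documents(self.load_documents())
        return FAISS.from_documents(chunks, OpenAIEmbeddings())

    def create_retriever(self):
        # Expose the vector store as a retriever
        return self.create_vectorstore().as_retriever()

    def create_chain(self, retriever):
        # Answer questions based only on the retrieved context
        prompt = ChatPromptTemplate.from_template(
            "Answer the question using only the context below.\n\n"
            "#Context:\n{context}\n\n#Question:\n{question}"
        )

        def format_docs(docs):
            return "\n\n".join(doc.page_content for doc in docs)

        return (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | self.llm
            | StrOutputParser()
        )

The real myrag implementation may differ; the import in the cell above is what the rest of this tutorial relies on.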

We'll create a function called ask_question that takes a dictionary inputs as a parameter and returns a dictionary with an answer key. This function will serve as our question-answering interface.

# Create function to answer question
def ask_question(inputs: dict):
    return {"answer": chain.invoke(inputs["question"])}

Basic Custom Evaluator

Let's explore the fundamental concepts of creating Custom Evaluators. Custom Evaluators are evaluation tools in LangChain's LangSmith evaluation system that users can define according to their specific requirements. LangSmith provides a comprehensive platform for monitoring, evaluating, and improving LLM applications.

Understanding Evaluator Arguments

Custom Evaluator functions can use the following arguments:

  • run (Run): The complete Run object generated by the application

  • example (Example): Dataset example containing inputs, outputs, and metadata

  • inputs (dict): Input dictionary for a single example from the dataset

  • outputs (dict): Output dictionary generated by the application for given inputs

  • reference_outputs (dict): Reference output dictionary associated with the example

In most cases, inputs, outputs, and reference_outputs are sufficient. The run and example objects are only needed when additional metadata is required.
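
For illustration (this is not part of the tutorial code), an evaluator that needs no extra metadata can declare only the arguments it uses. The sketch below assumes the dataset stores the reference answer under an "answer" key and that your langsmith version resolves evaluator arguments by the names listed above.

def exact_match_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    # Score 1 if the generated answer matches the reference answer exactly, else 0
    predicted = outputs.get("answer", "").strip().lower()
    reference = reference_outputs.get("answer", "").strip().lower()
    return {"key": "exact_match", "score": int(predicted == reference)}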

Understanding Output Types

Custom Evaluators can return results in the following formats:

  1. Dictionary Format (Recommended)

{"key": "metric_name", "score": value}

  2. Basic Types (Python)

    • int, float, bool: Continuous numerical metrics

    • str: Categorical metrics

  3. Multiple Metrics

[{"key": "metric1", "score": value1}, {"key": "metric2", "score": value2}]

Random Score Evaluator Example

Now, let's create a simple Custom Evaluator example. This evaluator will return a random score between 1 and 10, regardless of the answer content.

Random Score Evaluator Implementation

  • Takes Run and Example objects as input parameters

  • Returns a dictionary in the format: {"key": "random_score", "score": score}

Here's the basic implementation of a random score evaluator:

from langsmith.schemas import Run, Example
import random


def random_score_evaluator(run: Run, example: Example) -> dict:
    # Return random score
    score = random.randint(1, 10)
    return {"key": "random_score", "score": score}
from langsmith.evaluation import evaluate

# Set dataset name
dataset_name = "RAG_EVAL_DATASET"

# Run
experiment_results = evaluate(
    ask_question,
    data=dataset_name,
    evaluators=[random_score_evaluator],
    experiment_prefix="CUSTOM-EVAL",
    # Set experiment metadata
    metadata={
        "variant": "Random Score Evaluator",
    },
)
View the evaluation results for experiment: 'CUSTOM-EVAL-565330e1' at:
    https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=d0296986-a186-4dc6-a327-659c1e00169c
    
    
0it [00:00, ?it/s]
experiment_results.to_pandas()

|   | inputs.question | outputs.answer | error | reference.answer | feedback.random_score | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|
| 0 | What are the three targeted learnings to enhan... | The three targeted learnings to enhance model ... | None | The three targeted learning approaches to enha... | 4 | 3.112384 | 0e661de4-636b-425d-8f6e-0a52b8070576 | ae36f6a7-86a2-4f0a-89d2-8be9671ca3cb |
| 1 | What are the key functions of an agent's orche... | The key functions of an agent's orchestration ... | None | The key functions of an agent's orchestration ... | 6 | 4.077394 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 6c65f286-a103-4a60-b906-555fd405ea7e |
| 2 | List up the name of the authors | The authors are Julia Wiesinger, Patrick Marlo... | None | The authors are Julia Wiesinger, Patrick Marlo... | 7 | 1.172011 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | 429dad1e-f68c-4f67-ae36-cc2171c4c6a0 |
| 3 | What is Tree-of-thoughts? | Tree-of-thoughts (ToT) is a prompt engineering... | None | Tree-of-thoughts (ToT) is a prompt engineering... | 5 | 1.374912 | be18ec98-ab18-4f30-9205-e75f1cb70844 | be337bef-90b0-4b6a-b9ab-941562ab4b44 |
| 4 | What is the framework used for reasoning and p... | The frameworks used for reasoning and planning... | None | The frameworks used for reasoning and planning... | 7 | 1.821961 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | 9cff92b1-04e7-49f5-ab2a-85763468e6cb |
| 5 | How do agents differ from standalone language ... | Agents differ from standalone language models ... | None | Agents can use tools to access real-time data ... | 1 | 2.135424 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 3fbe6fa6-88bf-46de-bdfa-0f39eac18c78 |

Custom LLM-as-Judge

Now, we'll create an LLM chain to use as an evaluator.

First, let's define a function that returns context, answer, and question:

# Function to return RAG results with `context`, `answer`, and `question`
def context_answer_rag_answer(inputs: dict):
    # Get context from Vector Store Retriever
    context = retriever.invoke(inputs["question"])
    # Get answer from RAG Chain in PDFRAG
    answer = chain.invoke(inputs["question"])
    return {
        "context": "\n".join([doc.page_content for doc in context]),
        "answer": answer,
        "question": inputs["question"],
    }

Let's run our evaluation using LangSmith's evaluate function. We'll use our custom evaluator to assess the RAG system's performance across our test dataset.

We'll use the teddynote/context-answer-evaluator prompt template from LangChain Hub, which provides a structured evaluation framework for RAG systems.

The evaluator uses the following criteria:

  • Accuracy (0-10): How well the answer aligns with the context

  • Comprehensiveness (0-10): How complete and detailed the answer is

  • Context Precision (0-10): How effectively the context information is used

The final score is normalized to a 0-1 scale using the formula: Final Score = (Accuracy + Comprehensiveness + Context Precision) / 30

This evaluation framework helps us quantitatively assess the quality of our RAG system's responses.
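
For instance, if a response received 9 for accuracy, 8 for comprehensiveness, and 9 for context precision, the evaluator would return (9 + 8 + 9) / 30 = 26 / 30 ≈ 0.87.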

from langchain import hub

# Get evaluator Prompt
llm_evaluator_prompt = hub.pull("teddynote/context-answer-evaluator")
llm_evaluator_prompt.pretty_print()
    As an LLM evaluator (judge), please assess the LLM's response to the given question. Evaluate the response's accuracy, comprehensiveness, and context precision based on the provided context. After your evaluation, return only the numerical scores in the following format:
    Accuracy: [score]
    Comprehensiveness: [score]
    Context Precision: [score]
    Final: [normalized score]
    Grading rubric:
    
    Accuracy (0-10 points):
    Evaluate how well the answer aligns with the information provided in the given context.
    
    0 points: The answer is completely inaccurate or contradicts the provided context
    4 points: The answer partially aligns with the context but contains significant inaccuracies
    7 points: The answer mostly aligns with the context but has minor inaccuracies or omissions
    10 points: The answer fully aligns with the provided context and is completely accurate
    
    
    Comprehensiveness (0-10 points):
    
    0 points: The answer is completely inadequate or irrelevant
    3 points: The answer is accurate but too brief to fully address the question
    7 points: The answer covers main aspects but lacks detail or misses minor points
    10 points: The answer comprehensively covers all aspects of the question
    
    
    Context Precision (0-10 points):
    Evaluate how precisely the answer uses the information from the provided context.
    
    0 points: The answer doesn't use any information from the context or uses it entirely incorrectly
    4 points: The answer uses some information from the context but with significant misinterpretations
    7 points: The answer uses most of the relevant context information correctly but with minor misinterpretations
    10 points: The answer precisely and correctly uses all relevant information from the context
    
    
    Final Normalized Score:
    Calculate by summing the scores for accuracy, comprehensiveness, and context precision, then dividing by 30 to get a score between 0 and 1.
    Formula: (Accuracy + Comprehensiveness + Context Precision) / 30
    
    #Given question:
    {question}
    
    #LLM's response:
    {answer}
    
    #Provided context:
    {context}
    
    Please evaluate the LLM's response according to the criteria above. 
    
    In your output, include only the numerical scores for FINAL NORMALIZED SCORE without any additional explanation or reasoning.
    ex) 0.81
    
    #Final Normalized Score(Just the number):
    
    
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Create evaluator
custom_llm_evaluator = (
    llm_evaluator_prompt
    | ChatOpenAI(temperature=0.0, model="gpt-4o-mini")
    | StrOutputParser()
)

Let's evaluate our system using the previously created context_answer_rag_answer function. We'll pass the generated answer and context to our custom_llm_evaluator for assessment.

# Generate answer
output = context_answer_rag_answer(
    {"question": "What are the three targeted learnings to enhance model performance?"}
)

# Run evaluator
custom_llm_evaluator.invoke(output)
'0.87'

Let's define our custom_evaluator function.

  • run.outputs: Gets the answer, context, and question generated by the RAG chain

  • example.outputs: Holds the reference answer from our dataset (not used here, since this evaluator judges the answer against the retrieved context)

from langsmith.schemas import Run, Example


def custom_evaluator(run: Run, example: Example) -> dict:
    # Get the answer, context, and question generated by the RAG chain
    llm_answer = run.outputs.get("answer", "")
    context = run.outputs.get("context", "")
    question = run.outputs.get("question", "")

    # Return custom score from the LLM evaluator chain
    score = custom_llm_evaluator.invoke(
        {"question": question, "answer": llm_answer, "context": context}
    )
    return {"key": "custom_score", "score": float(score)}

Let's run our evaluation using LangSmith's evaluate function.

from langsmith.evaluation import evaluate

# Set dataset name
dataset_name = "RAG_EVAL_DATASET"

# Run
experiment_results = evaluate(
    context_answer_rag_answer,
    data=dataset_name,
    evaluators=[custom_evaluator],
    experiment_prefix="CUSTOM-LLM-EVAL",
    # Set experiment metadata
    metadata={
        "variant": "Evaluation using Custom LLM Evaluator",
    },
)
View the evaluation results for experiment: 'CUSTOM-LLM-EVAL-e33ee0a7' at:
    https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=156ad2c4-b8ec-4ada-b76c-44b09a527b50
    
    
0it [00:00, ?it/s]
experiment_results.to_pandas()

|   | inputs.question | outputs.context | outputs.answer | outputs.question | error | reference.answer | feedback.custom_score | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | The three targeted learnings to enhance model ... | What are the three targeted learnings to enhan... | None | The three targeted learning approaches to enha... | 0.87 | 3.603254 | 0e661de4-636b-425d-8f6e-0a52b8070576 | 85ddbfcb-8c49-4551-890a-f137d7b413b8 |
| 1 | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | The key functions of an agent's orchestration ... | What are the key functions of an agent's orche... | None | The key functions of an agent's orchestration ... | 0.93 | 4.028933 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 0b423bb6-c722-41af-ae6e-c193ebc3ff8a |
| 2 | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The authors are Julia Wiesinger, Patrick Marlo... | List up the name of the authors | None | The authors are Julia Wiesinger, Patrick Marlo... | 0.87 | 1.885114 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | 54e0987b-502f-48a7-877f-4b3d56bd82cf |
| 3 | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... | What is Tree-of-thoughts? | None | Tree-of-thoughts (ToT) is a prompt engineering... | 0.87 | 1.732563 | be18ec98-ab18-4f30-9205-e75f1cb70844 | f0b02411-b377-4eaa-821a-2108b8b4836f |
| 4 | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | The frameworks used for reasoning and planning... | What is the framework used for reasoning and p... | None | The frameworks used for reasoning and planning... | 0.83 | 2.651672 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | 38d34eb6-1ec5-44ea-a7d0-c7c98d46b0bc |
| 5 | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | Agents differ from standalone language models ... | How do agents differ from standalone language ... | None | Agents can use tools to access real-time data ... | 0.93 | 2.519094 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 49b26b38-e499-4c71-bdcb-eccfa44a1beb |
