Heuristic evaluation is a quick, simple method of inference for cases where insufficient time or information makes a fully reasoned judgment impractical.
This tutorial describes heuristic evaluation for text generation and retrieval-augmented generation (RAG) systems and walks through it with examples.
The main components covered include:
Word tokenization using NLTK's word_tokenize function.
Heuristic evaluation based on ROUGE, BLEU, METEOR, and SemScore.
ROUGE : Used to evaluate the quality of automatic summaries and machine translations.
BLEU : Mainly used for machine translation evaluation. Measures how similar the generated text is to the reference text.
METEOR : A metric developed to evaluate the quality of machine translation.
SemScore : Compares model outputs with gold standard responses using Semantic Textual Similarity (STS).
This guide is designed to help developers and researchers implement and understand these evaluation metrics for assessing the quality of text generation systems, particularly in the context of RAG applications.
You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.
[Note] This is not necessary if you've already set the required API keys in previous steps.
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)
True
Heuristic Evaluation Based on ROUGE, BLEU, METEOR, and SemScore
Heuristic evaluation is a reasoning method that can be used quickly and easily when perfect rational judgment is not possible due to insufficient time or information.
(Compared to using an LLM as a judge, it also saves time and cost.)
Function Definition for RAG Performance Testing
Let's create a RAG system for testing.
from myrag import PDFRAG
from langchain_openai import ChatOpenAI

# Create PDFRAG object
rag = PDFRAG(
    "data/Newwhitepaper_Agents2.pdf",
    ChatOpenAI(model="gpt-4o-mini", temperature=0),
)

# Create retriever
retriever = rag.create_retriever()

# Create chain
chain = rag.create_chain(retriever)

# Generate answer to question
chain.invoke("List up the name of the authors")
'The authors are Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic.'
Create a function named ask_question. It takes a dictionary named inputs as input and returns a dictionary containing an answer key.
# Create a function to answer questions
def ask_question(inputs: dict):
    return {"answer": chain.invoke(inputs["question"])}
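For example, the function can be called with a question dictionary (reusing the question from the previous cell); the returned dictionary wraps the chain's answer under the answer key.
# Example usage: pass the question as a dictionary
ask_question({"question": "List up the name of the authors"})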
Word Tokenization Using NLTK
Word tokenization is the process of splitting text into individual words or tokens. NLTK (Natural Language Toolkit) provides robust word tokenization through its word_tokenize function. The example below compares a basic whitespace split with NLTK tokenization:
from nltk.tokenize import word_tokenize
import nltk

# Download required NLTK data (run once)
nltk.download('punkt_tab', quiet=True)

sent1 = "Hello. Nice meet you! My name is Chloe~~."
sent2 = "Hello. My name is Chloe~^^. Nice meet you!!"

# Basic string split
print(sent1.split())
print(sent2.split())
print("===" * 20)

# NLTK word tokenization
print(word_tokenize(sent1))
print(word_tokenize(sent2))
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score
ROUGE measures the overlap between generated text and reference text. The analysis below compares "I met a cute dog while jogging in the park this morning." with "I saw a little cute cat while taking a walk in the park this morning."
ROUGE-1 Analysis
Compares individual words (unigrams).
Matching words : "I", "a", "cute", "while", "in", "the", "park", "this", "morning"
These words appear in both sentences, so they are reflected in the score.
ROUGE-2 Analysis
Compares sequences of consecutive word pairs.
Matching phrases : "in the", "the park", "park this", "this morning"
These two-word combinations appear in both sentences, so they are reflected in the score.
ROUGE-L Analysis
Finds the longest common subsequence while maintaining word order.
Longest common subsequence : "I a cute while in the park this morning"
These words appear in the same order in both sentences, so this sequence is reflected in the ROUGE-L score.
This example demonstrates how each ROUGE metric captures different aspects of similarity:
ROUGE-1 captures basic content overlap through individual word matches.
ROUGE-2 identifies common phrases and local word order.
ROUGE-L evaluates overall sentence structure while allowing for gaps between matched words.
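For intuition, the ROUGE-1 and ROUGE-L calculations can be sketched by hand. The snippet below is a minimal sketch (it ignores duplicate-token clipping and stemming, which the rouge_score library used next handles properly); its scores should closely match the library's output for these sentences.
from nltk.tokenize import word_tokenize

def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def f_measure(matches, ref_len, cand_len):
    # Harmonic mean of precision and recall
    if matches == 0:
        return 0.0
    precision, recall = matches / cand_len, matches / ref_len
    return 2 * precision * recall / (precision + recall)

ref = word_tokenize("I met a cute dog while jogging in the park this morning.".lower())
cand = word_tokenize("I saw a little cute cat while taking a walk in the park this morning.".lower())

# ROUGE-1: overlap of individual words
print("rouge1-like:", round(f_measure(len(set(ref) & set(cand)), len(ref), len(cand)), 5))

# ROUGE-L: longest common subsequence
print("rougeL-like:", round(f_measure(lcs_length(ref, cand), len(ref), len(cand)), 5))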
from rouge_score import rouge_scorer

sent1 = "I met a cute dog while jogging in the park this morning."
sent2 = "I saw a little cute cat while taking a walk in the park this morning."
sent3 = "I saw a little and cute dog on the park bench this morning."

# Define custom tokenizer class
class NLTKTokenizer:
    def tokenize(self, text):
        return word_tokenize(text.lower())

# Initialize RougeScorer with the NLTK tokenizer class
scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL"], use_stemmer=False, tokenizer=NLTKTokenizer()
)

# Compare first pair of sentences
print(
    f"[1] {sent1}\n[2] {sent2}\n"
    f"[rouge1] {scorer.score(sent1, sent2)['rouge1'].fmeasure:.5f}\n"
    f"[rouge2] {scorer.score(sent1, sent2)['rouge2'].fmeasure:.5f}\n"
    f"[rougeL] {scorer.score(sent1, sent2)['rougeL'].fmeasure:.5f}"
)
print("===" * 20)

# Compare second pair of sentences
print(
    f"[1] {sent1}\n[2] {sent3}\n"
    f"[rouge1] {scorer.score(sent1, sent3)['rouge1'].fmeasure:.5f}\n"
    f"[rouge2] {scorer.score(sent1, sent3)['rouge2'].fmeasure:.5f}\n"
    f"[rougeL] {scorer.score(sent1, sent3)['rougeL'].fmeasure:.5f}"
)
[1] I met a cute dog while jogging in the park this morning.
[2] I saw a little cute cat while taking a walk in the park this morning.
[rouge1] 0.68966
[rouge2] 0.37037
[rougeL] 0.68966
============================================================
[1] I met a cute dog while jogging in the park this morning.
[2] I saw a little and cute dog on the park bench this morning.
[rouge1] 0.66667
[rouge2] 0.32000
[rougeL] 0.66667
BLEU (Bilingual Evaluation Understudy) Score
BLEU is a metric used to evaluate text generation quality, particularly in machine translation. It measures the similarity between generated text and reference text by comparing the overlap of word sequences (n-grams).
Key Features of BLEU
N-gram Precision
BLEU calculates precision from 1-gram (individual words) to 4-gram (sequences of 4 words)
The precision measures how many n-grams in the generated text match those in the reference text
2-gram (Bigram) Analysis
Matching pairs: "in the", "the park", "this morning"
Precision : 3 / 14 ≈ 0.2143
3-gram (Trigram) Analysis
Matching sequences: "in the park", "the park this", "park this morning"
Precision : 3 / 13 ≈ 0.2308
4-gram Analysis
Matching sequences: "in the park this", "the park this morning"
Precision : 2 / 12 ≈ 0.1667
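The clipped n-gram precision described above can be sketched with a small helper; this is a minimal sketch, and the counts in the worked example are illustrative (exact values depend on how the sentences are tokenized).
from collections import Counter
from nltk.util import ngrams

def clipped_ngram_precision(reference_tokens, candidate_tokens, n):
    ref_counts = Counter(ngrams(reference_tokens, n))
    cand_counts = Counter(ngrams(candidate_tokens, n))
    # Each candidate n-gram is credited at most as often as it appears in the reference
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / max(len(candidate_tokens) - n + 1, 1)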
Brevity Penalty
Since the lengths of the two sentences are similar, there is no penalty (1.0).
Final BLEU Score
Geometric mean of the 1-gram to 4-gram precisions (0.5333, 0.2143, 0.2308, 0.1667) × brevity penalty 1.0
The final BLEU score is approximately 0.2575, or about 25.75%.
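As a quick arithmetic check, the final score can be reproduced from the rounded precision values above:
import math

precisions = [0.5333, 0.2143, 0.2308, 0.1667]  # 1-gram through 4-gram precisions from this example
brevity_penalty = 1.0  # sentences are of similar length, so no penalty
bleu = brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / len(precisions))
print(f"{bleu:.4f}")  # approximately 0.2575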
Limitations
Only checks for simple string matches without considering meaning.
Does not distinguish the importance of words.
BLEU scores range from 0 to 1, with scores closer to 1 indicating higher quality. However, achieving a perfect score of 1 is very difficult in practice.
from nltk.translate.bleu_score import sentence_bleu

sent1 = "I met a cute dog while jogging in the park this morning."
sent2 = "I saw a little cute cat while taking a walk in the park this morning."
sent3 = "I saw a little and cute dog on the park bench this morning."

# Tokenization
reference = word_tokenize(sent1)
candidate1 = word_tokenize(sent2)
candidate2 = word_tokenize(sent3)

print("Sentence 1 tokens:", reference)
print("Sentence 2 tokens:", candidate1)
print("Sentence 3 tokens:", candidate2)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Initialize smoothing function
smoothie = SmoothingFunction().method1

# Calculate and print BLEU score for first pair
bleu_score = sentence_bleu(
    [word_tokenize(sent1)], word_tokenize(sent2), smoothing_function=smoothie
)
print(f"[1] {sent1}\n[2] {sent2}\n[score] {bleu_score:.5f}")
print("===" * 20)

# Calculate and print BLEU score for second pair
bleu_score = sentence_bleu(
    [word_tokenize(sent1)], word_tokenize(sent3), smoothing_function=smoothie
)
print(f"[1] {sent1}\n[2] {sent3}\n[score] {bleu_score:.5f}")
[1] I met a cute dog while jogging in the park this morning.
[2] I saw a little cute cat while taking a walk in the park this morning.
[score] 0.34235
============================================================
[1] I met a cute dog while jogging in the park this morning.
[2] I saw a little and cute dog on the park bench this morning.
[score] 0.11064
METEOR (Metric for Evaluation of Translation with Explicit Ordering) Score
A metric developed to evaluate the quality of machine translation and text generation.
Key Features
Word Matching
Exact Matching : Identical words
Stem Matching : Words with the same root (e.g., "run" and "running")
Synonym Matching : Words with the same meaning (e.g., "quick" and "fast")
Paraphrase Matching : Phrase-level synonyms (commonly used in machine translation)
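As a rough illustration of the stem and synonym matching stages, NLTK's Porter stemmer and WordNet can be used; this is only a sketch of the underlying resources, not the METEOR aligner itself.
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

# Stem matching: "running" and "run" reduce to the same stem
stemmer = PorterStemmer()
print(stemmer.stem("running"), stemmer.stem("run"))

# Synonym matching: two words match if they share a WordNet synset lemma
for syn in wn.synsets("quick")[:3]:
    print(syn.name(), [lemma.name() for lemma in syn.lemmas()])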
Precision and Recall Analysis
Precision : Proportion of words in the generated text that match the reference text
Recall : Proportion of words in the reference text that match the generated text
F-mean : Harmonic mean of precision and recall
Order Penalty
Evaluates word order similarity between texts.
Applies penalties for non-consecutive matches.
Ensures fluency and natural word ordering.
Weighted Evaluation
Assigns different weights to match types (exact, stem, synonym, paraphrase).
Allows customization based on evaluation needs.
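For example, NLTK's meteor_score implementation exposes alpha (precision/recall balance) and beta/gamma (word-order fragmentation penalty) parameters. The call below is a sketch that passes commonly used values (alpha=0.9, beta=3.0, gamma=0.5, as in the original METEOR formulation) explicitly.
import nltk
from nltk.tokenize import word_tokenize
from nltk.translate import meteor_score

nltk.download("wordnet", quiet=True)
nltk.download("punkt_tab", quiet=True)

# alpha weights recall vs. precision; beta and gamma shape the fragmentation penalty
score = meteor_score.meteor_score(
    [word_tokenize("I met a cute dog while jogging in the park this morning.")],
    word_tokenize("I saw a little cute cat while taking a walk in the park this morning."),
    alpha=0.9,
    beta=3.0,
    gamma=0.5,
)
print(f"{score:.5f}")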
METEOR Score Calculation Process
Word Matching : Find all possible matches between texts.
Precision and Recall : Calculate based on matched words.
F-mean : Compute harmonic mean of precision and recall.
Order Penalty : Assess word order differences.
Final Score : F-mean × (1 - Order Penalty).
Example
Reference : "The cat is on the mat"
Generated : "On the mat is a cat"
Word Matching : All content words match ("the", "cat", "is", "on", "mat")
Precision & Recall = 1.0 (all words match)
F-mean = 1.0
Order Penalty : 0.1 (due to different word ordering)
Final METEOR Score = 1 * (1 - 0.1) = 0.9
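Written out directly, the simplified calculation above looks like this (the full METEOR formula uses a weighted harmonic mean and derives the penalty from matched chunks; the 0.1 penalty here is just the illustrative value from the example):
precision = recall = 1.0  # all matched words from the example above
f_mean = 2 * precision * recall / (precision + recall)
order_penalty = 0.1  # illustrative penalty for the reordered words
meteor = f_mean * (1 - order_penalty)
print(meteor)  # 0.9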
Advantages of METEOR
Recognizes synonyms and word variations.
Balances precision and recall.
Considers word order importance.
Effective with single reference text.
Correlates well with human judgment.
METEOR vs BLEU vs ROUGE
METEOR allows for more flexible evaluation by considering semantic similarity of words.
It tends to match human judgment better than BLEU.
Unlike ROUGE, METEOR explicitly considers word order.
METEOR can be more complex and time-consuming to calculate.
import nltk
import warnings
from nltk.corpus import wordnet as wn

# Suppress NLTK download messages
with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    nltk.download('wordnet', quiet=True)

# Ensure WordNet is loaded
wn.ensure_loaded()
import nltk
import warnings
from nltk.translate import meteor_score

# Suppress NLTK download messages
with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    nltk.download('punkt_tab', quiet=True)

sent1 = "I met a cute dog while jogging in the park this morning."
sent2 = "I saw a little cute cat while taking a walk in the park this morning."
sent3 = "I saw a little and cute dog on the park bench this morning."

# Calculate METEOR score for first pair
meteor = meteor_score.meteor_score(
    [word_tokenize(sent1)],
    word_tokenize(sent2),
)
print(f"[1] {sent1}\n[2] {sent2}\n[score] {meteor:.5f}")
print("===" * 20)

# Calculate METEOR score for second pair
meteor = meteor_score.meteor_score(
    [word_tokenize(sent1)],
    word_tokenize(sent3),
)
print(f"[1] {sent1}\n[2] {sent3}\n[score] {meteor:.5f}")
[1] I met a cute dog while jogging in the park this morning.
[2] I saw a little cute cat while taking a walk in the park this morning.
[score] 0.70489
============================================================
[1] I met a cute dog while jogging in the park this morning.
[2] I saw a little and cute dog on the park bench this morning.
[score] 0.62812
SemScore
SemScore is an efficient evaluation metric that uses semantic textual similarity (STS) to compare model outputs with reference responses.
In the original SemScore study, 12 popular instruction-tuned LLMs were evaluated with 8 common text generation metrics, and SemScore showed the highest correlation with human evaluation.
Key Features of SemScore
Semantic Textual Similarity (STS)
Measures the semantic similarity between the generated text and the reference text.
Considers the overall meaning of the sentences beyond simple word matching.
Utilization of Pre-trained Language Models
Uses pre-trained language models such as BERT or RoBERTa to generate sentence embeddings.
This allows for better capture of context and meaning.
Multiple Reference Handling
Can consider multiple reference answers.
This is particularly useful for open-ended questions or creative tasks.
Granular Evaluation
Evaluates not only the entire response but also parts of the response (e.g., sentence-level).
High Correlation with Human Evaluation
SemScore shows a high correlation with human evaluators' judgments.
Calculation Process
Text Embedding Generation
Converts the generated text and reference text into vectors using pre-trained language models.
Similarity Computation
Calculates the cosine similarity between the embeddings of the generated text and the reference text.
Selection of Maximum Similarity
If there are multiple references, selects the highest similarity score.
Normalization
Normalizes the final score to a value between 0 and 1.
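Putting these steps together, a minimal sketch of the calculation might look like the following. The semscore helper and the reference sentence are hypothetical, and this is not the paper's reference implementation.
from sentence_transformers import SentenceTransformer, util

def semscore(generated: str, references: list[str], model) -> float:
    # 1) Embed the generated text and all reference texts
    gen_emb = model.encode(generated, convert_to_tensor=True)
    ref_embs = model.encode(references, convert_to_tensor=True)
    # 2) Cosine similarity against every reference, 3) keep the best match
    return util.pytorch_cos_sim(gen_emb, ref_embs).max().item()

model = SentenceTransformer("all-mpnet-base-v2")
print(semscore(
    "The authors are Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic.",
    ["Julia Wiesinger, Patrick Marlow and Vladimir Vuskovic wrote the whitepaper."],
    model,
))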
Advantages of SemScore
Semantic Understanding
Considers the meaning of sentences beyond surface-level word matches.
Flexibility
Allows for various forms of answers, making it suitable for creative tasks or open-ended questions.
Context Consideration
Uses pre-trained language models to better understand the context of words and sentences.
Multilingual Support
Can evaluate multiple languages using multilingual models.
SemScore vs Other Evaluation Metrics
Unlike BLEU and ROUGE, it does not rely solely on simple n-gram matching.
Measures more advanced semantic similarity compared to METEOR.
Similar to BERTScore but specialized for instruction-based tasks.
The example below uses a SentenceTransformer model to generate sentence embeddings and calculates the cosine similarity between two sentences.
The model used in the paper is all-mpnet-base-v2.
from sentence_transformers import SentenceTransformer, util
import warnings
import logging
from tqdm import tqdm

# Suppress all warnings and logging messages
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=Warning)  # Suppress all warning types

# Disable tqdm warnings
import tqdm.autonotebook
tqdm.autonotebook.tqdm = tqdm.std.tqdm

# Configure logging
logging.getLogger("sentence_transformers").setLevel(logging.ERROR)
logging.getLogger("transformers").setLevel(logging.ERROR)
logging.getLogger("tqdm").setLevel(logging.ERROR)

sent1 = "I met a cute dog while jogging in the park this morning."
sent2 = "I saw a little cute cat while taking a walk in the park this morning."
sent3 = "I saw a little and cute dog on the park bench this morning."

# Load SentenceTransformer model silently
model = SentenceTransformer("all-mpnet-base-v2", use_auth_token=False)

# Encode sentences
sent1_encoded = model.encode(sent1, convert_to_tensor=True, show_progress_bar=False)
sent2_encoded = model.encode(sent2, convert_to_tensor=True, show_progress_bar=False)
sent3_encoded = model.encode(sent3, convert_to_tensor=True, show_progress_bar=False)

# Calculate and print cosine similarity for first pair
cosine_similarity = util.pytorch_cos_sim(sent1_encoded, sent2_encoded).item()
print(f"[1] {sent1}\n[2] {sent2}\n[score] {cosine_similarity:.5f}")
print("===" * 20)

# Calculate and print cosine similarity for second pair
cosine_similarity = util.pytorch_cos_sim(sent1_encoded, sent3_encoded).item()
print(f"[1] {sent1}\n[2] {sent3}\n[score] {cosine_similarity:.5f}")
[1] I met a cute dog while jogging in the park this morning.
[2] I saw a little cute cat while taking a walk in the park this morning.
[score] 0.69124
============================================================
[1] I met a cute dog while jogging in the park this morning.
[2] I saw a little and cute dog on the park bench this morning.
[score] 0.76015
View the evaluation results for experiment: 'Heuristic-EVAL-201c2ddf' at:
https://smith.langchain.com/o/97d4ef95-7b86-4c82-9f4c-3f18e315c9b2/datasets/920886b5-0aa2-4f47-b23f-b3dfc33135ef/compare?selectedSessions=048bce87-ae35-4fd2-af16-6c738fc93762