Heuristic Evaluation


Overview

Heuristic evaluation is a quick and simple method of inference used when limited time or information makes a fully reasoned judgment impractical.

This tutorial describes heuristic evaluation for text generation and retrieval-augmented generation (RAG) systems, illustrating each metric with examples.

The main components covered include:

  1. Word tokenization, performed with NLTK's word_tokenize function.

  2. Heuristic evaluation based on ROUGE, BLEU, METEOR, and SemScore.

    • ROUGE : Used to evaluate the quality of automatic summaries and machine translations.

    • BLEU : Mainly used for machine translation evaluation. Measures how similar the generated text is to the reference text.

    • METEOR : An evaluation metric developed to assess the quality of machine translation.

    • SemScore : Compares model outputs with gold standard responses using Semantic Textual Similarity (STS).

This guide is designed to help developers and researchers implement and understand these evaluation metrics for assessing the quality of text generation systems, particularly in the context of RAG applications.

Table of Contents

  • Overview

  • Environment Setup

  • Heuristic Evaluation Based on ROUGE, BLEU, METEOR, SemScore


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for these tutorials.

  • You can check out langchain-opentutorial for more details.
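For reference, the libraries used later in this tutorial can also be installed directly with pip. The package list below is an assumption based on the metrics covered here, not the tutorial's own install cell.

```python
# Assumed package list based on the libraries discussed in this tutorial;
# the original tutorial may install dependencies via langchain-opentutorial instead.
%pip install -qU langchain-opentutorial langchain-openai langsmith nltk rouge-score sentence-transformers python-dotenv
```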

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.
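A minimal sketch of loading them, assuming the python-dotenv package is installed and a .env file exists in the project root:

```python
# Load environment variables (e.g., OPENAI_API_KEY) from a .env file.
from dotenv import load_dotenv

load_dotenv(override=True)
```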

Heuristic Evaluation Based on ROUGE, BLEU, METEOR, SemScore

Heuristic evaluation is a reasoning method that can be applied quickly and easily when a fully rational judgment is not possible due to limited time or information.

(It also has the advantage of saving time and cost compared to evaluating with an LLM-as-judge.)

Function Definition for RAG Performance Testing

Let's create a RAG system for testing.

Create a function named ask_question. It takes a dictionary named inputs as input and returns a dictionary containing an answer key.
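A minimal sketch of such a function is shown below. The prompt-only chain is a stand-in for a full RAG pipeline (the retriever setup is omitted), and the model name is illustrative rather than taken from the original tutorial.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Stand-in chain; a real RAG setup would prepend a retriever that injects
# retrieved context into the prompt.
prompt = ChatPromptTemplate.from_template("Answer the question concisely:\n{question}")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # illustrative model name
chain = prompt | llm | StrOutputParser()

def ask_question(inputs: dict) -> dict:
    # Receives {"question": ...} and returns {"answer": ...}.
    return {"answer": chain.invoke({"question": inputs["question"]})}
```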

Word Tokenization Using NLTK

Word tokenization is the process of splitting text into individual words or tokens. NLTK (Natural Language Toolkit) provides robust word tokenization through its word_tokenize function. Key functions used here (see the example after this list):

  • word_tokenize : NLTK's main tokenization function

  • nltk.download('punkt') : Downloads required tokenization models

  • split() : Python's basic string splitting

  • word_tokenize() : NLTK's advanced tokenization
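A short example contrasting the two approaches:

```python
import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer models required by word_tokenize (one-time step).
nltk.download("punkt")

sentence = "I met a cute dog while jogging in the park this morning."

print(sentence.split())         # basic whitespace split; punctuation stays attached
print(word_tokenize(sentence))  # NLTK tokenization; punctuation becomes separate tokens
```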

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score

  • It is an evaluation metric used to assess the quality of automatic summarization and machine translation.

  • Measures how many important keywords from the reference text are included in the generated text.

  • Calculated based on n-gram overlap.

    Note: An n-gram is a contiguous sequence of n words from a text: a 1-gram (unigram) is a single word, a 2-gram (bigram) is two consecutive words, and so on.

ROUGE-1

  • Measures the similarity at the word level.

  • Evaluates the individual word matches between two sentences.

ROUGE-2

  • Measures the similarity based on overlapping consecutive word pairs (bigrams).

  • Evaluates the matches of consecutive two-word pairs between two sentences.

ROUGE-L

  • Measures the similarity based on the Longest Common Subsequence (LCS).

  • Evaluates the word order at the sentence level without requiring continuous matches.

  • More flexible than ROUGE-N, as it can capture longer-distance word relationships.

  • Naturally reflects sentence structure similarity by finding the longest sequence that preserves word order but allows gaps.

Practical Example

Example sentences

  • Original Sentence : "I met a cute dog while jogging in the park this morning."

  • Generated Sentence : "I saw a little cute cat while taking a walk in the park this morning."

  1. ROUGE-1 Analysis

    • Each word is compared individually.

    • Matching words : "I", "a", "cute", "while", "in", "the", "park", "this", "morning"

    • These words appear in both sentences, so they are reflected in the score.

  2. ROUGE-2 Analysis

    • Compares sequences of consecutive word pairs.

    • Matching phrases : "in the", "the park", "park this", "this morning"

    • These two-word combinations appear in both sentences, so they are reflected in the score.

  3. ROUGE-L Analysis

    • Finds the longest common subsequence while maintaining word order.

    • Longest common subsequence : "I a cute while in the park this morning"

    • These words appear in the same order in both sentences, so this sequence is reflected in the ROUGE-L score.

This example demonstrates how each ROUGE metric captures different aspects of similarity:

  • ROUGE-1 captures basic content overlap through individual word matches.

  • ROUGE-2 identifies common phrases and local word order.

  • ROUGE-L evaluates overall sentence structure while allowing for gaps between matched words.
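A minimal sketch of computing these scores with the rouge_score package, one common ROUGE implementation (the original tutorial may use a different library):

```python
from rouge_score import rouge_scorer

reference = "I met a cute dog while jogging in the park this morning."
generated = "I saw a little cute cat while taking a walk in the park this morning."

# ROUGE-1, ROUGE-2, and ROUGE-L with stemming enabled.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)

for name, score in scores.items():
    # Each entry holds precision, recall, and F-measure.
    print(f"{name}: precision={score.precision:.3f}, recall={score.recall:.3f}, f1={score.fmeasure:.3f}")
```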

BLEU (Bilingual Evaluation Understudy) Score

BLEU is a metric used to evaluate text generation quality, particularly in machine translation. It measures the similarity between generated text and reference text by comparing the overlap of word sequences (n-grams).

Key Features of BLEU

  1. N-gram Precision

    • BLEU calculates precision from 1-gram (individual words) to 4-gram (sequences of 4 words)

    • The precision measures how many n-grams in the generated text match those in the reference text

    • Higher n-gram matches indicate better phrase-level similarity

  2. Brevity Penalty

    • Imposes a penalty if the generated text is shorter than the reference text.

    • This prevents the system from achieving high precision by only generating short sentences.

  3. Geometric Mean

    • The final BLEU score is the geometric mean of the n-gram precisions multiplied by the brevity penalty.

    • Results in a score between 0 and 1

Example Analysis

  • Original Sentence : "I met a cute dog while jogging in the park this morning."

  • Generated Sentence : "I saw a little cute cat while taking a walk in the park this morning."

  1. 1-gram (Unigram) Analysis

    • Matching words: "I", "a", "cute", "while", "in", "the", "park", "this", "morning" ("a" is counted once because it appears only once in the reference)

    • Precision : 9 / 15 = 0.6

  2. 2-gram (Bigram) Analysis

    • Matching pairs: "in the", "the park", "park this", "this morning"

    • Precision : 4 / 14 ≈ 0.2857

  3. 3-gram (Trigram) Analysis

    • Matching sequences: "in the park", "the park this", "park this morning"

    • Precision : 3 / 13 ≈ 0.2308

  4. 4-gram Analysis

    • Matching sequences: "in the park this", "the park this morning"

    • Precision : 2 / 12 ≈ 0.1667

  5. Brevity Penalty

    • Since the generated sentence is not shorter than the reference, there is no penalty (1.0).

  6. Final BLEU Score

    • Geometric mean (0.6, 0.2857, 0.2308, 0.1667) × 1.0

    • The final BLEU score is about 0.285, or roughly 28.5%.
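The arithmetic above can be checked directly with a few lines of Python:

```python
import math

# N-gram precisions from the worked example and a brevity penalty of 1.0
# (the generated sentence is not shorter than the reference).
precisions = [9 / 15, 4 / 14, 3 / 13, 2 / 12]
bp = 1.0

bleu = bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))
print(round(bleu, 4))  # ≈ 0.285
```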

Limitations

  • Only checks for simple string matches without considering meaning.

  • Does not distinguish the importance of words.

BLEU scores range from 0 to 1, with scores closer to 1 indicating higher quality. However, achieving a perfect score of 1 is very difficult in practice.
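In practice, BLEU is usually computed with a library rather than by hand. A minimal sketch using NLTK's sentence_bleu with simple whitespace tokenization:

```python
from nltk.translate.bleu_score import sentence_bleu

reference = "I met a cute dog while jogging in the park this morning."
generated = "I saw a little cute cat while taking a walk in the park this morning."

# sentence_bleu expects a list of tokenized references and a tokenized candidate;
# the default weights average 1-gram to 4-gram precisions.
score = sentence_bleu([reference.split()], generated.split())
print(f"BLEU: {score:.4f}")  # roughly 0.28 for this pair
```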

METEOR (Metric for Evaluation of Translation with Explicit Ordering) Score

METEOR is a metric developed to evaluate the quality of machine translation and text generation.

Key Features

  1. Word Matching

    • Exact Matching : Identical words

    • Stem Matching : Words with the same root (e.g., "run" and "running")

    • Synonym Matching : Words with the same meaning (e.g., "quick" and "fast")

    • Paraphrase Matching : Phrase-level synonyms (commonly used in machine translation)

  2. Precision and Recall Analysis

    • Precision : Proportion of words in the generated text that match the reference text

    • Recall : Proportion of words in the reference text that match the generated text

    • F-mean : Harmonic mean of precision and recall

  3. Order Penalty

    • Evaluates word order similarity between texts.

    • Applies penalties for non-consecutive matches.

    • Ensures fluency and natural word ordering.

  4. Weighted Evaluation

    • Assigns different weights to match types (exact, stem, synonym, paraphrase).

    • Allows customization based on evaluation needs.

METEOR Score Calculation Process

  1. Word Matching : Find all possible matches between texts.

  2. Precision and Recall : Calculate based on matched words.

  3. F-mean : Compute harmonic mean of precision and recall.

  4. Order Penalty : Assess word order differences.

  5. Final Score : F-mean × (1 - Order Penalty).

Example

  • Reference : "The cat is on the mat"

  • Generated : "On the mat is a cat"

  1. Word Matching : All content words match ("the", "cat", "is", "on", "mat")

  2. Precision & Recall = 1.0 (all content words match)

  3. F-mean = 1.0

  4. Order Penalty : 0.1 (due to different word ordering)

  5. Final METEOR Score = 1 * (1 - 0.1) = 0.9
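A minimal sketch of computing METEOR with NLTK; it needs the WordNet data for stem and synonym matching, and recent NLTK versions expect pre-tokenized input:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score

# One-time downloads: tokenizer models and WordNet (used for synonym matching).
nltk.download("punkt")
nltk.download("wordnet")

reference = "The cat is on the mat"
generated = "On the mat is a cat"

score = meteor_score([word_tokenize(reference)], word_tokenize(generated))
print(f"METEOR: {score:.4f}")
```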

Advantages of METEOR

  1. Recognizes synonyms and word variations.

  2. Balances precision and recall.

  3. Considers word order importance.

  4. Effective with single reference text.

  5. Correlates well with human judgment.

METEOR vs BLEU vs ROUGE

  • METEOR allows for more flexible evaluation by considering semantic similarity of words.

  • It tends to match human judgment better than BLEU.

  • Unlike ROUGE, it explicitly considers word order.

  • METEOR can be more complicated and time-consuming to calculate.

SemScore

SemScore is an efficient evaluation metric, introduced in recent research, that uses semantic textual similarity (STS) to compare model outputs with reference responses.

In that work, 12 major instruction-tuned LLMs were evaluated with 8 common text generation metrics, and SemScore showed the highest correlation with human evaluation.

Key Features of SemScore

  1. Semantic Textual Similarity (STS)

    • Measures the semantic similarity between the generated text and the reference text.

    • Considers the overall meaning of the sentences beyond simple word matching.

  2. Utilization of Pre-trained Language Models

    • Uses pre-trained language models such as BERT or RoBERTa to generate sentence embeddings.

    • This allows for better capture of context and meaning.

  3. Multiple Reference Handling

    • Can consider multiple reference answers.

    • This is particularly useful for open-ended questions or creative tasks.

  4. Granular Evaluation

    • Evaluates not only the entire response but also parts of the response (e.g., sentence-level).

  5. High Correlation with Human Evaluation

    • SemScore shows a high correlation with human evaluators' judgments.

Calculation Process

  1. Text Embedding Generation

    • Converts the generated text and reference text into vectors using pre-trained language models.

  2. Similarity Computation

    • Calculates the cosine similarity between the embeddings of the generated text and the reference text.

  3. Selection of Maximum Similarity

    • If there are multiple references, selects the highest similarity score.

  4. Normalization

    • Normalizes the final score to a value between 0 and 1.

Advantages of SemScore

  1. Semantic Understanding

    • Considers the meaning of sentences beyond surface-level word matches.

  2. Flexibility

    • Allows for various forms of answers, making it suitable for creative tasks or open-ended questions.

  3. Context Consideration

    • Uses pre-trained language models to better understand the context of words and sentences.

  4. Multilingual Support

    • Can evaluate multiple languages using multilingual models.

SemScore vs Other Evaluation Metrics

  • Unlike BLEU and ROUGE, it does not rely solely on simple n-gram matching.

  • Measures more advanced semantic similarity compared to METEOR.

  • Similar to BERTScore but specialized for instruction-based tasks.

The SemScore evaluator uses a SentenceTransformer model to generate sentence embeddings and calculates the cosine similarity between the two sentences.

  • The model used in the paper is all-mpnet-base-v2.
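A minimal sketch of a SemScore-style comparison, assuming the sentence-transformers package is installed:

```python
from sentence_transformers import SentenceTransformer, util

# all-mpnet-base-v2 is the embedding model reported in the SemScore paper.
model = SentenceTransformer("all-mpnet-base-v2")

reference = "I met a cute dog while jogging in the park this morning."
generated = "I saw a little cute cat while taking a walk in the park this morning."

embeddings = model.encode([reference, generated], convert_to_tensor=True)
# Cosine similarity between the two sentence embeddings.
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"SemScore (cosine similarity): {similarity:.4f}")
```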

The metrics summarized above can be wrapped as evaluators, and the evaluation is then run with them, as sketched below.
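A hypothetical sketch of wiring one such heuristic evaluator (ROUGE-L here) into a LangSmith experiment; the dataset name and output keys are assumptions for illustration, not taken from the original tutorial:

```python
from langsmith.evaluation import evaluate
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l_evaluator(run, example) -> dict:
    # Compare the generated answer with the gold answer using ROUGE-L F1.
    generated = run.outputs["answer"]      # produced by ask_question
    reference = example.outputs["answer"]  # assumed key in the dataset examples
    score = scorer.score(reference, generated)["rougeL"].fmeasure
    return {"key": "ROUGE-L", "score": score}

experiment_results = evaluate(
    ask_question,                    # the target function defined earlier
    data="RAG_EVAL_DATASET",         # hypothetical dataset name
    evaluators=[rouge_l_evaluator],
    experiment_prefix="Heuristic-Eval",
)
```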

Check the results in LangSmith.

