Translation


Overview

This tutorial compares two approaches to translating Chinese text into English using LangChain.

The first approach utilizes a single LLM (e.g. GPT-4) to generate a straightforward translation. The second approach employs Retrieval-Augmented Generation (RAG), which enhances translation accuracy by retrieving relevant documents.

The tutorial evaluates the translation accuracy and performance of each method, helping users choose the most suitable approach for their needs.

Table of Contents

  • Overview

  • Environment Setup

  • Translation using LLM

  • Translation using RAG

  • Evaluation of translation results


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy environment setup, along with useful functions and utilities for these tutorials.

  • You can check out langchain-opentutorial for more details.

Load sample text and output the content.
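A minimal sketch of this step might look as follows. The file path `data/chinese_sample.txt` and the fallback sentence are assumptions, not the tutorial's actual data; the fallback keeps the example runnable without the file.

```python
# Load the sample Chinese text; fall back to an inline sentence if the
# (assumed) file is not present.
from pathlib import Path

sample_path = Path("data/chinese_sample.txt")
if sample_path.exists():
    source_text = sample_path.read_text(encoding="utf-8")
else:
    source_text = "人工智能正在改变世界，机器翻译让跨语言交流更加便捷。"

print(source_text)
```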

You can alternatively set OPENAI_API_KEY in a .env file and load it.

[Note] This is not necessary if you've already set OPENAI_API_KEY in previous steps.
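One way to sketch this, assuming the optional python-dotenv package, is to attempt loading a .env file and fall back to whatever is already in the environment:

```python
import os

# Load OPENAI_API_KEY from a .env file if python-dotenv is installed;
# otherwise rely on the variable already being set in the environment.
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

print("OPENAI_API_KEY set:", "OPENAI_API_KEY" in os.environ)
```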

Translation using LLM

Translation using LLM refers to using a large language model (LLM), such as GPT-4, to translate text from one language to another. The model processes the input text and generates a direct translation based on its pre-trained knowledge. This approach is simple and fast, but its accuracy is bounded by what the model already knows.

Translation using RAG

Translation using RAG (Retrieval-Augmented Generation) enhances translation accuracy by combining a pre-trained LLM with a retrieval mechanism. This approach first retrieves relevant documents or data related to the input text and then utilizes this additional context to generate a more precise and contextually accurate translation.

Simple Search Implementation Using FAISS

In this implementation, we use a vector database to store and retrieve embedded representations of entire sentences. Instead of relying solely on predefined knowledge in the LLM, our approach allows the model to retrieve semantically relevant sentences from the vector database, improving the translation's accuracy and fluency.

FAISS (Facebook AI Similarity Search)

FAISS is a library developed by Facebook AI for efficient similarity search and clustering of dense vectors. It is widely used for approximate nearest neighbor (ANN) search in large-scale datasets.

Let's compare translation using LLM and translation using RAG.

First, write the necessary functions.

Use the written functions to perform the comparison.
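The comparison itself can be a simple loop over the source sentences, collecting both outputs for later evaluation. The helper below is a self-contained sketch; the stand-in lambdas are placeholders for real translation chains.

```python
def compare_translations(texts, translate_llm, translate_rag):
    """Run both translators on each text and collect results for evaluation."""
    results = []
    for text in texts:
        results.append({
            "source": text,
            "llm": translate_llm(text),
            "rag": translate_rag(text),
        })
    return results


# Demo with stand-in translators (real chains would call the LLM instead):
demo = compare_translations(
    ["人工智能正在改变世界。"],
    translate_llm=lambda t: "LLM translation of: " + t,
    translate_rag=lambda t: "RAG translation of: " + t,
)
print(demo[0]["llm"])
```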

Evaluation of translation results

Evaluating machine translation quality is essential to ensure the accuracy and fluency of translated text. In this tutorial, we use two key metrics, TER and BERTScore, to assess the quality of translations produced by both a general LLM-based translation system and a RAG-based translation system.

By combining TER and BERTScore, we achieve a comprehensive evaluation of translation quality. TER measures the structural differences and required edits between translations and reference texts. BERTScore captures the semantic similarity between translations and references. This dual evaluation approach allows us to effectively compare LLM and RAG translations, helping determine which method provides more accurate, fluent, and natural translations.

TER (Translation Edit Rate)

TER quantifies how much editing is required to transform a system-generated translation into the reference translation. It accounts for insertions, deletions, substitutions, and shifts (word reordering).

Interpretation: Lower TER indicates a better translation (fewer modifications needed); higher TER suggests that the translation deviates significantly from the reference.

BERTScore - Contextual Semantic Evaluation

BERTScore evaluates translation quality by computing semantic similarity scores between reference and candidate translations. It utilizes contextual embeddings from a pre-trained BERT model, unlike traditional n-gram-based methods that focus solely on word overlap.

Interpretation: Higher BERTScore (closer to 1.0) indicates better semantic similarity between the candidate and reference translations. Lower scores indicate less semantic alignment with the reference translation.

Since Chinese and English are grammatically very different languages, there can be significant differences in word order and sentence structure. As a result, the TER score may be relatively high, while BERTScore can serve as a more important evaluation metric.

By leveraging both TER and BERTScore, we can effectively analyze the strengths and weaknesses of LLM-based and RAG-based translation methods.
