Synthetic Dataset Generation using RAG


Overview

This tutorial covers an example of generating a synthetic dataset using RAG. Such datasets are typically used to create evaluation datasets for domain-specific RAG pipelines or to generate synthetic data for model training. This tutorial focuses on the following feature.

Features

  • Domain Specific RAG Evaluation Dataset: Generates a domain-specific synthetic dataset (Context, Question, Answer) for evaluating the RAG pipeline.



Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, along with useful functions and utilities for these tutorials.

  • You can check out langchain-opentutorial for more details.

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.
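For illustration, here is a minimal sketch of what loading a `.env` file involves. In practice the tutorial setup typically relies on the `python-dotenv` package (`load_dotenv()`); the stdlib-only loader below is an assumption-free stand-in so the example is self-contained.

```python
import os

def load_env_file(path=".env"):
    """Minimal .env loader: copies KEY=VALUE lines into os.environ.
    (A real setup would normally use python-dotenv's load_dotenv.)"""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and malformed entries.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: do not overwrite keys already set in the shell.
            os.environ.setdefault(key.strip(), value.strip())
```

A key such as `OPENAI_API_KEY` placed in `.env` then becomes available via `os.environ["OPENAI_API_KEY"]`.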

[Note] This is not necessary if you've already set the required API keys in previous steps.

Domain Specific RAG Evaluation Dataset

Generates a synthetic dataset (Context, Question, Answer) for evaluating the Domain Specific RAG pipeline.

  • Context: A context randomly selected from documents in a specific domain is used as the ground truth.

  • Question: A question that can be answered using the Context.

  • Answer: An answer generated based on the Context and the Question.
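The (Context, Question, Answer) triple described above can be modeled as a simple record; the class and field names below are illustrative, not part of the tutorial's code.

```python
from dataclasses import dataclass

@dataclass
class QATriple:
    """One synthetic evaluation record (names are illustrative)."""
    context: str   # ground-truth passage sampled from the domain documents
    question: str  # question answerable from the context
    answer: str    # answer generated from the context and the question
```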

Question Generating Prompt

A prompt for generating questions from a given context using RAG (Retrieval-Augmented Generation) is structured as follows. It consists of four main sections—Instruction, Requirements, Style, and Example—along with an Indicator section where the actual variable values are mapped. Each section is explained below:

  • Instruction: Provides overall guidance for the prompt, including the purpose of the task and an explanation of the structured prompt sections.

  • Requirements: Lists essential conditions that must be met when performing the task.

  • Style: Specifies the stylistic guidelines for the generated output.

  • Example: Includes actual execution examples.
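The four sections plus the Indicator can be assembled as a single template string, as in the sketch below. The section wording and the example content are placeholders of my own, not the tutorial's actual prompt; only the overall layout follows the description above.

```python
# Illustrative template; section text is a placeholder, not the tutorial's prompt.
QUESTION_GEN_TEMPLATE = """\
### Instruction
Generate one question that can be answered using only the context below.

### Requirements
- The question must be answerable from the context alone.
- Do not refer to the context explicitly (e.g., "according to the passage").

### Style
- A single, concise interrogative sentence.

### Example
Context: The Eiffel Tower was completed in 1889.
Question: In what year was the Eiffel Tower completed?

### Indicator
Context: {context}
Question:"""

def build_question_prompt(context: str) -> str:
    # The Indicator section is where the actual variable value is mapped.
    return QUESTION_GEN_TEMPLATE.format(context=context)
```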

Question Evolving Prompt

A prompt that uses RAG (Retrieval-Augmented Generation) to correct inaccurate information or produce a more evolved question from a given context and a draft question is structured as follows. It consists of three main sections—Instruction, Evolving, and Example—along with an Indicator section where the actual variable values are mapped. Each section is explained below:

  • Instruction: Provides the overall guidance for the prompt.

  • Evolving: Contains step-by-step instructions to achieve the purpose of the prompt.

  • Example: Includes actual execution examples.
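A corresponding template for the evolving step might look like the following sketch. Again, the step wording is illustrative; the structure (Instruction, Evolving, Example, Indicator) follows the description above.

```python
# Illustrative template; the evolving steps are placeholders, not the tutorial's prompt.
QUESTION_EVOLVE_TEMPLATE = """\
### Instruction
Refine the draft question so it is accurate and more challenging,
while remaining answerable from the context.

### Evolving
1. Check the draft question against the context and correct any inaccuracies.
2. Add reasonable complexity, e.g., require combining two facts from the context.
3. Keep the question self-contained and unambiguous.

### Example
Draft: When was the tower built?
Evolved: In what year was the Eiffel Tower completed?

### Indicator
Context: {context}
Draft question: {question}
Evolved question:"""

def build_evolving_prompt(context: str, question: str) -> str:
    # Both the context and the draft question are mapped in the Indicator section.
    return QUESTION_EVOLVE_TEMPLATE.format(context=context, question=question)
```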

Answer Generating Prompt

A prompt that uses RAG (Retrieval-Augmented Generation) to generate the final answer from a given context and question is structured as follows. It consists of two main sections—Instruction and Example—along with an Indicator section where the actual variable values are mapped. Each section is explained below:

  • Instruction: Provides the overall guidance for the prompt.

  • Example: Includes actual execution examples.
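The answer-generation template is the simplest of the three; a sketch with placeholder wording follows.

```python
# Illustrative template; the instruction text is a placeholder, not the tutorial's prompt.
ANSWER_GEN_TEMPLATE = """\
### Instruction
Answer the question using only the information in the context.

### Example
Context: The Eiffel Tower was completed in 1889.
Question: In what year was the Eiffel Tower completed?
Answer: 1889

### Indicator
Context: {context}
Question: {question}
Answer:"""

def build_answer_prompt(context: str, question: str) -> str:
    return ANSWER_GEN_TEMPLATE.format(context=context, question=question)
```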

SyntheticGenerator

Building on the steps above, I have combined the full flow into a single class, SyntheticGenerator(). The overall flow is the same as in the tutorial, and it runs via the run() method. If you specify a path in save_path, the generated data is saved as a CSV file, with each (Context, Query, Answer) triple stored in a single row.
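The class might be structured roughly as follows. This is a sketch, not the tutorial's actual implementation: the three LLM-backed steps (question generation, evolving, answer generation) are injected as plain callables so the flow is runnable without a model, and all names are assumptions.

```python
import csv
import random

class SyntheticGenerator:
    """Sketch of the generation flow; LLM calls are replaced by
    injected callables (an assumption, not the tutorial's code)."""

    def __init__(self, documents, gen_question, evolve_question, gen_answer):
        self.documents = documents              # domain corpus to sample contexts from
        self.gen_question = gen_question        # context -> draft question
        self.evolve_question = evolve_question  # (context, draft) -> evolved question
        self.gen_answer = gen_answer            # (context, question) -> answer

    def run(self, num_samples, save_path=None):
        rows = []
        # Randomly select contexts from the domain documents (the ground truth).
        for context in random.sample(self.documents, num_samples):
            question = self.evolve_question(context, self.gen_question(context))
            answer = self.gen_answer(context, question)
            rows.append((context, question, answer))
        if save_path:
            # Each (Context, Query, Answer) triple occupies a single CSV row.
            with open(save_path, "w", newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                writer.writerow(["Context", "Query", "Answer"])
                writer.writerows(rows)
        return rows
```

With the callables wired to the three prompts and an LLM client, `run(num_samples, save_path="dataset.csv")` would produce the evaluation CSV described above.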
