Synthetic Dataset Generation using RAG
Author: Ash-hun
Design:
Peer Review: syshin0116, Kane
This is a part of LangChain Open Tutorial
Overview
This tutorial covers an example of generating a synthetic dataset using RAG. Such datasets are typically used to build evaluation datasets for Domain Specific RAG pipelines or to generate synthetic data for model training; the structure is the same in both cases, but the intended use and purpose differ. This tutorial focuses on the feature described below.
Features
Domain Specific RAG Evaluation Dataset : Generates a domain specific synthetic dataset (Context, Question, Answer) for evaluating the RAG pipeline.
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
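For reference, a typical setup cell installs the required packages and sets the environment variables. The sketch below assumes the package.install and set_env helpers provided by langchain-opentutorial; the exact package list and project name are illustrative assumptions.

```python
# Install required packages (the list below is an assumption for this tutorial).
from langchain_opentutorial import package

package.install(
    [
        "langchain",
        "langchain_core",
        "langchain_openai",
    ],
    verbose=False,
    upgrade=False,
)

# Set the environment variables needed for the tutorial.
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_PROJECT": "Synthetic-Dataset-Generation-using-RAG",
    }
)
```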
You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.
[Note] This is not necessary if you've already set the required API keys in previous steps.
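If you go with the .env approach, a minimal sketch using the python-dotenv package looks like this:

```python
# Load variables such as OPENAI_API_KEY from a .env file in the project root.
from dotenv import load_dotenv

load_dotenv(override=True)
```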
Domain Specific RAG Evaluation Dataset
Generates a synthetic dataset (Context, Question, Answer) for evaluating the Domain Specific RAG pipeline.
Context: A context randomly selected from documents in a specific domain, used as the ground truth.
Question: A question that can be answered using the Context.
Answer: An answer generated based on the Context and the Question.
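For illustration, a single entry in the generated dataset might look like the following (the content is a made-up example, not output from the pipeline):

```python
# One hypothetical row of the synthetic evaluation dataset.
sample = {
    "Context": "The warranty covers manufacturing defects for two years from the date of purchase.",
    "Question": "How long does the warranty cover manufacturing defects?",
    "Answer": "Two years from the date of purchase.",
}
```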
Question Generating Prompt
A prompt for generating questions from a given context using the RAG (Retrieval-Augmented Generation) technique is structured as follows.
It consists of four main sections—Instruction, Requirements, Style, and Example—along with an Indicator section where actual variable values are mapped. Each section is explained below:
Instruction: Provides overall guidance for the prompt, including the purpose of the task and an explanation of the structured prompt sections.
Requirements: Lists essential conditions that must be met when performing the task.
Style: Specifies the stylistic guidelines for the generated output.
Example: Includes actual execution examples.
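As a reference point, a minimal sketch of such a prompt, written with LangChain's PromptTemplate, is shown below. The section wording and the input variables (context, num_questions) are illustrative assumptions, not the tutorial's original prompt text.

```python
from langchain_core.prompts import PromptTemplate

# Sketch of a question-generating prompt with the four sections described above.
question_prompt = PromptTemplate.from_template(
    """### Instruction
You are given a context taken from a domain-specific document.
Generate questions that can be answered using only that context.

### Requirements
- Every question must be answerable from the context alone.
- Do not ask about information that is missing from the context.

### Style
- Write clear, self-contained questions, each in a single sentence.

### Example
Context: The warranty covers manufacturing defects for two years.
Question: How long does the warranty cover manufacturing defects?

### Indicator
Context: {context}
Number of questions: {num_questions}
Questions:"""
)
```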
Question Evolving Prompt
A prompt that uses the RAG (Retrieval-Augmented Generation) technique to correct inaccurate information or to generate a more evolved question from a given context and a draft question is structured as follows.
It consists of three main sections—Instruction, Evolving, and Example—along with an Indicator section where actual variable values are mapped. Each section is explained below:
Instruction: Provides the overall guidance for the prompt.
Evolving: Contains step-by-step instructions to achieve the purpose of the prompt.
Example: Includes actual execution examples.
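A comparable sketch of the evolving prompt is shown below; again, the wording and the input variables (context, draft_question) are assumptions made for illustration.

```python
from langchain_core.prompts import PromptTemplate

# Sketch of a question-evolving prompt with the three sections described above.
evolving_prompt = PromptTemplate.from_template(
    """### Instruction
You are given a context and a draft question.
Refine the draft question so that it is accurate with respect to the context.

### Evolving
1. Check whether the draft question can be answered from the context.
2. Correct any information in the question that contradicts the context.
3. Rewrite the question so that it is specific and self-contained.

### Example
Context: The warranty covers manufacturing defects for two years.
Draft question: How long is the product guaranteed against any kind of damage?
Evolved question: How long does the warranty cover manufacturing defects?

### Indicator
Context: {context}
Draft question: {draft_question}
Evolved question:"""
)
```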
Answer Generating Prompt
A prompt that uses the RAG (Retrieval-Augmented Generation) technique to generate the final answer based on a given context and question is structured as follows. It consists of two main sections—Instruction and Example—along with an Indicator section where actual variable values are mapped. Each section is explained below:
Instruction: Provides the overall guidance for the prompt.
Example: Includes actual execution examples.
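A minimal sketch of this prompt, chained to a chat model with LCEL, might look like the following. The prompt wording, the model name (gpt-4o-mini), and the input variables are assumptions for illustration.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Sketch of an answer-generating prompt with the two sections described above.
answer_prompt = PromptTemplate.from_template(
    """### Instruction
Answer the question using only the information in the given context.

### Example
Context: The warranty covers manufacturing defects for two years.
Question: How long does the warranty cover manufacturing defects?
Answer: Two years.

### Indicator
Context: {context}
Question: {question}
Answer:"""
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
answer_chain = answer_prompt | llm | StrOutputParser()

# Example invocation with a hypothetical context and question.
answer = answer_chain.invoke(
    {
        "context": "The warranty covers manufacturing defects for two years.",
        "question": "How long does the warranty cover manufacturing defects?",
    }
)
```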
SyntheticGenerator
Based on the tutorial above, I have wrapped the whole flow into a single class called SyntheticGenerator().
The overall flow is the same as in the tutorial, and it can be used by executing the run() method. By specifying the desired path in save_path, the generated data is saved as a CSV file, with each (Context, Query, Answer) triple stored in a single row.
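The full class definition is not reproduced here; the sketch below is only a rough skeleton of how such a generator could be structured, based on the flow described above. The constructor arguments and internal helper names are assumptions; only the run() method, the save_path argument, and the CSV output come from the description.

```python
import csv
import random

from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI


class SyntheticGenerator:
    """Hypothetical skeleton of the generator described above."""

    def __init__(self, documents, llm=None, num_samples=10):
        self.documents = documents  # list of domain-specific text chunks
        self.llm = llm or ChatOpenAI(model="gpt-4o-mini", temperature=0)
        self.num_samples = num_samples
        # Reuse the three prompts sketched earlier in this tutorial.
        self.question_chain = question_prompt | self.llm | StrOutputParser()
        self.evolving_chain = evolving_prompt | self.llm | StrOutputParser()
        self.answer_chain = answer_prompt | self.llm | StrOutputParser()

    def run(self, save_path=None):
        rows = []
        for _ in range(self.num_samples):
            # Pick a random context, draft a question, evolve it, then answer it.
            context = random.choice(self.documents)
            draft = self.question_chain.invoke(
                {"context": context, "num_questions": 1}
            )
            question = self.evolving_chain.invoke(
                {"context": context, "draft_question": draft}
            )
            answer = self.answer_chain.invoke(
                {"context": context, "question": question}
            )
            rows.append({"Context": context, "Query": question, "Answer": answer})

        # Save each (Context, Query, Answer) triple as one CSV row.
        if save_path is not None:
            with open(save_path, "w", newline="", encoding="utf-8") as f:
                writer = csv.DictWriter(
                    f, fieldnames=["Context", "Query", "Answer"]
                )
                writer.writeheader()
                writer.writerows(rows)
        return rows


# Example usage (documents and path are placeholders):
# generator = SyntheticGenerator(documents=chunks, num_samples=20)
# dataset = generator.run(save_path="synthetic_dataset.csv")
```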