Understanding the basic structure of RAG

Author: Sun Hyoung Lee
Peer Review:
Proofread : BokyungisaGod
This is a part of LangChain Open Tutorial

Overview

1. Pre-processing - Steps 1 to 4

The pre-processing stage involves four steps to load, split, embed, and store documents into a Vector DB (database).

Step 1: Document Load : Load the document content.
Step 2: Text Split : Split the document into chunks based on specific criteria.
Step 3: Embedding : Generate embeddings for the chunks and prepare them for storage.
Step 4: Vector DB Storage : Store the embedded chunks in the database.

2. RAG Execution (RunTime) - Steps 5 to 8

Step 5: Retriever : Define a retriever to fetch results from the database based on the input query. Retrievers use search algorithms and are categorized as dense or sparse:
- Dense : Similarity-based search.
- Sparse : Keyword-based search.
Step 6: Prompt : Create a prompt for executing RAG. The context in the prompt includes content retrieved from the document. Through prompt engineering, you can specify the format of the answer.
Step 7: LLM : Define the language model (e.g., GPT-3.5, GPT-4, Claude).
Step 8: Chain : Create a chain that connects the prompt, LLM, and output.
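
The difference between the two retriever categories in Step 5 can be illustrated with a toy example. This is a minimal sketch in plain Python, not LangChain code: the corpus, the three-dimensional "embedding" vectors, and the helper names `sparse_score` and `dense_score` are all made up for illustration.

```python
import math

# Toy corpus. The 3-dimensional "embeddings" are invented numbers, not real model output.
corpus = {
    "AI adoption in urban mobility": [0.9, 0.1, 0.0],
    "Predictive maintenance in manufacturing": [0.1, 0.9, 0.1],
    "Transport platforms for city travellers": [0.8, 0.2, 0.1],
}


def sparse_score(query: str, text: str) -> int:
    """Keyword-based: count the query words that also appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words)


def dense_score(query_vec: list[float], doc_vec: list[float]) -> float:
    """Similarity-based: cosine similarity between two embedding vectors."""
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm = math.sqrt(sum(q * q for q in query_vec)) * math.sqrt(sum(d * d for d in doc_vec))
    return dot / norm


query = "urban mobility"
query_vec = [1.0, 0.0, 0.0]  # hypothetical embedding of the query

sparse_ranking = sorted(corpus, key=lambda t: sparse_score(query, t), reverse=True)
dense_ranking = sorted(corpus, key=lambda t: dense_score(query_vec, corpus[t]), reverse=True)
print(sparse_ranking)
print(dense_ranking)
```

In this toy data the third text shares no keyword with the query, so the sparse score gives it 0, while the dense score still ranks it second because its vector lies close to the query vector.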

References

LangChain How-to guides : Q&A with RAG

Document Used for Practice

A European Approach to Artificial Intelligence - A Policy Perspective

Author: EIT Digital and 5 EIT KICs (EIT Manufacturing, EIT Urban Mobility, EIT Health, EIT Climate-KIC, EIT Digital)
Link: https://eit.europa.eu/sites/default/files/eit-digital-artificial-intelligence-report.pdf
File Name: A European Approach to Artificial Intelligence - A Policy Perspective.pdf

Please copy the downloaded file to the data folder for practice.

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

langchain-opentutorial is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials.
Check out langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial

Install the required packages, then set the API key and the other environment variables.

# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_community",
        "langsmith",
        "langchain",
        "langchain_text_splitters",
        "langchain_core",
        "langchain_openai",
    ],
    verbose=False,
    upgrade=False,
)

# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {   "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "RAG-Basic-PDF",
    }
)

Environment variables have been set successfully.

# Configuration file for managing API keys as environment variables
from dotenv import load_dotenv

# Load API key information
load_dotenv(override=True)

True

RAG Basic Pipeline

Below is the skeleton code for understanding the basic structure of RAG (Retrieval Augmented Generation).

Each module can be adjusted to fit a specific scenario, so the pipeline can be improved iteratively to suit your documents.

Different options or new techniques can be applied at each step.

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Step 1: Load Documents
loader = PyMuPDFLoader("./data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf")
docs = loader.load()
print(f"Number of pages in the document: {len(docs)}")

Number of pages in the document: 24

Print the content of one page. Index 10 is the eleventh page, because list indexing starts at 0.

print(docs[10].page_content)

A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE
    11
    GENERIC 
    There are five issues that, though from slightly different angles, 
    are considered strategic and a potential source of barriers and 
    bottlenecks: data, organisation, human capital, trust, markets. The 
    availability and quality of data, as well as data governance are of 
    strategic importance. Strictly technical issues (i.e., inter-operabi-
    lity, standardisation) are mostly being solved, whereas internal and 
    external data governance still restrain the full potential of AI Inno-
    vation. Organisational resources and, also, cognitive and cultural 
    routines are a challenge to cope with for full deployment. On the 
    one hand, there is the issue of the needed investments when evi-
    dence on return is not yet consolidated. On the other hand, equally 
    important, are cultural conservatism and misalignment between 
    analytical and business objectives. Skills shortages are a main 
    bottleneck in all the four sectors considered in this report where 
    upskilling, reskilling, and new skills creation are considered crucial. 
    For many organisations data scientists are either too expensive or 
    difficult to recruit and retain. There is still a need to build trust on 
    AI, amongst both the final users (consumers, patients, etc.) and 
    intermediate / professional users (i.e., healthcare professionals). 
    This is a matter of privacy and personal data protection, of building 
    a positive institutional narrative backed by mitigation strategies, 
    and of cumulating evidence showing that benefits outweigh costs 
    and risks. As demand for AI innovations is still limited (in many 
    sectors a ‘wait and see’ approach is prevalent) this does not fa-
    vour the emergence of a competitive supply side. Few start-ups 
    manage to scale up, and many are subsequently bought by a few 
    large dominant players. As a result of the fact that these issues 
    have not yet been solved on a large scale, using a 5 levels scale 
    GENERIC AND CONTEXT DEPENDING 
    OPPORTUNITIES AND POLICY LEVERS
    of deployment maturity (1= not started; 2= experimentation; 3= 
    practitioner use; 4= professional use; and 5= AI driven companies), 
    it seems that, in all four vertical domains considered, adoption re-
    mains at level 2 (experimentation) or 3 (practitioner use), with only 
    few advanced exceptions mostly in Manufacturing and Health-
    care. In Urban Mobility, as phrased by interviewed experts, only 
    lightweight AI applications are widely adopted, whereas in the Cli-
    mate domain we are just at the level of early predictive models. 
    Considering the different areas of AI applications, regardless of the 
    domains, the most adopted ones include predictive maintenance, 
    chatbots, voice/text recognition, NPL, imagining, computer vision 
    and predictive analytics.
    MANUFACTURING 
    The manufacturing sector is one of the leaders in application of 
    AI technologies; from significant cuts in unplanned downtime to 
    better designed products, manufacturers are applying AI-powe-
    red analytics to data to improve efficiency, product quality and 
    the safety of employees. The key application of AI is certainly in 
    predictive maintenance. Yet, the more radical transformation of 
    manufacturing will occur when manufacturers will move to ‘ser-
    vice-based’ managing of the full lifecycle from consumers pre-
    ferences to production and delivery (i.e., the Industry 4.0 vision). 
    Manufacturing companies are investing into this vision and are 
    keen to protect their intellectual property generated from such in-
    vestments. So, there is a concern that a potential new legislative 
    action by the European Commission, which would follow the prin-
    ciples of the GDPR and the requirements of the White Paper, may

Check the metadata.

docs[10].__dict__

{'id': None,
     'metadata': {'source': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf',
      'file_path': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf',
      'page': 10,
      'total_pages': 24,
      'format': 'PDF 1.4',
      'title': '',
      'author': '',
      'subject': '',
      'keywords': '',
      'creator': 'Adobe InDesign 17.3 (Macintosh)',
      'producer': 'Adobe PDF Library 16.0.7',
      'creationDate': "D:20220823105611+02'00'",
      'modDate': "D:20220823105617+02'00'",
      'trapped': ''},
     'page_content': 'A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE\n11\nGENERIC \nThere are five issues that, though from slightly different angles, \nare considered strategic and a potential source of barriers and \nbottlenecks: data, organisation, human capital, trust, markets. The \navailability and quality of data, as well as data governance are of \nstrategic importance. Strictly technical issues (i.e., inter-operabi-\nlity, standardisation) are mostly being solved, whereas internal and \nexternal data governance still restrain the full potential of AI Inno-\nvation. Organisational resources and, also, cognitive and cultural \nroutines are a challenge to cope with for full deployment. On the \none hand, there is the issue of the needed investments when evi-\ndence on return is not yet consolidated. On the other hand, equally \nimportant, are cultural conservatism and misalignment between \nanalytical and business objectives. Skills shortages are a main \nbottleneck in all the four sectors considered in this report where \nupskilling, reskilling, and new skills creation are considered crucial. \nFor many organisations data scientists are either too expensive or \ndifficult to recruit and retain. There is still a need to build trust on \nAI, amongst both the final users (consumers, patients, etc.) and \nintermediate / professional users (i.e., healthcare professionals). \nThis is a matter of privacy and personal data protection, of building \na positive institutional narrative backed by mitigation strategies, \nand of cumulating evidence showing that benefits outweigh costs \nand risks. As demand for AI innovations is still limited (in many \nsectors a ‘wait and see’ approach is prevalent) this does not fa-\nvour the emergence of a competitive supply side. Few start-ups \nmanage to scale up, and many are subsequently bought by a few \nlarge dominant players. 
As a result of the fact that these issues \nhave not yet been solved on a large scale, using a 5 levels scale \nGENERIC AND CONTEXT DEPENDING \nOPPORTUNITIES AND POLICY LEVERS\nof deployment maturity (1= not started; 2= experimentation; 3= \npractitioner use; 4= professional use; and 5= AI driven companies), \nit seems that, in all four vertical domains considered, adoption re-\nmains at level 2 (experimentation) or 3 (practitioner use), with only \nfew advanced exceptions mostly in Manufacturing and Health-\ncare. In Urban Mobility, as phrased by interviewed experts, only \nlightweight AI applications are widely adopted, whereas in the Cli-\nmate domain we are just at the level of early predictive models. \nConsidering the different areas of AI applications, regardless of the \ndomains, the most adopted ones include predictive maintenance, \nchatbots, voice/text recognition, NPL, imagining, computer vision \nand predictive analytics.\nMANUFACTURING \nThe manufacturing sector is one of the leaders in application of \nAI technologies; from significant cuts in unplanned downtime to \nbetter designed products, manufacturers are applying AI-powe-\nred analytics to data to improve efficiency, product quality and \nthe safety of employees. The key application of AI is certainly in \npredictive maintenance. Yet, the more radical transformation of \nmanufacturing will occur when manufacturers will move to ‘ser-\nvice-based’ managing of the full lifecycle from consumers pre-\nferences to production and delivery (i.e., the Industry 4.0 vision). \nManufacturing companies are investing into this vision and are \nkeen to protect their intellectual property generated from such in-\nvestments. So, there is a concern that a potential new legislative \naction by the European Commission, which would follow the prin-\nciples of the GDPR and the requirements of the White Paper, may \n',
     'type': 'Document'}

# Step 2: Split Documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_documents = text_splitter.split_documents(docs)
print(f"Number of split chunks: {len(split_documents)}")

Number of split chunks: 163
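
To see what chunk_size and chunk_overlap control, here is a minimal sketch in plain Python. It is a simple fixed-width sliding window, not the algorithm used by RecursiveCharacterTextSplitter (which also tries to break on separators such as paragraphs and lines); the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the role the overlap plays: text cut at a chunk boundary still appears with some surrounding context in the neighbouring chunk.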

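# Step 3: Generate Embeddings
# Create the embedding model. The vectors themselves are computed when the store is built.
embeddings = OpenAIEmbeddings()

# Step 4: Create the Database
# Embed the chunks and build an in-memory FAISS vector store from them.
vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)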

for doc in vectorstore.similarity_search("URBAN MOBILITY"):
    print(doc.page_content)

A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE
    14
    Table 3: Urban Mobility: concerns, opportunities and policy levers.
    URBAN MOBILITY
    The adoption of AI in the management of urban mobility systems 
    brings different sets of benefits for private stakeholders (citizens, 
    private companies) and public stakeholders (municipalities, trans-
    portation service providers). So far only light-weight task specific 
    AI applications have been deployed (i.e., intelligent routing, sharing
    A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE
    15
    One of the most interesting development close to scale up is the 
    creation of platforms, which are fed by all different data sources 
    of transport services (both private and public) and provide the ci-
    tizens a targeted recommendation on the best way to travel, also 
    based on personal preferences and characteristics. 
    Urban Mobility should focus on what is already potentially avai-
    apps, predictive models based on citizens’ location and personal 
    data). On the other hand, the most advanced and transformative 
    AI applications, such as autonomous vehicles are lagging behind, 
    especially if compared to the US or China. The key challenge for AI 
    deployment in Urban Mobility sector is the need to find a common 
    win-win business model across a diversity of public and private 
    sector players with different organisational objectives, cultures,
    care. In Urban Mobility, as phrased by interviewed experts, only 
    lightweight AI applications are widely adopted, whereas in the Cli-
    mate domain we are just at the level of early predictive models. 
    Considering the different areas of AI applications, regardless of the 
    domains, the most adopted ones include predictive maintenance, 
    chatbots, voice/text recognition, NPL, imagining, computer vision 
    and predictive analytics.
    MANUFACTURING

# Step 5: Create Retriever
# Search and retrieve information contained in the documents.
retriever = vectorstore.as_retriever()

Send a query to the retriever and check the resulting chunks.

retriever.invoke("What is the phased implementation timeline for the EU AI Act?")

[Document(id='fdfb5187-141a-4693-b5d0-e1066b0ef27f', metadata={'source': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'file_path': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'page': 9, 'total_pages': 24, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe InDesign 17.3 (Macintosh)', 'producer': 'Adobe PDF Library 16.0.7', 'creationDate': "D:20220823105611+02'00'", 'modDate': "D:20220823105617+02'00'", 'trapped': ''}, page_content='A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE\n10\nrequirements becomes mandatory in all sectors and create bar-\nriers especially for innovators and SMEs. Public procurement ‘data \nsovereignty clauses’ induce large players to withdraw from AI for \nurban ecosystems. Strict liability sanctions block AI in healthcare, \nwhile limiting space of self-driving experimentation. The support \nmeasures to boost European AI are not sufficient to offset the'),
     Document(id='5aada0ed-9a07-4c9b-a290-d24856d64494', metadata={'source': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'file_path': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'page': 22, 'total_pages': 24, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe InDesign 17.3 (Macintosh)', 'producer': 'Adobe PDF Library 16.0.7', 'creationDate': "D:20220823105611+02'00'", 'modDate': "D:20220823105617+02'00'", 'trapped': ''}, page_content='A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE\n23\nACKNOWLEDGEMENTS\nIn the context of their strategic innovation activities for Europe, five EIT Knowledge and Innovation Communities (EIT Manufacturing, EIT Ur-\nban Mobility, EIT Health, EIT Climate-KIC, and EIT Digital as coordinator) decided to launch a study that identifies general and sector specific \nconcerns and opportunities for the deployment of AI in Europe.'),
     Document(id='37657411-894d-4e9c-975b-d1a99ef0e20a', metadata={'source': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'file_path': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'page': 21, 'total_pages': 24, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe InDesign 17.3 (Macintosh)', 'producer': 'Adobe PDF Library 16.0.7', 'creationDate': "D:20220823105611+02'00'", 'modDate': "D:20220823105617+02'00'", 'trapped': ''}, page_content='sion/presscorner/detail/en/IP_18_6689.\nEuropean Commission. (2020a). White Paper on Artificial Intelligence. A European Ap-\nproach to Excellence and Trust. COM(2020) 65 final, Brussels: European Commission. \nEuropean Commission. (2020b). A European Strategy to Data. COM(2020) 66 final, Brus-\nsels: European Commission. \nEuropean Parliament. (2020). Digital sovereignty for Europe. Brussels: European Parliament \n(retrieved from: https://www.europarl.europa.eu/RegData/etudes/BRIE/2020/651992/'),
     Document(id='1aa90862-fe35-4797-ad6a-225f9da47824', metadata={'source': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'file_path': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'page': 5, 'total_pages': 24, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe InDesign 17.3 (Macintosh)', 'producer': 'Adobe PDF Library 16.0.7', 'creationDate': "D:20220823105611+02'00'", 'modDate': "D:20220823105617+02'00'", 'trapped': ''}, page_content='ries and is the result of a combined effort from five EIT KICs (EIT \nManufacturing, EIT Urban Mobility, EIT Health, EIT Climate-KIC, \nand EIT Digital as coordinator). It identifies both general and sec-\ntor specific concerns and opportunities for the further deployment \nof AI in Europe. Starting from the background and policy context \noutlined in this introduction, some critical aspects of AI are fur-\nther discussed in Section 2. Next, in Section 3 four scenarios')]
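
Each result carries the chunk text in page_content alongside a long metadata dictionary. When only the text is needed, a small helper can join the page_content values. This is a sketch using a stand-in `Doc` class so that it runs without LangChain; `Doc` and `format_docs` are hypothetical names and are not part of the tutorial code.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs: list[Doc]) -> str:
    """Join the text of each document, separated by a blank line, dropping the metadata."""
    return "\n\n".join(doc.page_content for doc in docs)


results = [
    Doc("requirements becomes mandatory in all sectors", {"page": 9}),
    Doc("ACKNOWLEDGEMENTS", {"page": 22}),
]
print(format_docs(results))
```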

# Step 6: Create Prompt
prompt = PromptTemplate.from_template(
    """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 

#Context: 
{context}

#Question:
{question}

#Answer:"""
)
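
The {context} and {question} placeholders in the template are filled in at run time. The sketch below shows the substitution with plain str.format instead of PromptTemplate, so it runs without LangChain; the context sentence and the question are sample values chosen for illustration.

```python
template = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.

#Context:
{context}

#Question:
{question}

#Answer:"""

# Sample values standing in for the retriever output and the user's question.
retrieved_context = "So far only light-weight task specific AI applications have been deployed."
user_question = "Which AI applications have been deployed in urban mobility?"

filled_prompt = template.format(context=retrieved_context, question=user_question)
print(filled_prompt)
```

The filled text is what the language model receives: the instructions, the retrieved context, the question, and an open "#Answer:" line for it to complete.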

# Step 7: Set up LLM
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Step 8: Create Chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
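
The chain reads left to right: the first step builds a dictionary in which "context" holds the retriever's result for the question and "question" holds the question unchanged, and each later step receives the previous step's output. The sketch below mimics that data flow with plain Python functions; the retriever and the model are fakes, and all function names are hypothetical.

```python
def fake_retriever(question: str) -> list[str]:
    """Stand-in retriever: always returns the same chunk."""
    return ["AI in healthcare is mostly used for administrative and diagnostic tasks."]


def build_inputs(question: str) -> dict:
    """Mirrors the first chain step: retrieve context and pass the question through."""
    return {"context": fake_retriever(question), "question": question}


def fill_prompt(inputs: dict) -> str:
    """Mirrors the prompt step: turn the dictionary into one prompt string."""
    return f"#Context:\n{inputs['context']}\n\n#Question:\n{inputs['question']}\n\n#Answer:"


def fake_llm(prompt_text: str) -> dict:
    """Stand-in model: returns a message-like dictionary instead of calling an API."""
    return {"content": f"(model reply to a prompt of {len(prompt_text)} characters)"}


def parse_output(message: dict) -> str:
    """Mirrors the output parser: keep only the text of the reply."""
    return message["content"]


def run_chain(question: str) -> str:
    """Feed each step's output into the next step, as the | operator does."""
    value = question
    for step in (build_inputs, fill_prompt, fake_llm, parse_output):
        value = step(value)
    return value


print(run_chain("Where has AI in healthcare been applied so far?"))
```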

Input a query (question) into the created chain and execute it.

# Run Chain
# Input a query about the document and print the response.
question = "Where has the application of AI in healthcare been confined to so far?"
response = chain.invoke(question)
print(response)

The application of AI in healthcare has so far been confined to administrative tasks, such as Natural Language Processing to extract information from clinical notes or predictive scheduling of visits, and diagnostic tasks, including machine and deep learning applied to imaging in radiology, pathology, and dermatology.

Complete Code

The code below combines steps 1 through 8.

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Step 1: Load Documents
loader = PyMuPDFLoader("./data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf")
docs = loader.load()

# Step 2: Split Documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_documents = text_splitter.split_documents(docs)

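# Step 3: Generate Embeddings
# Create the embedding model. The vectors themselves are computed when the store is built.
embeddings = OpenAIEmbeddings()

# Step 4: Create the Database
# Embed the chunks and build an in-memory FAISS vector store from them.
vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)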

# Step 5: Create Retriever
# Search and retrieve information contained in the documents.
retriever = vectorstore.as_retriever()

# Step 6: Create Prompt
prompt = PromptTemplate.from_template(
    """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 

#Context: 
{context}

#Question:
{question}

#Answer:"""
)

# Step 7: Set up LLM
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Step 8: Create Chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Run Chain
# Input a query about the document and print the response.
question = "Where has the application of AI in healthcare been confined to so far?"
response = chain.invoke(question)
print(response)

The application of AI in healthcare has so far been confined to administrative tasks, such as Natural Language Processing to extract information from clinical notes or predictive scheduling of visits, and diagnostic tasks, including machine and deep learning applied to imaging in radiology, pathology, and dermatology.
