RAG Basic WebBaseLoader


Overview

This tutorial covers the implementation of a news-article QA app that can query the content of news articles on the web, as RAG practice. This guide builds a RAG pipeline using OpenAI chat models, embeddings, and a Chroma vector store, drawing on Forbes news pages and Naver News, the most popular news website in Korea.

1. Pre-processing - Steps 1 to 4

Pre-processing

The pre-processing stage involves four steps to load, split, embed, and store documents into a Vector DB (database).

  • Step 1: Document Load : Load the document content.

  • Step 2: Text Split : Split the document into chunks based on specific criteria.

  • Step 3: Embedding : Generate embeddings for the chunks and prepare them for storage.

  • Step 4: Vector DB Storage : Store the embedded chunks in the database.
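The four steps above can be sketched end to end with stubbed stand-ins. Every function below is illustrative, not a LangChain API: a real pipeline would use `WebBaseLoader`, `RecursiveCharacterTextSplitter`, `OpenAIEmbeddings`, and Chroma or FAISS in their place.

```python
# Sketch of the four pre-processing steps with stubbed stand-ins.
# All names here are illustrative, not LangChain APIs.

def load_document() -> str:
    # Step 1: Document Load (a real app would use WebBaseLoader here)
    return "LangChain makes RAG easy. RAG retrieves context before answering."

def split_text(text: str, chunk_size: int = 40) -> list[str]:
    # Step 2: Text Split into fixed-size chunks
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(chunk: str) -> list[float]:
    # Step 3: Embedding (toy vowel-frequency vector instead of OpenAIEmbeddings)
    return [chunk.count(c) / len(chunk) for c in "aeiou"]

# Step 4: Vector DB Storage (a list standing in for Chroma/FAISS)
vector_db: list[tuple[list[float], str]] = []
for chunk in split_text(load_document()):
    vector_db.append((embed(chunk), chunk))
```

The real components differ in sophistication, but the data flow (load, split, embed, store) is exactly this shape.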

2. RAG Execution (RunTime) - Steps 5 to 8

RAG Execution
  • Step 5: Retriever : Define a retriever to fetch results from the database based on the input query. Retrievers use search algorithms and are categorized as Dense or Sparse:

    • Dense : Similarity-based search.

    • Sparse : Keyword-based search.

  • Step 6: Prompt : Create a prompt for executing RAG. The context in the prompt includes content retrieved from the document. Through prompt engineering, you can specify the format of the answer.

  • Step 7: LLM : Define the language model (e.g., GPT, Claude, Gemini).

  • Step 8: Chain : Create a chain that connects the prompt, LLM, and output.
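The four runtime steps can likewise be sketched with stand-in functions (none of these are LangChain APIs; the retriever here uses a toy keyword-overlap score, i.e., the Sparse style from Step 5):

```python
# Sketch of the runtime steps: Retriever -> Prompt -> LLM -> Chain.
# All names are illustrative stand-ins, not LangChain APIs.

DOCS = [
    "FAISS stores dense vectors.",
    "Sparse search matches keywords.",
    "Chains connect components.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Step 5: Retriever -- toy keyword-overlap (sparse) scoring
    q_words = set(query.lower().split())
    scored = sorted(DOCS, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def build_prompt(context: list[str], question: str) -> str:
    # Step 6: Prompt -- inject retrieved context into a template
    return f"Context: {' '.join(context)}\nQuestion: {question}\nAnswer:"

def llm(prompt: str) -> str:
    # Step 7: LLM -- echo stub in place of ChatOpenAI
    return prompt.splitlines()[0].removeprefix("Context: ")

def chain(question: str) -> str:
    # Step 8: Chain -- compose retriever, prompt, and model
    return llm(build_prompt(retrieve(question), question))

answer = chain("how does sparse search work?")
```

In LangChain, the same composition is expressed declaratively with the `|` operator over runnables rather than nested function calls.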



Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, along with useful functions and utilities for these tutorials.

  • You can check out langchain-opentutorial for more details.

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

If a warning is displayed because USER_AGENT is not set when using WebBaseLoader, add USER_AGENT=myagent to the .env file.
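Alternatively, you can set the variable in code before the loader makes a request; the value myagent below just mirrors the .env entry and is only a placeholder:

```python
import os

# WebBaseLoader reads the USER_AGENT environment variable when fetching pages.
# Setting it before the loader runs silences the warning; "myagent" is a
# placeholder value -- use an identifier meaningful for your application.
os.environ["USER_AGENT"] = "myagent"
```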

Web News-Based QA (Question-Answering) Chatbot

In this tutorial, we'll implement a news-article QA app that can query the content of news articles on the web, as RAG practice. This guide builds a RAG pipeline using OpenAI chat models, embeddings, and a FAISS vector store, drawing on Forbes news pages and Naver News, the most popular news website in Korea.

First, through the following process, we can implement a simple indexing pipeline and RAG chain in approximately 20 lines of code.

[Note]

  • bs4 is a library for parsing web pages.

  • langchain is a library that provides various AI-related functionalities. Here, we'll specifically cover text splitting (RecursiveCharacterTextSplitter), document loading (WebBaseLoader), vector storage (Chroma, FAISS), output parsing (StrOutputParser), and runnable passthrough (RunnablePassthrough).

  • Through the langchain_openai module, we can use OpenAI's chatbot (ChatOpenAI) and embedding (OpenAIEmbeddings) functionalities.

We implement a process that loads web page content, splits text into chunks for indexing, and then searches for relevant text snippets to generate new content.

WebBaseLoader uses bs4.SoupStrainer to parse only the necessary parts from the specified web page.

[Note]

  • bs4.SoupStrainer allows you to conveniently retrieve desired elements from the web.

(Example)

You can retrieve the main news from the Forbes page and check its title and content as follows.
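As a rough offline illustration of the idea (the HTML snippet, tag names, and class names below are made up; real Forbes or Naver pages require their own selectors), SoupStrainer restricts parsing to matching elements only:

```python
from bs4 import BeautifulSoup, SoupStrainer

# A small local HTML stand-in for a news page. The "article-body" and
# "sidebar" class names are invented for this example.
html = """
<html><body>
  <h1 class="headline">Sample headline</h1>
  <div class="article-body">First paragraph of the article.</div>
  <div class="sidebar">Unrelated sidebar text.</div>
</body></html>
"""

# SoupStrainer tells BeautifulSoup to parse ONLY the matching elements;
# WebBaseLoader forwards this via bs_kwargs={"parse_only": ...}.
only_article = SoupStrainer("div", attrs={"class": "article-body"})
soup = BeautifulSoup(html, "html.parser", parse_only=only_article)
text = soup.get_text(strip=True)
```

Because everything outside the matched `div` is skipped during parsing, the sidebar never enters the document at all, which keeps noise out of the chunks you later embed.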

Similarly, you can load news articles from Naver News article pages using the same method.

RecursiveCharacterTextSplitter splits documents into chunks of specified size.
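A simplified, stdlib-only sketch of the recursive idea (an approximation of the behavior, not LangChain's implementation): try the coarsest separator first, and fall back to finer ones only for pieces that are still too long.

```python
# Approximate sketch of RecursiveCharacterTextSplitter's strategy:
# split on paragraphs first, then sentences, then words, recursing
# only into pieces that still exceed chunk_size.

def recursive_split(text: str, chunk_size: int,
                    seps: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    if len(text) <= chunk_size or not seps:
        return [text]
    chunks: list[str] = []
    for part in text.split(seps[0]):
        if len(part) <= chunk_size:
            chunks.append(part)
        else:
            # Fall back to the next, finer separator
            chunks.extend(recursive_split(part, chunk_size, seps[1:]))
    return [c for c in chunks if c]

sample = "First paragraph here.\n\nSecond one is quite a bit longer. It has two sentences."
chunks = recursive_split(sample, chunk_size=40)
```

The real splitter also supports chunk overlap and length functions (e.g., token counts), but the separator-fallback recursion is the core of it.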

Vector stores like FAISS or Chroma generate vector representations of documents based on these chunks.
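As a toy illustration of what the store does at query time, here is a bag-of-words "embedding" with cosine similarity standing in for real learned embeddings and a FAISS/Chroma index (none of this is the actual API):

```python
import math

# Toy dense retrieval: bag-of-words vectors + cosine similarity,
# standing in for OpenAIEmbeddings + a FAISS/Chroma index.

def embed(text: str) -> dict[str, int]:
    vec: dict[str, int] = {}
    for word in text.lower().split():
        word = word.strip(".,?!")
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "Chroma is a vector store.",
    "FAISS indexes dense vectors.",
    "Prompts guide the model.",
]
index = [(embed(c), c) for c in chunks]

def search(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    return [c for _, c in sorted(index, key=lambda p: -cosine(q, p[0]))[:k]]

top = search("which store indexes dense vectors?")
```

Real vector stores replace the word-count vectors with dense embedding vectors and use optimized nearest-neighbor search, but the retrieve-by-similarity contract is the same.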

The retriever created through vectorstore.as_retriever() fetches the relevant chunks, which are then combined with the prompt fetched via hub.pull and the ChatOpenAI model to generate new content.

Finally, StrOutputParser parses the generated results into a string.

[Note] If you practice with the Naver News URL, you can pull the teddynote/rag-prompt-korean prompt (written in Korean) from the hub.

In this case, the separate prompt writing process can be skipped.

To use streaming output, use stream_response.
