WebBaseLoader


Overview

WebBaseLoader is a specialized document loader in LangChain designed for processing web-based content.

It leverages the BeautifulSoup4 library to parse web pages effectively, offering customizable parsing options through SoupStrainer and additional bs4 parameters.

This tutorial demonstrates how to use WebBaseLoader to:

  1. Load and parse web documents effectively

  2. Customize parsing behavior using BeautifulSoup options

  3. Handle different web content structures flexibly



Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup helpers, plus useful functions and utilities for tutorials.

  • You can check out langchain-opentutorial for more details.

Load Web-based documents

WebBaseLoader is a loader designed for loading web-based documents.

It uses the bs4 library to parse web pages.

Key Features:

  • Uses bs4.SoupStrainer to specify elements to parse.

  • Accepts additional arguments for bs4.SoupStrainer through the bs_kwargs parameter.

For more details, refer to the API documentation.

To bypass SSL certificate verification errors, you can set the "verify" option to False in the loader's requests_kwargs.

You can also load multiple web pages at once by passing a list of URLs to the loader, which returns a list of documents in the same order as the URLs.

Output the results fetched from the web.

Load Multiple URLs Concurrently with alazy_load()

You can speed up the process of scraping and parsing multiple URLs by using asynchronous loading. This allows you to fetch documents concurrently, improving efficiency while adhering to rate limits.

Key Points:

  • Rate Limit : The requests_per_second parameter controls how many requests are made per second. In this example, it's set to 1 to avoid overloading the server.

  • Asynchronous Loading : The alazy_load() function is used to load documents asynchronously, enabling faster processing of multiple URLs.

  • Jupyter Notebook Compatibility : If running in Jupyter Notebook, nest_asyncio is required to handle asynchronous tasks properly.

The code below demonstrates how to configure and load documents asynchronously:

Load XML Documents

WebBaseLoader can process XML files by specifying a different BeautifulSoup parser. This is particularly useful when working with structured XML content like sitemaps or government data.

Basic XML Loading

The following example demonstrates loading an XML document from a government website:

Memory-Efficient Loading

For handling large documents, WebBaseLoader provides two memory-efficient loading methods:

  1. lazy_load() - loads one page at a time

  2. alazy_load() - asynchronous page loading for better performance

Load Web-based Document Using Proxies

Sometimes you may need to use proxies to bypass IP blocking.

To use a proxy, you can pass a proxy dictionary to the loader (and its underlying requests library).

⚠️ Warning:

  • Replace {username}, {password}, and proxy.service.com with your actual proxy credentials and server information.

  • Without a valid proxy configuration, errors such as ProxyError or AuthenticationError may occur.

Simple Web Content Loading with MarkItDown

Unlike WebBaseLoader, which uses BeautifulSoup4 for sophisticated HTML parsing, MarkItDown takes a simpler, more naive approach to web content loading. It fetches web content directly over HTTP and transforms it into Markdown format, without fine-grained parsing options.

Below is a basic example of loading web content using MarkItDown:
