WebBaseLoader
Author: Kane, Yejin Park
Design: Kane
Peer Review: JoonHo Kim, Sunyoung Park (architectyou)
This is part of the LangChain Open Tutorial.
Overview
WebBaseLoader is a specialized document loader in LangChain designed for processing web-based content.
It leverages the BeautifulSoup4 library to parse web pages effectively, offering customizable parsing options through SoupStrainer and additional bs4 parameters.
This tutorial demonstrates how to use WebBaseLoader to:
Load and parse web documents effectively
Customize parsing behavior using BeautifulSoup options
Handle different web content structures flexibly
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
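As a minimal sketch of the setup (the exact package list and the user-agent string are assumptions, not part of the original tutorial):

```python
# Packages this tutorial assumes (install via pip):
#   pip install langchain-community beautifulsoup4 nest_asyncio markitdown
import os

# WebBaseLoader reads the USER_AGENT environment variable; setting one
# explicitly avoids a startup warning and identifies your scraper politely.
os.environ.setdefault("USER_AGENT", "LangChainOpenTutorial/1.0")
```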
Load Web-based documents
WebBaseLoader is a loader designed for loading web-based documents.
It uses the bs4 library to parse web pages.
Key Features:
Uses bs4.SoupStrainer to specify the elements to parse.
Accepts additional arguments for bs4.SoupStrainer through the bs_kwargs parameter.
For more details, refer to the API documentation.
To bypass SSL verification errors, you can set the "verify" option, which is forwarded to the underlying requests call.
You can also load multiple webpages at once by passing a list of URLs to the loader, which returns a list of documents in the order of the URLs passed.
Output the results fetched from the web.
Load Multiple URLs Concurrently with alazy_load()
You can speed up scraping and parsing multiple URLs by using asynchronous loading. This allows you to fetch documents concurrently, improving efficiency while adhering to rate limits.
Key Points:
Rate Limit: The requests_per_second parameter controls how many requests are made per second. In this example, it is set to 1 to avoid overloading the server.
Asynchronous Loading: The alazy_load() function loads documents asynchronously, enabling faster processing of multiple URLs.
Jupyter Notebook Compatibility: If running in a Jupyter Notebook, nest_asyncio is required to handle asynchronous tasks properly.
The code below demonstrates how to configure and load documents asynchronously:
Load XML Documents
WebBaseLoader can process XML files by specifying a different BeautifulSoup parser. This is particularly useful when working with structured XML content like sitemaps or government data.
Basic XML Loading
The following example demonstrates loading an XML document from a government website:
Memory-Efficient Loading
For handling large documents, WebBaseLoader provides two memory-efficient loading methods:
lazy_load() - loads one page at a time
alazy_load() - asynchronous page loading for better performance
Load Web-based Document Using Proxies
Sometimes you may need to use proxies to bypass IP blocking.
To use a proxy, you can pass a proxy dictionary to the loader (and its underlying requests library).
⚠️ Warning:
Replace {username}, {password}, and proxy.service.com with your actual proxy credentials and server information. Without a valid proxy configuration, errors such as ProxyError or AuthenticationError may occur.
Simple Web Content Loading with MarkItDown
Unlike WebBaseLoader, which uses BeautifulSoup4 for sophisticated HTML parsing, MarkItDown takes a simpler approach to web content loading. It fetches web content directly via HTTP requests and transforms it into Markdown format without detailed parsing capabilities.
Below is a basic example of loading web content using MarkItDown: