WebBaseLoader


  • Author: Kane

  • Design: Kane

  • Peer Review: JoonHo Kim, Sunyoung Park (architectyou)

  • Author: Yejin Park

  • This is a part of LangChain Open Tutorial

Overview

WebBaseLoader is a specialized document loader in LangChain designed for processing web-based content.

It leverages the BeautifulSoup4 library to parse web pages effectively, offering customizable parsing options through SoupStrainer and additional bs4 parameters.

This tutorial demonstrates how to use WebBaseLoader to:

  1. Load and parse web documents

  2. Customize parsing behavior using BeautifulSoup options

  3. Handle different web content structures flexibly

Table of Contents

  • Overview
  • References
  • Environment Setup
  • Load Web-based documents
  • Load Multiple URLs Concurrently with alazy_load()
  • Load XML Documents
  • Load Web-based Document Using Proxies
  • Simple Web Content Loading with MarkItDown

References

  • WebBaseLoader API Documentation
  • BeautifulSoup4 Documentation

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.

  • You can check out the langchain-opentutorial package for more details.

%%capture --no-stderr
%pip install langchain-opentutorial markitdown
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_community",
    ],
    verbose=False,
    upgrade=False,
)
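
WebBaseLoader also reads a USER_AGENT environment variable and logs a warning when it is not set. A minimal, optional sketch (the user-agent string here is only an example) that sets it before the loader is imported:

import os

# Optional: WebBaseLoader warns ("USER_AGENT environment variable not set")
# when this variable is missing; setting it up front avoids the warning.
os.environ.setdefault("USER_AGENT", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")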

Load Web-based documents

WebBaseLoader is a loader designed for loading web-based documents.

It uses the bs4 library to parse web pages.

Key Features:

  • Uses bs4.SoupStrainer to specify elements to parse.

  • Accepts additional arguments for bs4.SoupStrainer through the bs_kwargs parameter.

For more details, refer to the API documentation.

import bs4
from langchain_community.document_loaders import WebBaseLoader

# Load news article content using WebBaseLoader
loader = WebBaseLoader(
    web_paths=("https://techcrunch.com/2024/12/28/google-ceo-says-ai-model-gemini-will-the-companys-biggest-focus-in-2025/",),
    # Configure BeautifulSoup to parse only specific div elements
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            "div",
            attrs={"class": ["entry-content wp-block-post-content is-layout-constrained wp-block-post-content-is-layout-constrained"]},
        )
    ),
    # Set the User-Agent request header to mimic a browser
    header_template={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    },
)

# Load and process the documents
docs = loader.load()
print(f"Number of documents: {len(docs)}")
docs[0]
Number of documents: 1
Document(metadata={'source': 'https://techcrunch.com/2024/12/28/google-ceo-says-ai-model-gemini-will-the-companys-biggest-focus-in-2025/'}, page_content='\nCEO Sundar Pichai reportedly told Google employees that 2025 will be a “critical” year for the company.\nCNBC reports that it obtained audio from a December 18 strategy meeting where Pichai and other executives put on ugly holiday sweaters and laid out their priorities for the coming year.\n\n\n\n\n\n\n\n\n“I think 2025 will be critical,” Pichai said. “I think it’s really important we internalize the urgency of this moment, and need to move faster as a company. The stakes are high.”\nThe moment, of course, is one where tech companies like Google are making heavy investments in AI, and often with mixed results. Pichai acknowledged that the company has some catching up to do on the AI side — he described the Gemini app (based on the company’s AI model of the same name) as having “strong momentum,” while also acknowledging “we have some work to do in 2025 to close the gap and establish a leadership position there as well.”\n“Scaling Gemini on the consumer side will be our biggest focus next year,” he said.\n')
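
If the extracted text still contains stray whitespace, WebBaseLoader also accepts a bs_get_text_kwargs parameter, which is forwarded to BeautifulSoup's get_text(). A minimal sketch reusing the same URL and class filter as above:

# Sketch: bs_get_text_kwargs is forwarded to BeautifulSoup's get_text(),
# letting you set a separator and strip whitespace around each fragment
loader = WebBaseLoader(
    web_paths=("https://techcrunch.com/2024/12/28/google-ceo-says-ai-model-gemini-will-the-companys-biggest-focus-in-2025/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            "div",
            attrs={"class": ["entry-content wp-block-post-content is-layout-constrained wp-block-post-content-is-layout-constrained"]},
        )
    ),
    bs_get_text_kwargs={"separator": " ", "strip": True},
)
docs = loader.load()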

To bypass SSL certificate verification errors, you can set the verify option to False via requests_kwargs.

# Bypass SSL certificate verification
loader.requests_kwargs = {"verify": False}

# Load documents from the web
docs = loader.load()
docs[0]
c:\Users\gram\AppData\Local\pypoetry\Cache\virtualenvs\langchain-opentutorial-RXtDr8w5-py3.11\Lib\site-packages\urllib3\connectionpool.py:1097: InsecureRequestWarning: Unverified HTTPS request is being made to host 'techcrunch.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings
      warnings.warn(
Document(metadata={'source': 'https://techcrunch.com/2024/12/28/google-ceo-says-ai-model-gemini-will-the-companys-biggest-focus-in-2025/'}, page_content='\nCEO Sundar Pichai reportedly told Google employees that 2025 will be a “critical” year for the company.\nCNBC reports that it obtained audio from a December 18 strategy meeting where Pichai and other executives put on ugly holiday sweaters and laid out their priorities for the coming year.\n\n\n\n\n\n\n\n\n“I think 2025 will be critical,” Pichai said. “I think it’s really important we internalize the urgency of this moment, and need to move faster as a company. The stakes are high.”\nThe moment, of course, is one where tech companies like Google are making heavy investments in AI, and often with mixed results. Pichai acknowledged that the company has some catching up to do on the AI side — he described the Gemini app (based on the company’s AI model of the same name) as having “strong momentum,” while also acknowledging “we have some work to do in 2025 to close the gap and establish a leadership position there as well.”\n“Scaling Gemini on the consumer side will be our biggest focus next year,” he said.\n')
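
requests_kwargs is passed straight through to the underlying requests call, so other options can be set the same way. A minimal sketch (the 10-second timeout is an arbitrary illustration):

# Sketch: requests_kwargs is forwarded to requests, so a timeout
# can be configured alongside verify (10 seconds is arbitrary here)
loader.requests_kwargs = {"verify": False, "timeout": 10}
docs = loader.load()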

You can also load multiple web pages at once by passing a list of URLs to the loader, which returns a list of documents in the same order as the URLs.

# Initialize the WebBaseLoader with web page paths and parsing configurations
loader = WebBaseLoader(
    web_paths=[
        # List of web pages to load
        "https://techcrunch.com/2024/12/28/revisiting-the-biggest-moments-in-the-space-industry-in-2024/",
        "https://techcrunch.com/2024/12/29/ai-data-centers-could-be-distorting-the-us-power-grid/",
    ],
    bs_kwargs=dict(
        # BeautifulSoup settings to parse only the specific content section
        parse_only=bs4.SoupStrainer(
            "div",
            attrs={"class": ["entry-content wp-block-post-content is-layout-constrained wp-block-post-content-is-layout-constrained"]},
        )
    ),
    header_template={
        # Custom HTTP headers for the request (e.g., User-Agent for simulating a browser)
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    },
)

# Load the data from the specified web pages
docs = loader.load()

# Check and print the number of documents loaded
print(len(docs))
2

Output the results fetched from the web.

print(docs[0].page_content[:500])
print("===" * 10)
print(docs[1].page_content[:500])
    We are at the dawn of a new space age.  If you doubt, simply look back at the last year: From SpaceX’s historic catch of the Super Heavy booster to the record-breaking number of lunar landing attempts, this year was full of historic and ambitious missions and demonstrations. 
    We’re taking a look back at the five most significant moments or trends in the space industry this year. Naysayers might think SpaceX is overrepresented on this list, but that just shows how far ahead the space behemoth is
    ==============================
    
    The proliferation of data centers aiming to meet the computational needs of AI could be bad news for the US power grid, according to a new report in Bloomberg.
    Using the 1 million residential sensors tracked by Whisker Labs, along with market intelligence data from DC Byte, Bloomberg found that more than half of the households showing the worst power distortions live within 20 miles of significant data center activity.
    
    
    
    
    
    
    
    
    In other words, there appears to be a link between data center proxi
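
Each loaded Document records its source URL in metadata, which makes it easy to check which page produced which content:

# Each document keeps its originating URL in metadata
for doc in docs:
    print(doc.metadata["source"], "-", len(doc.page_content), "characters")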

Load Multiple URLs Concurrently with alazy_load()

You can speed up the process of scraping and parsing multiple URLs by using asynchronous loading. This allows you to fetch documents concurrently, improving efficiency while adhering to rate limits.

Key Points:

  • Rate Limit : The requests_per_second parameter controls how many requests are made per second. In this example, it's set to 1 to avoid overloading the server.

  • Asynchronous Loading : The alazy_load() function is used to load documents asynchronously, enabling faster processing of multiple URLs.

  • Jupyter Notebook Compatibility : If running in Jupyter Notebook, nest_asyncio is required to handle asynchronous tasks properly.

The code below demonstrates how to configure and load documents asynchronously:

# only for jupyter notebook (asyncio)
import nest_asyncio

nest_asyncio.apply()
# Set the requests per second rate limit
loader.requests_per_second = 1

# Load documents asynchronously
# (aload() is deprecated since the LangChain 0.3.14 update, so alazy_load() is used instead)
docs = []
async for doc in loader.alazy_load():
    docs.append(doc)
# Display loaded documents
docs
[Document(metadata={'source': 'https://techcrunch.com/2024/12/28/google-ceo-says-ai-model-gemini-will-the-companys-biggest-focus-in-2025/'}, page_content='\nCEO Sundar Pichai reportedly told Google employees that 2025 will be a “critical” year for the company.\nCNBC reports that it obtained audio from a December 18 strategy meeting where Pichai and other executives put on ugly holiday sweaters and laid out their priorities for the coming year.\n\n\n\n\n\n\n\n\n“I think 2025 will be critical,” Pichai said. “I think it’s really important we internalize the urgency of this moment, and need to move faster as a company. The stakes are high.”\nThe moment, of course, is one where tech companies like Google are making heavy investments in AI, and often with mixed results. Pichai acknowledged that the company has some catching up to do on the AI side — he described the Gemini app (based on the company’s AI model of the same name) as having “strong momentum,” while also acknowledging “we have some work to do in 2025 to close the gap and establish a leadership position there as well.”\n“Scaling Gemini on the consumer side will be our biggest focus next year,” he said.\n')]
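
Outside Jupyter, where nest_asyncio is unnecessary, the same pattern can be driven with asyncio.run. A minimal sketch for a plain Python script:

import asyncio

# Sketch for scripts: collect the async generator into a list with asyncio.run
async def load_all(loader):
    return [doc async for doc in loader.alazy_load()]

docs = asyncio.run(load_all(loader))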

Load XML Documents

WebBaseLoader can process XML files by specifying a different BeautifulSoup parser. This is particularly useful when working with structured XML content like sitemaps or government data.

Basic XML Loading

The following example demonstrates loading an XML document from a government website:

from langchain_community.document_loaders import WebBaseLoader

# Initialize loader with XML document URL
loader = WebBaseLoader(
    "https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml"
)

# Set parser to XML mode
loader.default_parser = "xml"

# Load and process the document
docs = loader.load()

Memory-Efficient Loading

For handling large documents, WebBaseLoader provides two memory-efficient loading methods:

  1. lazy_load() - loads one page at a time

  2. alazy_load() - asynchronous page loading for better performance

# Lazy Loading Example
pages = []
for doc in loader.lazy_load():
    pages.append(doc)

# Print first 100 characters and metadata of the first page
print(pages[0].page_content[:100])
print(pages[0].metadata)
    
    10
    Energy
    3
    2018-01-01
    2018-01-01
    false
    Uniform test method for the measurement of energy efficien
    {'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}
# Async Loading Example
pages = []
async for doc in loader.alazy_load():
    pages.append(doc)

# Print first 100 characters and metadata of the first page
print(pages[0].page_content[:100])
print(pages[0].metadata)
    
    10
    Energy
    3
    2018-01-01
    2018-01-01
    false
    Uniform test method for the measurement of energy efficien
    {'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}

Load Web-based Document Using Proxies

Sometimes you may need to use proxies to bypass IP blocking.

To use a proxy, you can pass a proxy dictionary to the loader (and its underlying requests library).

⚠️ Warning:

  • Replace {username}, {password}, and proxy.service.com with your actual proxy credentials and server information.

  • Without a valid proxy configuration, errors such as ProxyError or AuthenticationError may occur.

loader = WebBaseLoader(
    "https://www.google.com/search?q=parrots",
    # Configure the proxy for both HTTP and HTTPS requests
    proxies={
        "http": "http://{username}:{password}@proxy.service.com:6666/",
        "https": "https://{username}:{password}@proxy.service.com:6666/",
    },
)

# Load documents using the proxy
docs = loader.load()
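
To keep credentials out of source code, the proxy URL can be assembled from environment variables instead. A minimal sketch, assuming hypothetical PROXY_USER and PROXY_PASS variables and the same placeholder proxy.service.com host:

import os

# Sketch: hypothetical PROXY_USER / PROXY_PASS environment variables replace
# hardcoded credentials; proxy.service.com is still a placeholder host
proxy_url = f"http://{os.environ['PROXY_USER']}:{os.environ['PROXY_PASS']}@proxy.service.com:6666/"

loader = WebBaseLoader(
    "https://www.google.com/search?q=parrots",
    proxies={"http": proxy_url, "https": proxy_url},
)
docs = loader.load()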

Simple Web Content Loading with MarkItDown

Unlike WebBaseLoader, which uses BeautifulSoup4 for fine-grained HTML parsing, MarkItDown takes a simpler, more naive approach to web content loading. It fetches pages directly over HTTP and converts them to markdown wholesale, without selective parsing options.

Below is a basic example of loading web content using MarkItDown:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://techcrunch.com/2024/12/28/revisiting-the-biggest-moments-in-the-space-industry-in-2024/")
result_text = result.text_content
print(result_text[:1000])
    
    [![](https://techcrunch.com/wp-content/uploads/2024/09/tc-lockup.svg) TechCrunch Desktop Logo](https://techcrunch.com)
    
    [![](https://techcrunch.com/wp-content/uploads/2024/09/tc-logo-mobile.svg) TechCrunch Mobile Logo](https://techcrunch.com)
    
    * [Latest](/latest/)
    * [Startups](/category/startups/)
    * [Venture](/category/venture/)
    * [Apple](/tag/apple/)
    * [Security](/category/security/)
    * [AI](/category/artificial-intelligence/)
    * [Apps](/category/apps/)
    
    * [Events](/events/)
    * [Podcasts](/podcasts/)
    * [Newsletters](/newsletters/)
    
    [Sign In](https://oidc.techcrunch.com/login/?dest=https%3A%2F%2Ftechcrunch.com%2F2024%2F12%2F28%2Frevisiting-the-biggest-moments-in-the-space-industry-in-2024%2F)
    [![]()](https://techcrunch.com/my-account/)
    
    SearchSubmit
    
    Site Search Toggle
    
    Mega Menu Toggle
    
    ### Topics
    
    [Latest](/latest/)
    
    [AI](/category/artificial-intelligence/)
    
    [Amazon](/tag/amazon/)
    
    [Apps](/category/apps/)
    
    [Biotech & Health](/category/biotech-health/)
    
    [Climate](/category/climate/)
    
    [
