PDF Loader


  • Author: Yejin Park

  • Peer Review: Yun Eun, MinJi Kang

  • This is a part of LangChain Open Tutorial

Overview

This tutorial covers various PDF processing methods using LangChain and popular PDF libraries.

PDF processing is essential for extracting and analyzing text data from PDF documents.

In this tutorial, we will explore different PDF loaders and their capabilities while working with LangChain's document processing framework.

Table of Contents

  • Overview
  • Environment Setup
  • How to load PDFs
  • PyPDF
  • PyPDF (OCR)
  • PyPDF Directory
  • PyMuPDF
  • Unstructured
  • PyPDFium2
  • PDFMiner
  • Using PDFMiner to generate HTML text
  • PDFPlumber
  • References

References

  • LangChain: How to load PDFs

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.
  • You can check out the langchain-opentutorial package for more details.

%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_community",
        "langchain_text_splitters",
        "pypdf",
        "rapidocr-onnxruntime",
        "pymupdf",
        "unstructured[pdf]"
    ],
    verbose=False,
    upgrade=False,
)
    [notice] A new release of pip available: 22.3.1 -> 24.3.1
    [notice] To update, run: pip install --upgrade pip
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "PDFLoader",
    }
)
Environment variables have been set successfully.

How to load PDFs

Portable Document Format (PDF), a file format standardized as ISO 32000, was developed by Adobe in 1992 to present documents, including text formatting and images, in a way that is independent of application software, hardware, and operating systems.

This guide covers how to load a PDF document into the LangChain Document format, which is used downstream.

LangChain integrates with a variety of PDF parsers. Some are simple and relatively low-level, while others support OCR and image processing or perform advanced document layout analysis.

The right choice depends on your application.

We will demonstrate these approaches on a sample file: the LayoutParser paper. Download the sample file and copy it to your data folder.

FILE_PATH = "./data/layout-parser-paper.pdf"
def show_metadata(docs):
    if docs:
        print("[metadata]")
        print(list(docs[0].metadata.keys()))
        print("\n[examples]")
        max_key_length = max(len(k) for k in docs[0].metadata.keys())
        for k, v in docs[0].metadata.items():
            print(f"{k:<{max_key_length}} : {v}")

PyPDF

PyPDF is one of the most widely used Python libraries for PDF processing.

LangChain's PyPDFLoader integrates with PyPDF to parse PDF documents into LangChain Document objects.

Here we use PyPDF to load the PDF as a list of Document objects.

from langchain_community.document_loaders import PyPDFLoader

# Initialize the PDF loader
loader = PyPDFLoader(FILE_PATH)

# Load data into Document objects
docs = loader.load()

# Print the contents of the document
print(docs[10].page_content[:300])
LayoutParser: A Unified Toolkit for DL-Based DIA 11
    focuses on precision, efficiency, and robustness. The target documents may have
    complicated structures, and may require training multiple layout detection models
    to achieve the optimal accuracy. Light-weight pipelines are built for relatively
    simple d
# output metadata
show_metadata(docs)
[metadata]
    ['source', 'page']
    
    [examples]
    source : ./data/layout-parser-paper.pdf
    page   : 0

The load_and_split() method allows customizing how documents are chunked by passing a text splitter object, making it more flexible for different use cases.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load Documents and split into chunks. Chunks are returned as Documents.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
docs = loader.load_and_split(text_splitter=text_splitter)
print(docs[0].page_content)
LayoutParser: A Unified Toolkit for Deep
    Learning Based Document Image Analysis
    Zejiang Shen1 (ļæ½ ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
    Lee4, Jacob Carlson3, and Weining Li5

PyPDF(OCR)

Some PDFs contain images of text, for example in scanned documents or embedded figures. The rapidocr-onnxruntime package can be used to extract text from such images.

# Initialize PDF loader, enable image extraction option
loader = PyPDFLoader(FILE_PATH, extract_images=True)

# load PDF page
docs = loader.load()

# access page content
print(docs[4].page_content[:300])
LayoutParser: A Unified Toolkit for DL-Based DIA 5
    Table 1: Current layout detection models in the LayoutParser model zoo
    Dataset Base Model1 Large ModelNotes
    PubLayNet [38] F / M M Layouts of modern scientific documents
    PRImA [3] M - Layouts of scanned modern magazines and scientific reports
    Newspaper
show_metadata(docs)
[metadata]
    ['source', 'page']
    
    [examples]
    source : ./data/layout-parser-paper.pdf
    page   : 0

PyPDF Directory

PyPDFDirectoryLoader imports all PDF documents from a directory.

from langchain_community.document_loaders import PyPDFDirectoryLoader

# directory path
loader = PyPDFDirectoryLoader("./data/")

# load documents
docs = loader.load()

# print the number of documents
docs_len = len(docs)
print(docs_len)

# get the last document loaded from the directory
document = docs[docs_len - 1]
16
# print the contents of the document
print(document.page_content[:300])
16 Z. Shen et al.
    [23] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
    Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)
    [24] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen,
    T., Lin, Z., Gimelshein, N., An
print(document.metadata)
{'source': 'data/layout-parser-paper.pdf', 'page': 15}
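
Each page becomes its own Document, so when a directory contains multiple PDFs it can help to see how many pages were loaded per file. A minimal sketch (my own illustration, not part of the original tutorial) using collections.Counter:

from collections import Counter

# Each Document records its file in metadata["source"]; tally pages per file
pages_per_file = Counter(doc.metadata["source"] for doc in docs)
for source, n_pages in pages_per_file.items():
    print(f"{source}: {n_pages} pages")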

PyMuPDF

PyMuPDF is speed-optimized and includes detailed metadata about the PDF and its pages. It returns one document per page.

LangChain's PyMuPDFLoader integrates with PyMuPDF to parse PDF documents into LangChain Document objects.

from langchain_community.document_loaders import PyMuPDFLoader

# create an instance of the PyMuPDF loader
loader = PyMuPDFLoader(FILE_PATH)

# load the document
docs = loader.load()

# print the contents of the document
print(docs[10].page_content[:300])
LayoutParser: A Unified Toolkit for DL-Based DIA
    11
    focuses on precision, efficiency, and robustness. The target documents may have
    complicated structures, and may require training multiple layout detection models
    to achieve the optimal accuracy. Light-weight pipelines are built for relatively
    simple d
show_metadata(docs)
[metadata]
    ['source', 'file_path', 'page', 'total_pages', 'format', 'title', 'author', 'subject', 'keywords', 'creator', 'producer', 'creationDate', 'modDate', 'trapped']
    
    [examples]
    source       : ./data/layout-parser-paper.pdf
    file_path    : ./data/layout-parser-paper.pdf
    page         : 0
    total_pages  : 16
    format       : PDF 1.5
    title        : 
    author       : 
    subject      : 
    keywords     : 
    creator      : LaTeX with hyperref
    producer     : pdfTeX-1.40.21
    creationDate : D:20210622012710Z
    modDate      : D:20210622012710Z
    trapped      : 

Unstructured

Unstructured is a powerful library designed to handle various unstructured and semi-structured document formats. It excels at automatically identifying and categorizing different components within documents, and it currently supports loading text files, PowerPoint files, HTML, PDFs, images, and more.

LangChain's UnstructuredPDFLoader integrates with Unstructured to parse PDF documents into LangChain Document objects.

from langchain_community.document_loaders import UnstructuredPDFLoader

# create an instance of UnstructuredPDFLoader
loader = UnstructuredPDFLoader(FILE_PATH)

# load the data
docs = loader.load()

# print the contents of the document
print(docs[0].page_content[:300])
Matplotlib is building the font cache; this may take a moment.
1 2 0 2

n u J

1 2

]

V C . s c [

2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a

LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis

Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5

1 Allen Institute for AI s
show_metadata(docs)
[metadata]
    ['source']
    
    [examples]
    source : ./data/layout-parser-paper.pdf

Internally, unstructured creates different "elements" for each chunk of text. By default, these are combined, but can be easily separated by specifying mode="elements".

# Create an instance of UnstructuredPDFLoader (mode="elements")
loader = UnstructuredPDFLoader(FILE_PATH, mode="elements")

# load the data
docs = loader.load()

# print the contents of the document
print(docs[0].page_content)
1 2 0 2

See the full set of element types for this particular article.

set(doc.metadata["category"] for doc in docs) # extract data categories
{'ListItem', 'NarrativeText', 'Title', 'UncategorizedText'}
show_metadata(docs)
[metadata]
    ['source', 'coordinates', 'file_directory', 'filename', 'languages', 'last_modified', 'page_number', 'filetype', 'category', 'element_id']
    
    [examples]
    source         : ./data/layout-parser-paper.pdf
    coordinates    : {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}
    file_directory : ./data
    filename       : layout-parser-paper.pdf
    languages      : ['eng']
    last_modified  : 2025-01-02T18:23:25
    page_number    : 1
    filetype       : application/pdf
    category       : UncategorizedText
    element_id     : d3ce55f220dfb75891b4394a18bcb973
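
The category metadata makes it easy to keep only the element types you need. For example, a minimal sketch (my own illustration, not part of the original tutorial) that keeps just the narrative text:

# Keep only narrative text, dropping titles, list items, and uncategorized text
narrative_docs = [doc for doc in docs if doc.metadata["category"] == "NarrativeText"]
print(len(narrative_docs))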

PyPDFium2

LangChain's PyPDFium2Loader integrates with PyPDFium2 to parse PDF documents into LangChain Document objects.

from langchain_community.document_loaders import PyPDFium2Loader

# create an instance of the PyPDFium2 loader
loader = PyPDFium2Loader(FILE_PATH)

# load data
docs = loader.load()

# print the contents of the document
print(docs[10].page_content[:300])
LayoutParser: A Unified Toolkit for DL-Based DIA 11

    focuses on precision, efficiency, and robustness. The target documents may have

    complicated structures, and may require training multiple layout detection models

    to achieve the optimal accuracy. Light-weight pipelines are built for relatively

    s

Note: When using PyPDFium2Loader, you may see warning messages related to get_text_range(). These come from the library's internal operations and do not affect the extracted content, so you can safely ignore them.
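
If you prefer to silence them, and assuming they are raised through Python's standard warnings module, a minimal sketch:

import warnings

# Suppress warnings that mention get_text_range() before loading
warnings.filterwarnings("ignore", message=".*get_text_range.*")

docs = PyPDFium2Loader(FILE_PATH).load()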

show_metadata(docs)
[metadata]
    ['source', 'page']
    
    [examples]
    source : ./data/layout-parser-paper.pdf
    page   : 0

PDFMiner

PDFMiner is a specialized Python library focused on text extraction and layout analysis from PDF documents.

LangChain's PDFMinerLoader integrates with PDFMiner to parse PDF documents into LangChain Document objects.

from langchain_community.document_loaders import PDFMinerLoader

# Create a PDFMiner loader instance
loader = PDFMinerLoader(FILE_PATH)

# load data
docs = loader.load()

# print the contents of the document
print(docs[0].page_content[:300])
1
    2
    0
    2
    
    n
    u
    J
    
    1
    2
    
    ]
    
    V
    C
    .
    s
    c
    [
    
    2
    v
    8
    4
    3
    5
    1
    .
    3
    0
    1
    2
    :
    v
    i
    X
    r
    a
    
    LayoutParser: A Unified Toolkit for Deep
    Learning Based Document Image Analysis
    
    Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
    Lee4, Jacob Carlson3, and Weining Li5
    
    1 Allen Institute for AI
    s
show_metadata(docs)
[metadata]
    ['source']
    
    [examples]
    source : ./data/layout-parser-paper.pdf

Using PDFMiner to generate HTML text

This method lets you parse the output HTML content with BeautifulSoup to get more structured and richer information about font size, page numbers, and PDF headers/footers, which can help you semantically split the text into sections.

from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader

# create an instance of PDFMinerPDFasHTMLLoader
loader = PDFMinerPDFasHTMLLoader(FILE_PATH)

# load the document
docs = loader.load()

# print the contents of the document
print(docs[0].page_content[:300])

    
    
    
    Page 1
    
show_metadata(docs)
[metadata]
    ['source']
    
    [examples]
    source : ./data/layout-parser-paper.pdf
from bs4 import BeautifulSoup

soup = BeautifulSoup(docs[0].page_content, "html.parser")  # initialize HTML parser
content = soup.find_all("div")  # search for all div tags
import re

cur_fs = None
cur_text = ""
snippets = []  # collect all snippets of the same font size
for c in content:
    sp = c.find("span")
    if not sp:
        continue
    st = sp.get("style")
    if not st:
        continue
    fs = re.findall(r"font-size:(\d+)px", st)
    if not fs:
        continue
    fs = int(fs[0])
    if not cur_fs:
        cur_fs = fs
    if fs == cur_fs:
        cur_text += c.text
    else:
        snippets.append((cur_text, cur_fs))
        cur_fs = fs
        cur_text = c.text
snippets.append((cur_text, cur_fs))
# Note: one could add a strategy for removing duplicate snippets (since the
# header/footer of a PDF appears across multiple pages, repeated text can be
# considered duplicate information)
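
The note above suggests removing duplicate snippets. As a minimal sketch of one such strategy (my own illustration, not part of the tutorial), you could drop any snippet text that occurs more than once, since headers and footers repeat across pages:

from collections import Counter

# Snippet texts that occur multiple times are likely repeated headers/footers
text_counts = Counter(text for text, _ in snippets)
deduped_snippets = [(text, fs) for text, fs in snippets if text_counts[text] == 1]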
from langchain_core.documents import Document

cur_idx = -1
semantic_snippets = []
# Assumption: headings have a higher font size than their respective content
for s in snippets:
    # if the current snippet's font size > the previous section's heading => it is a new heading
    if (
        not semantic_snippets
        or s[1] > semantic_snippets[cur_idx].metadata["heading_font"]
    ):
        metadata = {"heading": s[0], "content_font": 0, "heading_font": s[1]}
        metadata.update(docs[0].metadata)
        semantic_snippets.append(Document(page_content="", metadata=metadata))
        cur_idx += 1
        continue
    # if the current snippet's font size <= the previous section's content => it belongs to the same section
    if (
        not semantic_snippets[cur_idx].metadata["content_font"]
        or s[1] <= semantic_snippets[cur_idx].metadata["content_font"]
    ):
        semantic_snippets[cur_idx].page_content += s[0]
        semantic_snippets[cur_idx].metadata["content_font"] = max(
            s[1], semantic_snippets[cur_idx].metadata["content_font"]
        )
        continue
    # if the current snippet's font size > the previous section's content but
    # less than the previous section's heading, start a new section as well
    metadata = {"heading": s[0], "content_font": 0, "heading_font": s[1]}
    metadata.update(docs[0].metadata)
    semantic_snippets.append(Document(page_content="", metadata=metadata))
    cur_idx += 1

print(semantic_snippets[4])
page_content='Recently, various DL models and datasets have been developed for layout analysis    tasks. The dhSegment [22] utilizes fully convolutional networks [20] for segmen-    tation tasks on historical documents. Object detection-based methods like Faster    R-CNN [28] and Mask R-CNN [12] are used for identifying document elements [38]    and detecting tables [30, 26]. Most recently, Graph Neural Networks [29] have also    been used in table detection [27]. However, these models are usually implemented    individually and there is no unified framework to load and use such models.    There has been a surge of interest in creating open-source tools for document    image processing: a search of document image analysis in Github leads to 5M    relevant code pieces 6; yet most of them rely on traditional rule-based methods    or provide limited functionalities. The closest prior research to our work is the    OCR-D project7, which also tries to build a complete toolkit for DIA. However,    similar to the platform developed by Neudecker et al. [21], it is designed for    analyzing historical documents, and provides no supports for recent DL models.    The DocumentLayoutAnalysis project8 focuses on processing born-digital PDF    documents via analyzing the stored PDF data. Repositories like DeepLayout9    and Detectron2-PubLayNet10 are individual deep learning models trained on    layout analysis datasets without support for the full DIA pipeline. The Document    Analysis and Exploitation (DAE) platform [15] and the DeepDIVA project [2]    aim to improve the reproducibility of DIA methods (or DL models), yet they    are not actively maintained. OCR engines like Tesseract [14], easyOCR11 and    paddleOCR12 usually do not come with comprehensive functionalities for other    DIA tasks like layout analysis.    Recent years have also seen numerous efforts to create libraries for promoting    reproducibility and reusability in the field of DL. Libraries like Dectectron2 [35],    6 The number shown is obtained by specifying the search type as ā€˜code’.    7 https://ocr-d.de/en/about    8 https://github.com/BobLd/DocumentLayoutAnalysis    9 https://github.com/leonlulu/DeepLayout    10 https://github.com/hpanwar08/detectron2    11 https://github.com/JaidedAI/EasyOCR    12 https://github.com/PaddlePaddle/PaddleOCR    4    Z. Shen et al.    Fig. 1: The overall architecture of LayoutParser. For an input document image,    the core LayoutParser library provides a set of off-the-shelf tools for layout    detection, OCR, visualization, and storage, backed by a carefully designed layout    data structure. LayoutParser also supports high level customization via efficient    layout annotation and model training functions. These improve model accuracy    on the target samples. The community platform enables the easy sharing of DIA    models and whole digitization pipelines to promote reusability and reproducibility.    A collection of detailed documentation, tutorials and exemplar projects make    LayoutParser easy to learn and use.    AllenNLP [8] and transformers [34] have provided the community with complete    DL-based support for developing and deploying models for general computer    vision and natural language processing problems. LayoutParser, on the other    hand, specializes specifically in DIA tasks. LayoutParser is also equipped with a    community platform inspired by established model hubs such as Torch Hub [23]    and TensorFlow Hub [1]. 
It enables the sharing of pretrained models as well as    full document processing pipelines that are unique to DIA tasks.    There have been a variety of document data collections to facilitate the    development of DL models. Some examples include PRImA [3](magazine layouts),    PubLayNet [38](academic paper layouts), Table Bank [18](tables in academic    papers), Newspaper Navigator Dataset [16, 17](newspaper figure layouts) and    HJDataset [31](historical Japanese document layouts). A spectrum of models    trained on these datasets are currently available in the LayoutParser model zoo    to support different use cases.    ' metadata={'heading': '2 Related Work\n', 'content_font': 9, 'heading_font': 11, 'source': './data/layout-parser-paper.pdf'}
PDFPlumber

PDFPlumber is a PDF parsing library that excels at extracting text and tables from PDFs.

LangChain's PDFPlumberLoader integrates with PDFPlumber to parse PDF documents into LangChain Document objects.

Like PyMuPDF, the output document contains detailed metadata about the PDF and its pages, and the loader returns one document per page.

from langchain_community.document_loaders import PDFPlumberLoader

# create a PDF document loader instance
loader = PDFPlumberLoader(FILE_PATH)

# load the document
docs = loader.load()

# access the eleventh document's data
print(docs[10].page_content[:300])
LayoutParser: A Unified Toolkit for DL-Based DIA 11
    focuses on precision, efficiency, and robustness. The target documents may have
    complicatedstructures,andmayrequiretrainingmultiplelayoutdetectionmodels
    to achieve the optimal accuracy. Light-weight pipelines are built for relatively
    simple documen
show_metadata(docs)
[metadata]
    ['source', 'file_path', 'page', 'total_pages', 'Author', 'CreationDate', 'Creator', 'Keywords', 'ModDate', 'PTEX.Fullbanner', 'Producer', 'Subject', 'Title', 'Trapped']
    
    [examples]
    source          : ./data/layout-parser-paper.pdf
    file_path       : ./data/layout-parser-paper.pdf
    page            : 0
    total_pages     : 16
    Author          : 
    CreationDate    : D:20210622012710Z
    Creator         : LaTeX with hyperref
    Keywords        : 
    ModDate         : D:20210622012710Z
    PTEX.Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2
    Producer        : pdfTeX-1.40.21
    Subject         : 
    Title           : 
    Trapped         : False
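
Since PDFPlumber's strength is table extraction and LangChain's loader only surfaces text, you can drop down to the underlying pdfplumber library for tables. A minimal sketch (my own illustration, assuming the pdfplumber package is installed; the page index is illustrative):

import pdfplumber

# Open the PDF directly with pdfplumber and extract tables from a single page
with pdfplumber.open(FILE_PATH) as pdf:
    page = pdf.pages[4]  # page 5 contains Table 1 in this paper
    for table in page.extract_tables():
        print(table[:2])  # print the first rows of each detected table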
