PDF Loader
Author: Yejin Park
Peer Review: Yun Eun, MinJi Kang
This is a part of LangChain Open Tutorial
Overview
This tutorial covers various PDF processing methods using LangChain and popular PDF libraries.
PDF processing is essential for extracting and analyzing text data from PDF documents.
In this tutorial, we will explore different PDF loaders and their capabilities while working with LangChain's document processing framework.
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
How to load PDFs
Portable Document Format (PDF), standardized as ISO 32000, was developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.
This guide covers how to load a PDF document into the LangChain Document format. This format will be used downstream.
LangChain integrates with a variety of PDF parsers. Some are simple and relatively low-level, while others support OCR and image processing or perform advanced document layout analysis.
The right choice depends on your application.
We will demonstrate these approaches on a sample file. Download the sample file and copy it to your data folder.
PyPDF
PyPDF is one of the most widely used Python libraries for PDF processing.
Here we use PyPDF to load the PDF as a list of Document objects.
LangChain's PyPDFLoader integrates with PyPDF to parse PDF documents into LangChain Document objects.
The load_and_split() method allows customizing how documents are chunked by passing a text splitter object, making it more flexible for different use cases.
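The following is a minimal sketch of both calls, not a definitive implementation. It assumes that langchain-community, langchain-text-splitters, and pypdf are installed, and that the sample file was saved as data/sample.pdf (a hypothetical name; use your own path). The chunk sizes and the preview helper are illustrative choices, not part of the loader.

```python
from pathlib import Path

# Assumed location of the sample file; adjust to wherever you saved it.
FILE_PATH = Path("data/sample.pdf")


def load_pages(path):
    """Load a PDF with PyPDFLoader and return its Document objects."""
    # Imported lazily so this module can be read and imported
    # even before the PDF dependencies are installed.
    from langchain_community.document_loaders import PyPDFLoader

    return PyPDFLoader(str(path)).load()


def load_chunks(path, chunk_size=200, chunk_overlap=20):
    """Load a PDF and chunk it with a custom text splitter."""
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return PyPDFLoader(str(path)).load_and_split(text_splitter=splitter)


def preview(doc, width=80):
    """Return a one-line preview: page number (if present) plus leading text."""
    page = doc.metadata.get("page", "?")
    text = " ".join(doc.page_content.split())  # collapse whitespace
    return f"[page {page}] {text[:width]}"


if FILE_PATH.exists():
    pages = load_pages(FILE_PATH)
    print(len(pages), "documents loaded")
    print(preview(pages[0]))

    chunks = load_chunks(FILE_PATH)
    print(len(chunks), "chunks after splitting")
```

Passing a different splitter to load_and_split() changes how the text is chunked without changing how the PDF is parsed.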
PyPDF (OCR)
Some PDFs contain images of text, such as scanned pages or embedded pictures. You can also use the rapidocr-onnxruntime package to extract text from these images.
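As a rough sketch, OCR can be requested through the extract_images option of PyPDFLoader. This assumes that langchain-community, pypdf, and rapidocr-onnxruntime are installed and that the sample file lives at data/sample.pdf (a hypothetical path). The total_chars helper is only an illustrative way to see whether OCR added any text.

```python
from pathlib import Path

# Assumed location of the sample file; adjust to your setup.
FILE_PATH = Path("data/sample.pdf")


def load_pdf(path, extract_images=False):
    """Load a PDF with PyPDFLoader, optionally running OCR on embedded images."""
    from langchain_community.document_loaders import PyPDFLoader

    # extract_images=True asks the loader to also extract text from images.
    return PyPDFLoader(str(path), extract_images=extract_images).load()


def total_chars(docs):
    """Return the total number of characters across all documents."""
    return sum(len(doc.page_content) for doc in docs)


if FILE_PATH.exists():
    plain = load_pdf(FILE_PATH)
    with_ocr = load_pdf(FILE_PATH, extract_images=True)
    # A larger count for the OCR run suggests text was recovered from images.
    print("without OCR:", total_chars(plain))
    print("with OCR:   ", total_chars(with_ocr))
```

OCR is noticeably slower than plain text extraction, so it is usually worth enabling only for files known to contain scanned content.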
PyPDF Directory
Load all PDF documents from a directory.
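A minimal sketch of directory loading, assuming langchain-community and pypdf are installed and the PDFs sit in a data/ folder (an assumed location). The pages_per_file helper is hypothetical and simply groups the loaded documents by their "source" metadata.

```python
from collections import Counter
from pathlib import Path

# Assumed folder holding the PDF files.
DATA_DIR = Path("data")


def load_directory(directory):
    """Load every PDF found in a directory with PyPDFDirectoryLoader."""
    from langchain_community.document_loaders import PyPDFDirectoryLoader

    return PyPDFDirectoryLoader(str(directory)).load()


def pages_per_file(docs):
    """Count how many documents came from each source file."""
    counts = Counter(doc.metadata.get("source", "unknown") for doc in docs)
    return dict(counts)


if DATA_DIR.is_dir() and any(DATA_DIR.glob("*.pdf")):
    docs = load_directory(DATA_DIR)
    for source, count in pages_per_file(docs).items():
        print(f"{source}: {count} documents")
```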
PyMuPDF
PyMuPDF is optimized for speed and exposes detailed metadata about the PDF and its pages. It returns one document per page.
LangChain's PyMuPDFLoader integrates with PyMuPDF to parse PDF documents into LangChain Document objects.
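The sketch below shows one way to load the file and inspect which metadata fields come back. It assumes langchain-community and pymupdf are installed and that the sample file is at data/sample.pdf (a hypothetical path). The metadata_keys helper is illustrative only, and the exact set of keys depends on the installed versions.

```python
from pathlib import Path

# Assumed location of the sample file.
FILE_PATH = Path("data/sample.pdf")


def load_with_pymupdf(path):
    """Load a PDF with PyMuPDFLoader."""
    from langchain_community.document_loaders import PyMuPDFLoader

    return PyMuPDFLoader(str(path)).load()


def metadata_keys(docs):
    """Return the sorted union of metadata keys across all documents."""
    keys = set()
    for doc in docs:
        keys.update(doc.metadata)
    return sorted(keys)


if FILE_PATH.exists():
    docs = load_with_pymupdf(FILE_PATH)
    print(len(docs), "documents loaded")
    print("metadata fields:", metadata_keys(docs))
```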
Unstructured
Unstructured is a library designed to handle various unstructured and semi-structured document formats. It excels at automatically identifying and categorizing the different components within a document. It currently supports loading text files, PowerPoints, HTML, PDFs, images, and more.
LangChain's UnstructuredPDFLoader integrates with Unstructured to parse PDF documents into LangChain Document objects.
Internally, Unstructured creates a separate "element" for each chunk of text. By default, these elements are combined into a single document, but you can keep them separate by specifying mode="elements".
See the full set of element types for this particular article.
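One way to list the element types is to count the "category" metadata field of each element, as in this sketch. It assumes langchain-community and the unstructured package with its PDF extras are installed, and that the sample file is at data/sample.pdf (a hypothetical path). The count_categories helper is illustrative, not part of the loader.

```python
from collections import Counter
from pathlib import Path

# Assumed location of the sample file.
FILE_PATH = Path("data/sample.pdf")


def load_elements(path):
    """Load a PDF with UnstructuredPDFLoader, one Document per element."""
    from langchain_community.document_loaders import UnstructuredPDFLoader

    return UnstructuredPDFLoader(str(path), mode="elements").load()


def count_categories(docs):
    """Count documents per element category (for example Title, NarrativeText)."""
    counts = Counter(doc.metadata.get("category", "Uncategorized") for doc in docs)
    return dict(counts)


if FILE_PATH.exists():
    elements = load_elements(FILE_PATH)
    for category, count in sorted(count_categories(elements).items()):
        print(f"{category}: {count}")
```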
PyPDFium2
LangChain's PyPDFium2Loader integrates with PyPDFium2 to parse PDF documents into LangChain Document objects.
Note: When using PyPDFium2Loader, you may see warning messages related to get_text_range(). These warnings come from the library's internal operations and do not affect PDF processing, so you can safely proceed with the tutorial.
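If the warnings are distracting, one option is to silence them while the loader runs. The sketch below assumes the messages are emitted through Python's standard warnings module; if they arrive by another channel (for example logging), this will not hide them. It also assumes langchain-community and pypdfium2 are installed and that the sample file is at data/sample.pdf (a hypothetical path). call_quietly is a hypothetical helper, not part of LangChain.

```python
import warnings
from pathlib import Path

# Assumed location of the sample file.
FILE_PATH = Path("data/sample.pdf")


def call_quietly(func, *args, **kwargs):
    """Call func with Python warnings suppressed for the duration of the call."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return func(*args, **kwargs)


def load_with_pypdfium2(path):
    """Load a PDF with PyPDFium2Loader."""
    from langchain_community.document_loaders import PyPDFium2Loader

    return PyPDFium2Loader(str(path)).load()


if FILE_PATH.exists():
    docs = call_quietly(load_with_pypdfium2, FILE_PATH)
    print(len(docs), "documents loaded")
```

The previous warning filters are restored as soon as the call returns, so warnings from the rest of the program are unaffected.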
PDFMiner
PDFMiner is a specialized Python library focused on text extraction and layout analysis from PDF documents.
LangChain's PDFMinerLoader integrates with PDFMiner to parse PDF documents into LangChain Document objects.
Using PDFMiner to generate HTML text
Parsing the HTML output with BeautifulSoup gives you more structured and richer information, such as font size, page numbers, and PDF headers and footers, which can help you split the text semantically into sections.
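A possible approach, sketched here under several assumptions: langchain-community, pdfminer.six, and beautifulsoup4 are installed; the sample file is at data/sample.pdf (a hypothetical path); and the generated HTML marks font size as font-size:NNpx in each span's style attribute. The helpers font_size_of and group_by_font_size are hypothetical. They merge consecutive text blocks that share a font size, so a change in size can be treated as a section boundary.

```python
import re
from pathlib import Path

# Assumed location of the sample file.
FILE_PATH = Path("data/sample.pdf")


def font_size_of(style):
    """Extract the font size in pixels from a CSS style string, or None."""
    match = re.search(r"font-size:\s*(\d+)px", style or "")
    return int(match.group(1)) if match else None


def group_by_font_size(html):
    """Merge consecutive text blocks that share a font size.

    Returns a list of (font_size, text) tuples in document order.
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for div in soup.find_all("div"):
        span = div.find("span")
        if span is None:
            continue  # skip layout-only blocks with no text span
        size = font_size_of(span.get("style", ""))
        if size is None:
            continue
        text = div.get_text()
        if sections and sections[-1][0] == size:
            # Same font size as the previous block: extend that section.
            sections[-1] = (size, sections[-1][1] + text)
        else:
            sections.append((size, text))
    return sections


if FILE_PATH.exists():
    from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader

    html_doc = PDFMinerPDFasHTMLLoader(str(FILE_PATH)).load()[0]
    for size, text in group_by_font_size(html_doc.page_content)[:5]:
        print(size, repr(text[:60]))
```

Larger font sizes often correspond to headings, so the resulting tuples can serve as a starting point for heading-aware splitting.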