PDF Loader


Overview

This tutorial covers various PDF processing methods using LangChain and popular PDF libraries.

PDF processing is essential for extracting and analyzing text data from PDF documents.

In this tutorial, we will explore different PDF loaders and their capabilities while working with LangChain's document processing framework.



Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, helper functions, and utilities for tutorials.

  • You can check out langchain-opentutorial for more details.

How to load PDFs

Portable Document Format (PDF), standardized as ISO 32000, was developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.

This guide covers how to load a PDF document into the LangChain Document format. This format will be used downstream.

LangChain integrates with a variety of PDF parsers. Some are simple and relatively low-level, while others support OCR and image processing or perform advanced document layout analysis.

The right choice depends on your application.

We will demonstrate these approaches on a sample file. Download the sample file and copy it to your data folder.

PyPDF

PyPDF is one of the most widely used Python libraries for PDF processing.

Here we use PyPDF to load the PDF as a list of Document objects.

LangChain's PyPDFLoader integrates with PyPDF to parse PDF documents into LangChain Document objects.
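As a minimal sketch of this workflow, the snippet below wraps PyPDFLoader in a small function and adds a helper that summarizes the result. The path data/sample.pdf and the names load_with_pypdf and preview are assumptions made for illustration; the loader is imported inside the function so the snippet can be imported even where langchain-community and pypdf are not installed.

```python
from pathlib import Path


def load_with_pypdf(path: Path) -> list:
    """Return the LangChain Document objects PyPDFLoader produces for `path`."""
    # Imported lazily so this module imports without the optional dependency.
    from langchain_community.document_loaders import PyPDFLoader

    return PyPDFLoader(str(path)).load()


def preview(docs: list, n_chars: int = 300) -> str:
    """Summarize a list of Documents: count, first metadata, start of first text."""
    if not docs:
        return "pages=0"
    first = docs[0]
    return f"pages={len(docs)} metadata={first.metadata} text={first.page_content[:n_chars]!r}"


# Example usage (requires langchain-community, pypdf, and the sample file;
# the file name is hypothetical):
# docs = load_with_pypdf(Path("data/sample.pdf"))
# print(preview(docs))
```

Each returned Document exposes the extracted text as page_content and details such as the source path in metadata.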

The load_and_split() method allows customizing how documents are chunked by passing a text splitter object, making it more flexible for different use cases.
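The following sketch shows one way to pass a text splitter to load_and_split(). The choice of RecursiveCharacterTextSplitter, the chunk sizes, and the helper names are assumptions for illustration, not settings prescribed by this tutorial.

```python
from pathlib import Path


def load_and_split_pdf(path: Path, chunk_size: int = 200, chunk_overlap: int = 20) -> list:
    """Load a PDF and split it into chunks of roughly `chunk_size` characters."""
    # Imported lazily so this module imports without the optional dependencies.
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return PyPDFLoader(str(path)).load_and_split(text_splitter=splitter)


def longest_chunk(docs: list) -> int:
    """Length in characters of the longest chunk, or 0 for an empty list."""
    return max((len(d.page_content) for d in docs), default=0)


# Example usage (hypothetical file name):
# chunks = load_and_split_pdf(Path("data/sample.pdf"))
# print(len(chunks), longest_chunk(chunks))
```

Checking the longest chunk is a quick way to see whether the splitter settings produce pieces of the size you expect.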

PyPDF (OCR)

Some PDFs contain text only as images, for example in scanned documents or embedded pictures. With the rapidocr-onnxruntime package installed, you can also extract text from these images.
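A hedged sketch of OCR-enabled loading: it assumes PyPDFLoader's extract_images flag is available in your installed langchain-community version and that rapidocr-onnxruntime is installed. The helper pages_with_text is a hypothetical convenience for checking which pages yielded any text.

```python
from pathlib import Path


def load_with_ocr(path: Path) -> list:
    """Load a PDF and additionally run OCR on embedded images."""
    # Imported lazily so this module imports without the optional dependencies.
    from langchain_community.document_loaders import PyPDFLoader

    return PyPDFLoader(str(path), extract_images=True).load()


def pages_with_text(docs: list) -> list:
    """Return the page numbers of Documents whose text is not blank."""
    return [d.metadata.get("page") for d in docs if d.page_content.strip()]


# Example usage (hypothetical file name):
# docs = load_with_ocr(Path("data/scanned.pdf"))
# print(pages_with_text(docs))
```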

PyPDF Directory

Load all PDF documents from a directory.
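One way to do this is with PyPDFDirectoryLoader, sketched below. The directory name data/ and the helper pages_per_file are assumptions for illustration; the helper relies on the source entry that the PyPDF-based loaders record in each Document's metadata.

```python
from collections import Counter


def load_directory(directory: str = "data/") -> list:
    """Load every PDF found in `directory` into LangChain Document objects."""
    # Imported lazily so this module imports without the optional dependency.
    from langchain_community.document_loaders import PyPDFDirectoryLoader

    return PyPDFDirectoryLoader(directory).load()


def pages_per_file(docs: list) -> Counter:
    """Count how many Documents came from each source file."""
    return Counter(d.metadata.get("source", "unknown") for d in docs)


# Example usage:
# docs = load_directory("data/")
# for source, count in pages_per_file(docs).items():
#     print(source, count)
```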

PyMuPDF

PyMuPDF is optimized for speed and exposes detailed metadata about the PDF and its pages. The loader returns one document per page.

LangChain's PyMuPDFLoader integrates with PyMuPDF to parse PDF documents into LangChain Document objects.
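The sketch below loads a file with PyMuPDFLoader and lists the metadata fields it returned, which is a simple way to inspect the metadata mentioned above. The file name and the helper metadata_keys are assumptions for illustration; the exact set of fields depends on the PDF and the installed library versions.

```python
from pathlib import Path


def load_with_pymupdf(path: Path) -> list:
    """Load a PDF with PyMuPDFLoader."""
    # Imported lazily so this module imports without the optional dependency.
    from langchain_community.document_loaders import PyMuPDFLoader

    return PyMuPDFLoader(str(path)).load()


def metadata_keys(docs: list) -> list:
    """Sorted union of the metadata keys present across all Documents."""
    keys = set()
    for d in docs:
        keys.update(d.metadata)
    return sorted(keys)


# Example usage (hypothetical file name):
# docs = load_with_pymupdf(Path("data/sample.pdf"))
# print(len(docs), metadata_keys(docs))
```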

Unstructured

Unstructured is a library designed to handle various unstructured and semi-structured document formats. It excels at automatically identifying and categorizing the different components within a document. It currently supports loading text files, PowerPoints, HTML, PDFs, images, and more.

LangChain's UnstructuredPDFLoader integrates with Unstructured to parse PDF documents into LangChain Document objects.

Internally, Unstructured creates different "elements" for each chunk of text. By default, these are combined, but you can keep them separate by specifying mode="elements".

In this mode, each element records its type in its metadata, so you can list the full set of element types found in this particular document.
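A minimal sketch of the elements mode: it loads the PDF with mode="elements" and tallies the category recorded in each element's metadata. The file name and the helper element_categories are assumptions for illustration, and the fallback label "Uncategorized" is this sketch's own choice, not a value produced by Unstructured.

```python
from collections import Counter
from pathlib import Path


def load_elements(path: Path) -> list:
    """Load a PDF as one Document per Unstructured element."""
    # Imported lazily so this module imports without the optional dependency.
    from langchain_community.document_loaders import UnstructuredPDFLoader

    return UnstructuredPDFLoader(str(path), mode="elements").load()


def element_categories(docs: list) -> Counter:
    """Count elements by the category stored in their metadata."""
    return Counter(d.metadata.get("category", "Uncategorized") for d in docs)


# Example usage (hypothetical file name):
# elements = load_elements(Path("data/sample.pdf"))
# print(element_categories(elements))
```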

PyPDFium2

LangChain's PyPDFium2Loader integrates with PyPDFium2 to parse PDF documents into LangChain Document objects.

Note: When using PyPDFium2Loader, you may see warning messages related to get_text_range(). These warnings come from the library's internal operations and do not affect PDF processing, so you can safely proceed with the tutorial.
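If the warnings are distracting, one option is to silence them while loading. The sketch below assumes they are emitted through Python's standard warnings module; if your installed version reports them another way (for example through logging), this filter will have no effect. The names quietly and load_with_pypdfium2 are hypothetical helpers.

```python
import warnings
from pathlib import Path


def quietly(fn, *args, **kwargs):
    """Call `fn` while ignoring warnings raised through the warnings module."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return fn(*args, **kwargs)


def load_with_pypdfium2(path: Path) -> list:
    """Load a PDF with PyPDFium2Loader."""
    # Imported lazily so this module imports without the optional dependency.
    from langchain_community.document_loaders import PyPDFium2Loader

    return PyPDFium2Loader(str(path)).load()


# Example usage (hypothetical file name):
# docs = quietly(load_with_pypdfium2, Path("data/sample.pdf"))
# print(len(docs))
```

The filter is scoped to the single call, so warnings from the rest of your program remain visible.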

PDFMiner

PDFMiner is a specialized Python library focused on text extraction and layout analysis from PDF documents.

LangChain's PDFMinerLoader integrates with PDFMiner to parse PDF documents into LangChain Document objects.
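A short sketch of loading with PDFMinerLoader. The file name and the helper total_characters are assumptions for illustration. How many Document objects are returned can depend on the installed langchain-community version, so the usage lines print both the count and the total amount of text rather than assuming one document per page.

```python
from pathlib import Path


def load_with_pdfminer(path: Path) -> list:
    """Load a PDF with PDFMinerLoader."""
    # Imported lazily so this module imports without the optional dependency.
    from langchain_community.document_loaders import PDFMinerLoader

    return PDFMinerLoader(str(path)).load()


def total_characters(docs: list) -> int:
    """Total number of extracted characters across all Documents."""
    return sum(len(d.page_content) for d in docs)


# Example usage (hypothetical file name):
# docs = load_with_pdfminer(Path("data/sample.pdf"))
# print(len(docs), total_characters(docs))
```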

Using PDFMiner to generate HTML text

With this method, you can parse the HTML output with BeautifulSoup to obtain more structured information, such as font sizes, page numbers, and PDF headers and footers, which can help you split the text semantically into sections.
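The sketch below illustrates one such approach: it converts the PDF to HTML with PDFMinerPDFasHTMLLoader, reads the inline style of the first span in each div, and merges consecutive blocks that share a font size, so that a change in font size marks a likely section boundary. The helper names, the file name, and the assumption that font sizes appear as font-size:NNpx in span styles are illustrative; inspect the HTML produced for your own file before relying on them.

```python
import re
from pathlib import Path
from typing import Iterable, List, Optional, Tuple

FONT_SIZE = re.compile(r"font-size:\s*(\d+)px")


def load_pdf_as_html(path: Path) -> str:
    """Return the whole PDF rendered as a single HTML string."""
    # Imported lazily so this module imports without the optional dependency.
    from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader

    return PDFMinerPDFasHTMLLoader(str(path)).load()[0].page_content


def html_blocks(html: str) -> Iterable[Tuple[str, str]]:
    """Yield (style, text) for each div whose first span has an inline style."""
    from bs4 import BeautifulSoup  # lazy import, see above

    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all("div"):
        span = div.find("span")
        if span is not None and span.get("style"):
            yield span["style"], div.text


def font_size(style: str) -> Optional[int]:
    """Extract the pixel font size from an inline style, if present."""
    match = FONT_SIZE.search(style or "")
    return int(match.group(1)) if match else None


def group_by_font_size(blocks: Iterable[Tuple[str, str]]) -> List[Tuple[int, str]]:
    """Merge consecutive blocks with the same font size into (size, text) snippets."""
    snippets: List[Tuple[int, str]] = []
    for style, text in blocks:
        size = font_size(style)
        if size is None:
            continue  # skip blocks without a recognizable font size
        if snippets and snippets[-1][0] == size:
            snippets[-1] = (size, snippets[-1][1] + text)
        else:
            snippets.append((size, text))
    return snippets


# Example usage (hypothetical file name):
# html = load_pdf_as_html(Path("data/sample.pdf"))
# for size, text in group_by_font_size(html_blocks(html)):
#     print(size, repr(text[:60]))
```

Snippets with a larger font size than their neighbors are candidates for headings, which gives a simple basis for splitting the text into sections.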
