UpstageDocumentParseLoader


Overview

The UpstageDocumentParseLoader is a document analysis tool from Upstage that integrates with the LangChain framework as a document loader. It analyzes a document's layout and content and converts the document into structured HTML.

Key Features:

  • Comprehensive Layout Analysis: Identifies structural elements such as headings, paragraphs, tables, and images across various document formats (e.g., PDFs, images).

  • Automated Structural Recognition: Automatically detects document elements and serializes them in reading order for accurate conversion to HTML.

  • Optional OCR Support: Includes optical character recognition for handling scanned or image-based documents. The ocr option supports two modes:

    force: Always extracts text from images using OCR.

    auto: Extracts text from PDFs (throws an error if the input is not in PDF format).

By recognizing and preserving the relationships between document elements, the UpstageDocumentParseLoader enables precise and context-aware document analysis.
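The two OCR modes described above suggest a simple rule of thumb: use auto for PDFs and force for everything else. The following is a minimal sketch of that rule. The helper choose_ocr_mode and the file names are hypothetical and not part of any Upstage or LangChain package.

```python
from pathlib import Path


def choose_ocr_mode(file_path: str) -> str:
    """Pick an OCR mode from the file extension (illustrative helper only)."""
    # "auto" is described as PDF-only, so use it only for .pdf inputs.
    if Path(file_path).suffix.lower() == ".pdf":
        return "auto"
    # Anything else (for example, a scanned image) falls back to forced OCR.
    return "force"


print(choose_ocr_mode("data/report.pdf"))  # auto
print(choose_ocr_mode("data/scan.png"))    # force
```

A scanned PDF with no embedded text may still need force, so treat the extension check as a starting point rather than a fixed rule.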

Migration from Layout Analysis: Upstage has launched Document Parse to replace Layout Analysis. Document Parse supports a wider range of document types, markdown output, chart detection, and equation recognition, with additional features planned for upcoming releases. The last version of Layout Analysis, layout-analysis-0.4.0, will be officially discontinued by November 10, 2024.


Key Changes from Layout Analysis

Changes to Existing Options:

  1. use_ocr → ocr

    The use_ocr option has been replaced with ocr. Instead of True/False, it now accepts force or auto for more precise control.

  2. output_type → output_format

    The output_type option has been renamed to output_format; it specifies the format of the output.

  3. exclude → base64_encoding

    The exclude option has been replaced with base64_encoding. While exclude was used to exclude specific elements from the output, base64_encoding specifies which element categories to encode in Base64.
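To make the three option changes concrete, here is a minimal sketch of how old Layout Analysis keyword arguments could be translated to their Document Parse names. The helper migrate_layout_analysis_options is hypothetical and not part of langchain-upstage. Mapping True to force and False to auto is an assumption, and exclude is simply dropped because base64_encoding serves a different purpose.

```python
def migrate_layout_analysis_options(old: dict) -> dict:
    """Translate Layout Analysis keyword arguments to Document Parse names.

    Hypothetical helper for illustration; not part of langchain-upstage.
    """
    new = dict(old)  # copy so the caller's dict is left untouched

    # use_ocr (bool) -> ocr ("force" / "auto"); this bool-to-string mapping is an assumption.
    if "use_ocr" in new:
        new["ocr"] = "force" if new.pop("use_ocr") else "auto"

    # output_type -> output_format is a pure rename.
    if "output_type" in new:
        new["output_format"] = new.pop("output_type")

    # exclude has no one-to-one successor, so the old value is dropped, not converted.
    new.pop("exclude", None)

    return new


old_options = {"use_ocr": True, "output_type": "html", "exclude": ["header", "footer"]}
print(migrate_layout_analysis_options(old_options))
# {'ocr': 'force', 'output_format': 'html'}
```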

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup tools, helper functions, and utilities for tutorials.

  • You can check out langchain-opentutorial for more details.

API Key Configuration

To use UpstageDocumentParseLoader, you need to obtain an Upstage API key.

Once you have your API key, set it as the value of the UPSTAGE_API_KEY environment variable.

Alternatively, you can set UPSTAGE_API_KEY in a .env file and load it.

[Note] This is not necessary if you've already set UPSTAGE_API_KEY in previous steps.
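Both ways of supplying the key can be sketched in a few lines. The helper configure_upstage_key is hypothetical. It assumes the optional python-dotenv package for reading a .env file and skips that step if the package is not installed.

```python
import os
from typing import Optional

try:
    from dotenv import load_dotenv  # optional dependency: python-dotenv
except ImportError:
    load_dotenv = None


def configure_upstage_key(api_key: Optional[str] = None) -> bool:
    """Make UPSTAGE_API_KEY available and report whether it is set.

    Hypothetical helper for illustration only.
    """
    if api_key:
        # Option 1: set the key directly for the current process.
        os.environ["UPSTAGE_API_KEY"] = api_key
    elif load_dotenv is not None:
        # Option 2: read it from a .env file without overriding existing variables.
        load_dotenv(override=False)
    return bool(os.environ.get("UPSTAGE_API_KEY"))


if not configure_upstage_key():
    print("UPSTAGE_API_KEY is not set; add it to the environment or a .env file.")
```

Avoid hard-coding a real key in a notebook that will be shared; the .env route keeps it out of the source.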

UpstageDocumentParseLoader Key Parameters

  • file_path: Path(s) to the document(s) to be analyzed

  • split: Document splitting mode, one of 'none' (default), 'element', or 'page'

  • model: Model name for document parsing (default: 'document-parse')

  • ocr: OCR mode, either 'force' (always use OCR) or 'auto' (PDF input only)

  • output_format: Format of the analysis results, one of 'html' (default), 'text', or 'markdown'

  • coordinates: Whether to include OCR coordinates in the output (default: True)

  • base64_encoding: List of element categories to be Base64-encoded, e.g. ['paragraph', 'table', 'figure', 'header', 'footer', 'list', 'chart', '...']
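The parameters above can be combined as in the following sketch. The helpers build_loader_kwargs and load_with_document_parse are hypothetical, and the ocr default of 'auto' is this sketch's choice rather than a documented default. Actually loading a document assumes the langchain-upstage package is installed and UPSTAGE_API_KEY is set.

```python
VALID_SPLITS = {"none", "element", "page"}
VALID_OCR = {"force", "auto"}
VALID_FORMATS = {"html", "text", "markdown"}


def build_loader_kwargs(split="none", ocr="auto", output_format="html",
                        coordinates=True, base64_encoding=None):
    """Validate option values and return them as keyword arguments (illustrative helper)."""
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {sorted(VALID_SPLITS)}")
    if ocr not in VALID_OCR:
        raise ValueError(f"ocr must be one of {sorted(VALID_OCR)}")
    if output_format not in VALID_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(VALID_FORMATS)}")
    return {
        "split": split,
        "ocr": ocr,
        "output_format": output_format,
        "coordinates": coordinates,
        # Passing an empty list when nothing is requested is an assumption of this sketch.
        "base64_encoding": list(base64_encoding or []),
    }


def load_with_document_parse(file_path, **options):
    """Create the loader and return the parsed documents."""
    # Imported lazily: needs the langchain-upstage package and UPSTAGE_API_KEY.
    from langchain_upstage import UpstageDocumentParseLoader

    loader = UpstageDocumentParseLoader(file_path, **build_loader_kwargs(**options))
    return loader.load()


# Only builds the options; no API call is made here.
print(build_loader_kwargs(split="page", base64_encoding=["chart"]))
```

Keeping the validation separate from the loader call makes it possible to catch a mistyped option before any request is sent.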

Usage Example

Let's run a code example using UpstageDocumentParseLoader.

Data Preparation

In this tutorial, we will use the following PDF file:

After downloading the PDF file from the provided link, create a data folder in the current directory and save the PDF file into that folder.
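The folder step can also be done from Python. This is a minimal sketch using only the standard library. The helper names are hypothetical, and the download itself is still done manually as described above.

```python
from pathlib import Path


def prepare_data_dir(base_dir=".") -> Path:
    """Create a data folder under base_dir if it does not exist and return its path."""
    data_dir = Path(base_dir) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir


def list_pdfs(data_dir: Path) -> list:
    """Return the PDF files found directly inside data_dir, sorted by path."""
    return sorted(p for p in data_dir.iterdir() if p.suffix.lower() == ".pdf")


data_dir = prepare_data_dir()
pdf_files = list_pdfs(data_dir)
if pdf_files:
    print("Found:", [p.name for p in pdf_files])
else:
    print(f"No PDF found in {data_dir}; save the downloaded file there first.")
```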
