UpstageDocumentParseLoader

Author: Taylor (Jihyun Kim)
Peer Review: JoonHo Kim, Jaemin Hong, leebeanbin, Dooil Kwak
Proofread: JaeJun Shim
This is a part of the LangChain Open Tutorial

Overview

The UpstageDocumentParseLoader is a document analysis tool designed by Upstage that integrates with the LangChain framework as a document loader. It analyzes a document's layout and content and transforms the document into structured HTML.

Key Features:

Comprehensive Layout Analysis: Analyzes and identifies structural elements like headings, paragraphs, tables, and images across various document formats (e.g., PDFs, images).
Automated Structural Recognition: Automatically detects and serializes document elements based on reading order for accurate conversion to HTML.
Optional OCR Support: Includes optical character recognition for handling scanned or image-based documents. The OCR mode supports:
    force: Extracts text from images using OCR.
    auto: Extracts text from PDFs (throws an error if the input is not in PDF format).

By recognizing and preserving the relationships between document elements, the UpstageDocumentParseLoader enables precise and context-aware document analysis.
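Because the auto mode described above fails on non-PDF input, it can help to pick the OCR mode from the file type before creating the loader. The following is a minimal sketch of that idea. The helper `choose_ocr_mode` is hypothetical (it is not part of langchain_upstage), and it assumes the file extension reliably reflects the file format.

```python
from pathlib import Path


def choose_ocr_mode(file_path: str) -> str:
    """Hypothetical helper: pick an `ocr` value based on the file extension.

    Assumption: a ".pdf" extension means the input really is a PDF.
    """
    suffix = Path(file_path).suffix.lower()
    # "auto" only works for PDFs; everything else (e.g., images) needs "force".
    return "auto" if suffix == ".pdf" else "force"


for name in ["report.pdf", "scan.PNG", "receipt.jpeg"]:
    print(name, "->", choose_ocr_mode(name))
```

The returned string could then be passed as the ocr option of the loader. A scanned PDF without selectable text may still be better served by force, so treat the result as a default rather than a rule.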

Migration from Layout Analysis: Upstage has launched Document Parse to replace Layout Analysis. Document Parse supports a wider range of document types, markdown output, chart detection, and equation recognition, with additional features planned for upcoming releases. The last version of Layout Analysis, layout-analysis-0.4.0, will be officially discontinued by November 10, 2024.

Key Changes from Layout Analysis

Changes to Existing Options:

use_ocr → ocr
The use_ocr option has been replaced with ocr. Instead of True/False, it now accepts force or auto for more precise control.
output_type → output_format
The output_type option has been renamed to output_format, which specifies the format of the output.
exclude → base64_encoding
The exclude option has been replaced with base64_encoding. While exclude was used to exclude specific elements from the output, base64_encoding specifies whether to encode elements of certain categories in Base64.
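If you have existing Layout Analysis settings, the option changes above can be applied mechanically. Below is a minimal sketch using a plain dictionary of options. The function `migrate_layout_analysis_options` is hypothetical, and the mapping of use_ocr=True to force and use_ocr=False to auto is an assumption based on the mode descriptions in this tutorial, not an official migration rule. Since exclude and base64_encoding do different things, the sketch drops exclude with a warning instead of translating it.

```python
import warnings
from typing import Any, Dict


def migrate_layout_analysis_options(old: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: rename Layout Analysis options for Document Parse."""
    new = dict(old)  # work on a copy so the caller's dict is untouched

    if "use_ocr" in new:
        # Assumption: True -> "force" (always OCR), False -> "auto".
        new["ocr"] = "force" if new.pop("use_ocr") else "auto"

    if "output_type" in new:
        # Pure rename; the value is carried over unchanged.
        new["output_format"] = new.pop("output_type")

    if "exclude" in new:
        # No one-to-one replacement exists, so drop it and tell the user.
        new.pop("exclude")
        warnings.warn(
            "'exclude' has no direct equivalent; review 'base64_encoding' instead."
        )

    return new


# Illustrative input values only.
old_options = {"use_ocr": True, "output_type": "html", "exclude": ["header", "footer"]}
new_options = migrate_layout_analysis_options(old_options)
print(new_options)  # {'ocr': 'force', 'output_format': 'html'}
```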

References

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

langchain-opentutorial is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.
You can check out langchain-opentutorial for more details.

API Key Configuration

To use UpstageDocumentParseLoader, you need to obtain an Upstage API key.

Once you have your API key, set it as the value of the variable UPSTAGE_API_KEY.

%%capture --no-stderr
%pip install langchain-opentutorial

# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain_upstage",
    ],
    verbose=False,
    upgrade=False,
)

# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "UPSTAGE_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "12-UpstageDocumentParseLoader",
    }
)

Environment variables have been set successfully.

Alternatively, you can set UPSTAGE_API_KEY in a .env file and load it.

[Note] This is not necessary if you've already set UPSTAGE_API_KEY in previous steps.

from dotenv import load_dotenv

load_dotenv(override=True)

True
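Whichever method you use, the key can end up empty (for example, when the placeholder "" in set_env is left unchanged). The following is a minimal sketch of a sanity check to run before using the loader. The helper `is_env_set` is hypothetical and only checks that the variable is present and non-blank, not that the key is valid.

```python
import os


def is_env_set(name: str) -> bool:
    """Hypothetical helper: True if the variable exists and is not blank."""
    return bool(os.environ.get(name, "").strip())


if is_env_set("UPSTAGE_API_KEY"):
    print("UPSTAGE_API_KEY is set.")
else:
    print("UPSTAGE_API_KEY is missing or empty. Set it before using the loader.")
```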

import os
import nest_asyncio

# Allow async
nest_asyncio.apply()