HWP (Hangeul) Loader

Overview

HWP (Hangeul Word Processor) is a word processor developed by Hancom and is South Korea's representative office software.

It uses the .hwp file extension and is widely used in businesses, schools, government institutions, and more.

Therefore, if you're a developer in South Korea, you've likely had (or will have) experience dealing with .hwp documents.

Unfortunately, HWP is not yet integrated with LangChain, so we'll need to use a custom-implemented HWPLoader from langchain-teddynote, together with langchain-opentutorial.

In this tutorial, we'll use HWPLoader to load .hwp files and extract text from them.


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup helpers, useful functions, and utilities for tutorials.

  • You can check out langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial langchain-teddynote
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain-teddynote",
    ],
    verbose=False,
    upgrade=False,
)

Instantiate the HWP Loader

You can instantiate the HWP loader with the HWPLoader class.

from langchain_teddynote.document_loaders import HWPLoader

loader = HWPLoader(file_path="data/Regulations_of_the_Establishment_and_Operation_of_the_National_Artificial_Intelligence_Committee.hwp")
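
Before handing a path to the loader, it can help to confirm that the file exists and looks like an HWP 5.x document. The sketch below assumes the file is in the HWP 5.x format, which is stored in an OLE2 (Compound File Binary) container; other variants, such as the zip-based .hwpx format, will not match. The helper name `looks_like_hwp5` is hypothetical and is not part of langchain-teddynote, and a matching signature only shows that the file is an OLE2 container, not that it is a valid HWP document.

```python
from pathlib import Path

# Signature bytes at the start of every OLE2 / Compound File Binary container.
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_hwp5(path):
    """Hypothetical pre-check: True if `path` ends in .hwp and starts with the CFB signature."""
    p = Path(path)
    # Reject paths with the wrong extension or that do not point at a regular file.
    if p.suffix.lower() != ".hwp" or not p.is_file():
        return False
    # Compare only the first 8 bytes; the rest of the file is not inspected.
    with p.open("rb") as f:
        return f.read(len(CFB_MAGIC)) == CFB_MAGIC
```

One possible use is to call `looks_like_hwp5(file_path)` first and only construct `HWPLoader(file_path=file_path)` when it returns `True`, so that a wrong path or a mislabeled file fails early with a clear message.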

Load the Document

You can load the document with the load method.

docs = loader.load()

print(docs[0].page_content[:1000])
Regulations on the Establishment and Operation of the National Artifical Intelligence Committee[Effective Augst 6, 2024] [Presidential Decree No. 34787, Enacted August 6, 2024]Regulations on the Establishment and Operation of the National Artificial Intelligence Committee Ministry of Government Legislation-  /  - National Statutory Information Center    Reason for Enactment [Enactment]◇ Purpose  To establish the National Artificial Intelligence Committee under the President to strengthen national competitiveness, protect national interests, and improve the quality of life for citizens by promoting the artificial intelligence industry and creating a trustworthy AI usage environment.◇ Main Contents  A. Establishment and Functions of the National AI Committee (Article 2)    1) The National AI Committee shall be established under the President to efficiently deliberate and coordinate major policies for promoting the AI industry and establishing a foundation of trust in AI.    2) The Commit
len(docs) # Check the number of documents
1
print(docs[0].metadata) # Information about the document
{'source': 'data/Regulations_of_the_Establishment_and_Operation_of_the_National_Artificial_Intelligence_Committee.hwp'}
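
Since the loader returned the whole file as a single document here, the text in `page_content` may be too long to use in one piece. As a minimal sketch in plain Python, the text could be cut into overlapping character chunks. The helper `chunk_text` and its default sizes are hypothetical and chosen only for illustration; they are not part of langchain-teddynote or LangChain.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Hypothetical helper: split `text` into chunks of up to `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at
    a chunk boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

With the document loaded above, this could be called as `chunk_text(docs[0].page_content)`. Character-based splitting ignores sentence and article boundaries, so a structure-aware splitter may be a better fit for legal text like this.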
