HWP (Hangeul) Loader
Author: Sunyoung Park (architectyou)
Peer Review: Suhyun Lee, Kane
Proofread: JaeJun Shim
This is part of the LangChain Open Tutorial.
Overview
HWP (Hangeul Word Processor) is a word processor developed by Hancom, and it is Korea's representative office software.
It uses the .hwp file extension and is widely used in businesses, schools, government institutions, and more.
Therefore, if you're a developer in South Korea, you've likely had (or will have) experience dealing with .hwp documents.
Unfortunately, HWP is not yet integrated with LangChain, so we'll need to use a custom-implemented HWPLoader with langchain-teddynote and langchain-opentutorial.
In this tutorial, we'll use HWPLoader to load .hwp files and extract text from them.
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
%%capture --no-stderr
%pip install langchain-opentutorial langchain-teddynote
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain-teddynote",
    ],
    verbose=False,
    upgrade=False,
)
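After installation, it can help to confirm that the packages are importable before running the rest of the notebook. The following is a minimal sketch using only the standard library. The helper name missing_modules is hypothetical and not part of either package, and the check assumes the import names langchain_opentutorial and langchain_teddynote used in this tutorial.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


# Import names used in this tutorial.
required = ["langchain_opentutorial", "langchain_teddynote"]

# An empty list means both modules were found.
print(missing_modules(required))
```

If the printed list is not empty, rerun the installation cell above before continuing.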
HWP Loader Instantiate
You can instantiate the HWP loader with the HWPLoader class.
from langchain_teddynote.document_loaders import HWPLoader
loader = HWPLoader(file_path="data/Regulations_of_the_Establishment_and_Operation_of_the_National_Artificial_Intelligence_Committee.hwp")
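Before handing a path to the loader, a quick sanity check on the file can save a confusing error later. The sketch below assumes the file is in the HWP 5.x format, which is stored in an OLE compound-file container, and only checks that container's leading signature bytes. The helper name looks_like_hwp5 is hypothetical, and the same signature is shared by other OLE-based files (for example legacy .doc), so a True result is a hint rather than a guarantee.

```python
from pathlib import Path

# Signature of the OLE compound-file container (assumed container for HWP 5.x).
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_hwp5(path):
    """Return True if `path` is an existing file that starts with the CFB signature."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(len(CFB_MAGIC)) == CFB_MAGIC


# A missing file simply yields False instead of raising an error.
print(looks_like_hwp5("data/missing_example.hwp"))
```

A False result for a file that does exist may mean that it is in an older HWP format or that it is not an HWP file at all.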
Loader
You can load the document with the load method.
docs = loader.load()
print(docs[0].page_content[:1000])
Regulations on the Establishment and Operation of the National Artifical Intelligence Committee[Effective Augst 6, 2024] [Presidential Decree No. 34787, Enacted August 6, 2024]Regulations on the Establishment and Operation of the National Artificial Intelligence Committee Ministry of Government Legislation- / - National Statutory Information Center Reason for Enactment [Enactment]◇ Purpose To establish the National Artificial Intelligence Committee under the President to strengthen national competitiveness, protect national interests, and improve the quality of life for citizens by promoting the artificial intelligence industry and creating a trustworthy AI usage environment.◇ Main Contents A. Establishment and Functions of the National AI Committee (Article 2) 1) The National AI Committee shall be established under the President to efficiently deliberate and coordinate major policies for promoting the AI industry and establishing a foundation of trust in AI. 2) The Commit
len(docs) # Check the number of documents
1
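In this example the whole file arrives as a single document, so later steps such as embedding usually need the text split into smaller pieces. LangChain's text splitters are the usual tool for that. The sketch below is a plain-Python stand-in that only illustrates the idea of fixed-size chunks with overlap. The function name split_text and its default sizes are assumptions for illustration, not part of HWPLoader.

```python
def split_text(text, chunk_size=500, overlap=50):
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that content cut at a
    boundary still appears intact in one of the two neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_text("0123456789", chunk_size=5, overlap=2))
# ['01234', '34567', '6789']
```

With the loaded document, this could be applied as split_text(docs[0].page_content).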
print(docs[0].metadata) # Information about the document
{'source': 'data/Regulations_of_the_Establishment_and_Operation_of_the_National_Artificial_Intelligence_Committee.hwp'}
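Because each document records its file path under the source key, documents from several files can be combined into one list and still be traced back to where they came from. The following is a rough sketch of loading every .hwp file in a folder. The helper load_hwp_directory is hypothetical, and it assumes only what this tutorial shows: a loader class that accepts file_path and exposes load() returning a list.

```python
from pathlib import Path


def load_hwp_directory(directory, loader_factory):
    """Load every .hwp file directly inside `directory` (hypothetical helper).

    `loader_factory` is any callable that accepts `file_path=` and returns an
    object with a `load()` method returning a list, such as HWPLoader.
    """
    docs = []
    # Sort the paths so the result order is stable across runs.
    for path in sorted(Path(directory).glob("*.hwp")):
        loader = loader_factory(file_path=str(path))
        docs.extend(loader.load())
    return docs
```

With the loader from this tutorial, the call would look like load_hwp_directory("data", HWPLoader).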