Microsoft Word (doc, docx) with LangChain


Overview

This tutorial covers two methods for loading Microsoft Word documents into LangChain Document objects that can be used in a retrieval-augmented generation (RAG) pipeline.

We demonstrate Docx2txtLoader and UnstructuredWordDocumentLoader and explore how each one processes and loads .docx files.

Additionally, we provide a comparison to help users choose the appropriate loader for their requirements.


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.

  • You can check out langchain-opentutorial for more details.

Comparison of docx Loading Methods

| Feature | Docx2txtLoader | UnstructuredWordDocumentLoader |
| --- | --- | --- |
| Base Library | docx2txt | Unstructured |
| Speed | Fast | Relatively slow |
| Memory Usage | Efficient | Relatively high |
| Installation Dependencies | Lightweight (only requires docx2txt) | Heavy (requires multiple dependency packages) |

Docx2txtLoader

Library used: docx2txt, a lightweight Python module for text extraction.

Key Features:

  • Extracts text from .docx files quickly and simply.

  • Well suited to simple tasks where efficiency matters.

Use Case:

  • When you need to quickly retrieve text data from .docx files.
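As a minimal sketch of the quick-extraction use case above, the following assumes that `langchain-community` and `docx2txt` are installed. The file path and the helper names `load_docx_text` and `preview` are hypothetical and exist only for illustration.

```python
from pathlib import Path


def load_docx_text(path: str):
    """Load a .docx file with Docx2txtLoader and return a list of Documents."""
    # Imported lazily so this module can be read without the package installed.
    # Assumed install: pip install langchain-community docx2txt
    from langchain_community.document_loaders import Docx2txtLoader

    loader = Docx2txtLoader(path)
    return loader.load()


def preview(docs, width: int = 80) -> list[str]:
    """Return the first `width` characters of each Document's page_content."""
    return [doc.page_content[:width] for doc in docs]


if __name__ == "__main__":
    sample = Path("data/sample.docx")  # hypothetical path, replace with your own file
    if sample.exists():
        docs = load_docx_text(str(sample))
        print(len(docs), docs[0].metadata)
        print(preview(docs))
```

Because the loader only extracts plain text, the result is typically a short list of Documents whose `page_content` can be passed straight to a text splitter.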

UnstructuredWordDocumentLoader

Library used: unstructured, a comprehensive document analysis library.

Key Features:

  • Understands the structure of a document, such as titles and body text, and separates them into distinct elements.

  • Allows hierarchical representation and detailed processing of documents.

  • Extracts meaningful information from unstructured data and transforms it into structured formats.

Use Case:

  • When you need to extract text while preserving the document's structure, formatting, and metadata.

  • When you need to handle complex document structures or convert unstructured data into structured formats.
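The sketch below shows one way to load a file with UnstructuredWordDocumentLoader and inspect what comes back. It assumes `langchain-community` and `unstructured` (with its .docx dependencies) are installed. The file path and the helper names `load_word_document` and `describe` are hypothetical.

```python
from pathlib import Path


def load_word_document(path: str, mode: str = "single"):
    """Load a Word file with UnstructuredWordDocumentLoader."""
    # Imported lazily so the helpers below work without the package installed.
    # Assumed install: pip install langchain-community unstructured python-docx
    from langchain_community.document_loaders import UnstructuredWordDocumentLoader

    loader = UnstructuredWordDocumentLoader(path, mode=mode)
    return loader.load()


def describe(docs):
    """Return (character count, sorted metadata keys) for each Document."""
    return [(len(doc.page_content), sorted(doc.metadata)) for doc in docs]


if __name__ == "__main__":
    sample = Path("data/sample.docx")  # hypothetical path, replace with your own file
    if sample.exists():
        for length, keys in describe(load_word_document(str(sample))):
            print(length, keys)
```

Looking at the metadata keys is a quick way to see how much more information this loader attaches compared with a plain text extractor.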

| Parameter | Option | Description |
| --- | --- | --- |
| mode | single (default) | Returns the entire document as a single Document object. |
| mode | elements | Splits the document into elements (e.g., title, body) and returns each as a Document object. |
| strategy | None (default) | No specific strategy is applied. |
| strategy | fast | Prioritizes speed (may reduce accuracy). |
| strategy | hi_res | Prioritizes high accuracy (slower processing). |
| include_page_breaks | True (default) | Detects page breaks and adds PageBreak elements. |
| include_page_breaks | False | Ignores page breaks. |
| infer_table_structure | True (default) | Infers table structure and includes it in HTML format. |
| infer_table_structure | False | Does not infer table structure. |
| starting_page_number | 1 (default) | Specifies the starting page number of the document. |

mode: single (default)

In this mode, the entire document is returned as a single LangChain Document object that contains all of the document's content.

mode: elements

The document is divided into individual elements, such as Title and NarrativeText. Each element is returned as a separate Document object, allowing for more detailed analysis or processing of the document's structure.
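To illustrate what per-element output makes possible, the sketch below groups the returned Documents by element type. It assumes each Document's metadata carries a `category` key (such as Title or NarrativeText) when `mode="elements"` is used. The helper names `load_elements` and `group_by_category` and the file path are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def load_elements(path: str):
    """Load a Word file so that each element becomes its own Document."""
    # Assumed install: pip install langchain-community unstructured python-docx
    from langchain_community.document_loaders import UnstructuredWordDocumentLoader

    return UnstructuredWordDocumentLoader(path, mode="elements").load()


def group_by_category(docs):
    """Group page_content strings by the element category in the metadata."""
    grouped = defaultdict(list)
    for doc in docs:
        # Fall back to "Unknown" if a Document has no category (assumption).
        category = doc.metadata.get("category", "Unknown")
        grouped[category].append(doc.page_content)
    return dict(grouped)


if __name__ == "__main__":
    sample = Path("data/sample.docx")  # hypothetical path, replace with your own file
    if sample.exists():
        for category, texts in group_by_category(load_elements(str(sample))).items():
            print(category, len(texts))
```

Grouping this way makes it easy to, for example, index only narrative text or use titles as section boundaries.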

Efficient Document Loader Configuration with Various Parameter Combinations

By combining parameters, you can configure a document loader that fits your specific needs. Adjusting settings such as mode, strategy, and include_page_breaks lets you tailor the loader to different document structures and processing requirements.
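One possible way to combine these settings is sketched below. It assumes that keyword arguments other than `mode` are forwarded by the loader to the underlying unstructured partitioning call. The helper names `loader_options` and `build_loader` and the file path are hypothetical.

```python
from pathlib import Path


def loader_options(**options):
    """Drop options set to None so the loader falls back to its own defaults."""
    return {name: value for name, value in options.items() if value is not None}


def build_loader(path: str, **options):
    """Create an UnstructuredWordDocumentLoader from a set of optional settings."""
    # Assumed install: pip install langchain-community unstructured python-docx
    from langchain_community.document_loaders import UnstructuredWordDocumentLoader

    # Assumption: extra keyword arguments are passed through to unstructured.
    return UnstructuredWordDocumentLoader(path, **loader_options(**options))


if __name__ == "__main__":
    options = loader_options(
        mode="elements",
        strategy="fast",
        include_page_breaks=False,
        infer_table_structure=None,  # None means "use the default"
    )
    print(options)

    sample = Path("data/sample.docx")  # hypothetical path, replace with your own file
    if sample.exists():
        docs = build_loader(str(sample), **options).load()
        print(len(docs))
```

Keeping the options in a plain dictionary makes it straightforward to try several combinations on the same file and compare the results.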
