Microsoft Word(doc, docx) With Langchain

Author: Suhyun Lee
Peer Review: Sunyoung Park (architectyou), Teddy Lee
Proofread : Youngjun cho
This is a part of LangChain Open Tutorial

Overview

This tutorial covers two methods for loading Microsoft Word documents into a document format that can be used in RAG.

We will demonstrate the usage of Docx2txtLoader and UnstructuredWordDocumentLoader , exploring their functionalities to process and load .docx files effectively.

Additionally, we provide a comparison to help users choose the appropriate loader for their requirements.

References

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.
You can checkout the langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial

# Install required packages
from langchain_opentutorial import package

package.install(
    ["langchain", "langchain_community", "docx2txt", "unstructured", "python-docx"],
    verbose=False,
    upgrade=False,
)

Comparison of docx Loading Methods

Feature

Docx2txtLoader

UnstructuredWordDocumentLoader

Base Library

docx2txt

Unstructured

Speed

Fast

Relatively slow

Memory Usage

Efficient

Relatively high

Installation Dependencies

Lightweight (only requires docx2txt)

Heavy (requires multiple dependency packages)

Docx2txtLoader

Used Library : A lightweight Python module such as docx2txt for text extraction.

Key Features :

Extracts text from .docx files quickly and simply.
Suitable for efficient and straightforward tasks.

Use Case :

When you need to quickly retrieve text data from .docx files.

from langchain_community.document_loaders import Docx2txtLoader

# Initialize the document loader
loader = Docx2txtLoader("data/sample-word-document_eng.docx")

# Load the document
docs = loader.load()

# Print the number of documents
print(f"Document Count: {len(docs)}\n")

# Print the type of the loader
print(f"Type of loader: {type(loader)}\n")

# Print the metadata of the document
print(f"Document Metadata: {docs[0].metadata}\n")

# Note: The entire docx file is converted into a single document.
# It needs to be split into smaller parts using a text splitter.
print("Document Content")
print(docs[0])

Document Count: 1
    
    Type of loader: 
    
    Document Metadata: {'source': 'data/sample-word-document_eng.docx'}
    
    Document Content
    page_content='Semantic Search
    
    
    
    Definition: Semantic search is a search methodology that goes beyond simple keyword matching to understand the meaning of a user's query and return relevant results.
    
    Example: When a user searches for "planets in the solar system," the search returns information about planets like "Jupiter" and "Mars."
    
    Related Keywords: Natural Language Processing (NLP), Search Algorithms, Data Mining
    
    
    
    Embedding
    
    
    
    Definition: Embedding refers to the process of transforming textual data, such as words or sentences, into low-dimensional continuous vectors. This allows computers to understand and process text effectively.
    
    Example: The word "apple" might be represented as a vector like [0.65, -0.23, 0.17].
    
    Related Keywords: Natural Language Processing (NLP), Vectorization, Deep Learning
    
    
    
    Token
    
    
    
    Definition: A token is a smaller unit obtained by splitting a piece of text. This unit can be a word, a sentence, or a phrase.
    
    Example: The sentence "I go to school" can be tokenized into "I," "go," and "to school."
    
    Related Keywords: Tokenization, Natural Language Processing (NLP), Parsing
    
    
    
    Tokenizer
    
    
    
    Definition: A tokenizer is a tool that splits textual data into tokens. It is used in natural language processing to preprocess data.
    
    Example: The sentence "I love programming." is split into ["I", "love", "programming", "."].
    
    Related Keywords: Tokenization, Natural Language Processing (NLP), Parsing
    
    
    
    VectorStore
    
    
    
    Definition: A VectorStore is a system for storing data in vector format. It is used for tasks like search, classification, and other data analysis operations.
    
    Example: Storing word embedding vectors in a database for fast retrieval.
    
    Related Keywords: Embedding, Database, Vectorization
    
    
    
    SQL
    
    
    
    Definition: SQL (Structured Query Language) is a programming language used to manage data in databases. It allows operations like querying, updating, inserting, and deleting data.
    
    Example: SELECT * FROM users WHERE age > 18; retrieves information about users older than 18.
    
    Related Keywords: Database, Query, Data Management
    
    
    
    CSV
    
    
    
    Definition: CSV (Comma-Separated Values) is a file format used to store data, where each value is separated by a comma. It is commonly used for saving and exchanging tabular data.
    
    Example: A CSV file with headers "Name," "Age," and "Occupation" might include data like "Hong Gil-dong, 30, Developer."
    
    Related Keywords: Data Format, File Handling, Data Exchange
    
    
    
    JSON
    
    
    
    Definition: JSON (JavaScript Object Notation) is a lightweight data exchange format that uses human-readable text to represent data objects. It is widely used for data communication between systems.
    
    Example: {"Name": "Hong Gil-dong", "Age": 30, "Occupation": "Developer"} is an example of JSON data.
    
    Related Keywords: Data Exchange, Web Development, API
    
    
    
    Transformer
    
    
    
    Definition: A Transformer is a type of deep learning model used in natural language processing for tasks like translation, summarization, and text generation. It is based on the Attention mechanism.
    
    Example: Google Translate uses Transformer models to perform translations between various languages.
    
    Related Keywords: Deep Learning, Natural Language Processing (NLP), Attention
    
    
    
    HuggingFace
    
    
    
    Definition: HuggingFace is a library offering a variety of pre-trained models and tools for natural language processing, enabling researchers and developers to easily perform NLP tasks.
    
    Example: HuggingFace's Transformers library can be used for tasks such as sentiment analysis and text generation.
    
    Related Keywords: Natural Language Processing (NLP), Deep Learning, Library
    
    
    
    Digital Transformation
    
    
    
    Definition: Digital transformation is the process of leveraging technology to innovate and enhance a company’s services, culture, and operations. It focuses on improving business models and increasing competitiveness through digital technologies.
    
    Example: A company adopting cloud computing to revolutionize data storage and processing is an example of digital transformation.
    
    Related Keywords: Innovation, Technology, Business Model
    
    
    
    Crawling
    
    
    
    Definition: Crawling is the process of automatically visiting web pages to collect data. It is commonly used for search engine optimization and data analysis.
    
    Example: Google’s search engine crawls websites on the internet to collect and index content.
    
    Related Keywords: Data Collection, Web Scraping, Search Engine
    
    
    
    Word2Vec
    
    
    
    Definition: Word2Vec is a natural language processing technique that maps words to a vector space, representing semantic relationships between words. It generates vectors based on the contextual similarity of words.
    
    Example: In a Word2Vec model, "king" and "queen" are represented as vectors that are close to each other in the vector space.
    
    Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity
    
    
    
    LLM (Large Language Model)
    
    
    
    Definition: LLM refers to large-scale language models trained on extensive text datasets. These models are used for a variety of natural language understanding and generation tasks.
    
    Example: OpenAI’s GPT series is a prominent example of large language models.
    
    Related Keywords: Natural Language Processing (NLP), Deep Learning, Text Generation
    
    
    
    FAISS (Facebook AI Similarity Search)
    
    
    
    Definition: FAISS is a high-speed similarity search library developed by Facebook, designed to efficiently search for similar vectors in large vector datasets.
    
    Example: FAISS can be used to quickly find similar images among millions of image vectors.
    
    Related Keywords: Vector Search, Machine Learning, Database Optimization
    
    
    
    Open Source
    
    
    
    Definition: Open source refers to software whose source code is openly available for anyone to use, modify, and distribute freely. It plays a significant role in fostering collaboration and innovation.
    
    Example: The Linux operating system is a notable open-source project.
    
    Related Keywords: Software Development, Community, Technical Collaboration
    
    
    
    Structured Data
    
    
    
    Definition: Structured data is data that is organized according to a predefined format or schema, making it easy to search and analyze in systems like databases and spreadsheets.
    
    Example: A customer information table stored in a relational database is an example of structured data.
    
    Related Keywords: Database, Data Analysis, Data Modeling
    
    
    
    Parser
    
    
    
    Definition: A parser is a tool that analyzes input data (e.g., strings, files) and converts it into a structured format. It is commonly used in syntax analysis for programming languages or processing file data.
    
    Example: Parsing an HTML document to create the DOM structure of a web page is an example of parsing.
    
    Related Keywords: Syntax Analysis, Compiler, Data Processing
    
    
    
    TF-IDF (Term Frequency-Inverse Document Frequency)
    
    
    
    Definition: TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents. It considers both the frequency of the word in the document and its rarity across all documents.
    
    Example: Words that appear frequently in a document but rarely in the corpus will have a high TF-IDF value.
    
    Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining
    
    
    
    Deep Learning
    
    
    
    Definition: Deep learning is a subset of machine learning that uses artificial neural networks to solve complex problems. It focuses on learning high-level representations from data.
    
    Example: Deep learning models are applied in tasks like image recognition, speech recognition, and natural language processing.
    
    Related Keywords: Artificial Neural Networks, Machine Learning, Data Analysis
    
    
    
    Schema
    
    
    
    Definition: A schema defines the structure of a database or file, providing a blueprint for how data is stored and organized.
    
    Example: A relational database table schema specifies column names, data types, and key constraints.
    
    Related Keywords: Database, Data Modeling, Data Management
    
    
    
    DataFrame
    
    
    
    Definition: A DataFrame is a tabular data structure composed of rows and columns, commonly used for data analysis and processing.
    
    Example: In the pandas library, a DataFrame can contain columns of various data types, making data manipulation and analysis straightforward.
    
    Related Keywords: Data Analysis, pandas, Data Processing
    
    
    
    Attention Mechanism
    
    
    
    Definition: The Attention mechanism is a deep learning technique that allows models to focus on the most relevant parts of input data. It is mainly used in sequence data, such as text or time series.
    
    Example: In translation models, the Attention mechanism helps the model focus on the crucial parts of the input sentence to produce accurate translations.
    
    Related Keywords: Deep Learning, Natural Language Processing (NLP), Sequence Modeling
    
    
    
    Pandas
    
    
    
    Definition: Pandas is a Python library that provides tools for data analysis and manipulation. It enables efficient execution of data-related tasks.
    
    Example: Using pandas, you can read a CSV file, clean the data, and perform various analyses.
    
    Related Keywords: Data Analysis, Python, Data Processing
    
    
    
    GPT (Generative Pretrained Transformer)
    
    
    
    Definition: GPT is a generative language model pretrained on large datasets, designed for various text-based tasks. It generates natural language based on the provided input.
    
    Example: A chatbot that generates detailed answers to user queries can leverage the GPT model.
    
    Related Keywords: Natural Language Processing (NLP), Text Generation, Deep Learning
    
    
    
    InstructGPT
    
    
    
    Definition: InstructGPT is a GPT model optimized to perform specific tasks based on user instructions. It is designed to generate more accurate and contextually relevant results.
    
    Example: If a user provides an instruction like "draft an email," InstructGPT generates an email based on the provided context.
    
    Related Keywords: Artificial Intelligence (AI), Natural Language Understanding (NLU), Instruction-Based Processing
    
    
    
    Keyword Search
    
    
    
    Definition: Keyword search is the process of finding information based on user-entered keywords. It is the fundamental search method used in most search engines and database systems.
    
    Example: When a user searches for "coffee shop Seoul," the search returns a list of relevant coffee shops in Seoul.
    
    Related Keywords: Search Engine, Data Search, Information Retrieval
    
    
    
    Page Rank
    
    
    
    Definition: Page Rank is an algorithm that evaluates the importance of web pages, primarily used to determine the ranking of search engine results. It analyzes the link structure between web pages to assess their importance.
    
    Example: Google’s search engine uses the Page Rank algorithm to determine the order of its search results.
    
    Related Keywords: Search Engine Optimization (SEO), Web Analytics, Link Analysis
    
    
    
    Data Mining
    
    
    
    Definition: Data mining is the process of extracting useful information from large datasets by utilizing techniques such as statistics, machine learning, and pattern recognition.
    
    Example: Analyzing customer purchase data to develop sales strategies in retail is a common example of data mining.
    
    Related Keywords: Big Data, Pattern Recognition, Predictive Analytics
    
    
    
    Multimodal
    
    
    
    Definition: Multimodal refers to technologies that combine and process multiple types of data modalities, such as text, images, and audio. It is used to extract or predict richer and more accurate information through the interaction of different data formats.
    
    Example: A system that analyzes both images and descriptive text to perform more accurate image classification is an example of multimodal technology.
    
    Related Keywords: Data Fusion, Artificial Intelligence (AI), Deep Learning' metadata={'source': 'data/sample-word-document_eng.docx'}

UnstructuredWordDocumentLoader

Used Library : A comprehensive document analysis library called unstructured .

Key Features :

Capable of understanding the structure of a document, such as titles and body, and separating them into distinct elements.
Allows hierarchical representation and detailed processing of documents.
Extracts meaningful information from unstructured data and transforms it into structured formats.

Use Case :

When you need to extract text while preserving the document's structure, formatting, and metadata.
Suitable for handling complex document structures or converting unstructured data into structured formats.

Parameter

Option

Description

mode

single (default)

Returns the entire document as a single Document object.

elements

Splits the document into elements (e.g., title, body) and returns each as a Document object.

strategy

None (default)

No specific strategy is applied.

fast

Prioritizes speed (may reduce accuracy).

hi_res

Prioritizes high accuracy (slower processing).

include_page_breaks

True (default)

Detects page breaks and adds PageBreak elements.

False

Ignores page breaks.

infer_table_structure

True (default)

Infers table structure and includes it in HTML format.

False

Does not infer table structure.

starting_page_number

1 (default)

Specifies the starting page number of the document.

mode: Single (default)

In this mode, the entire document is returned as a single LangChain Document object. In other words, all the content of the document is contained within a single object.

from langchain_community.document_loaders import UnstructuredWordDocumentLoader

# Initialize the document loader
loader = UnstructuredWordDocumentLoader("data/sample-word-document_eng.docx")

# Load the document
docs = loader.load()

# Print the number of documents
print(f"Document Count: {len(docs)}\n")

# Print the type of the loader
print(f"Type of loader: {type(loader)}\n")

# Print the metadata of the document
print(f"Document Metadata: {docs[0].metadata}\n")

# Note: The entire docx file is converted into a single document.
# It needs to be split into smaller parts using a text splitter.
print("Document Content")
print(docs[0])

Document Count: 1
    
    Type of loader: 
    
    Document Metadata: {'source': 'data/sample-word-document_eng.docx'}
    
    Document Content
    page_content='Semantic Search
    
    Definition: Semantic search is a search methodology that goes beyond simple keyword matching to understand the meaning of a user's query and return relevant results.
    
    Example: When a user searches for "planets in the solar system," the search returns information about planets like "Jupiter" and "Mars."
    
    Related Keywords: Natural Language Processing (NLP), Search Algorithms, Data Mining
    
    Embedding
    
    Definition: Embedding refers to the process of transforming textual data, such as words or sentences, into low-dimensional continuous vectors. This allows computers to understand and process text effectively.
    
    Example: The word "apple" might be represented as a vector like [0.65, -0.23, 0.17].
    
    Related Keywords: Natural Language Processing (NLP), Vectorization, Deep Learning
    
    Token
    
    Definition: A token is a smaller unit obtained by splitting a piece of text. This unit can be a word, a sentence, or a phrase.
    
    Example: The sentence "I go to school" can be tokenized into "I," "go," and "to school."
    
    Related Keywords: Tokenization, Natural Language Processing (NLP), Parsing
    
    Tokenizer
    
    Definition: A tokenizer is a tool that splits textual data into tokens. It is used in natural language processing to preprocess data.
    
    Example: The sentence "I love programming." is split into ["I", "love", "programming", "."].
    
    Related Keywords: Tokenization, Natural Language Processing (NLP), Parsing
    
    VectorStore
    
    Definition: A VectorStore is a system for storing data in vector format. It is used for tasks like search, classification, and other data analysis operations.
    
    Example: Storing word embedding vectors in a database for fast retrieval.
    
    Related Keywords: Embedding, Database, Vectorization
    
    SQL
    
    
    
    Definition: SQL (Structured Query Language) is a programming language used to manage data in databases. It allows operations like querying, updating, inserting, and deleting data.
    
    Example: SELECT * FROM users WHERE age > 18; retrieves information about users older than 18.
    
    Related Keywords: Database, Query, Data Management
    
    CSV
    
    Definition: CSV (Comma-Separated Values) is a file format used to store data, where each value is separated by a comma. It is commonly used for saving and exchanging tabular data.
    
    Example: A CSV file with headers "Name," "Age," and "Occupation" might include data like "Hong Gil-dong, 30, Developer."
    
    Related Keywords: Data Format, File Handling, Data Exchange
    
    JSON
    
    Definition: JSON (JavaScript Object Notation) is a lightweight data exchange format that uses human-readable text to represent data objects. It is widely used for data communication between systems.
    
    Example: {"Name": "Hong Gil-dong", "Age": 30, "Occupation": "Developer"} is an example of JSON data.
    
    Related Keywords: Data Exchange, Web Development, API
    
    Transformer
    
    Definition: A Transformer is a type of deep learning model used in natural language processing for tasks like translation, summarization, and text generation. It is based on the Attention mechanism.
    
    Example: Google Translate uses Transformer models to perform translations between various languages.
    
    Related Keywords: Deep Learning, Natural Language Processing (NLP), Attention
    
    HuggingFace
    
    Definition: HuggingFace is a library offering a variety of pre-trained models and tools for natural language processing, enabling researchers and developers to easily perform NLP tasks.
    
    Example: HuggingFace's Transformers library can be used for tasks such as sentiment analysis and text generation.
    
    Related Keywords: Natural Language Processing (NLP), Deep Learning, Library
    
    Digital Transformation
    
    Definition: Digital transformation is the process of leveraging technology to innovate and enhance a company’s services, culture, and operations. It focuses on improving business models and increasing competitiveness through digital technologies.
    
    Example: A company adopting cloud computing to revolutionize data storage and processing is an example of digital transformation.
    
    Related Keywords: Innovation, Technology, Business Model
    
    Crawling
    
    Definition: Crawling is the process of automatically visiting web pages to collect data. It is commonly used for search engine optimization and data analysis.
    
    Example: Google’s search engine crawls websites on the internet to collect and index content.
    
    Related Keywords: Data Collection, Web Scraping, Search Engine
    
    Word2Vec
    
    Definition: Word2Vec is a natural language processing technique that maps words to a vector space, representing semantic relationships between words. It generates vectors based on the contextual similarity of words.
    
    Example: In a Word2Vec model, "king" and "queen" are represented as vectors that are close to each other in the vector space.
    
    Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity
    
    LLM (Large Language Model)
    
    Definition: LLM refers to large-scale language models trained on extensive text datasets. These models are used for a variety of natural language understanding and generation tasks.
    
    Example: OpenAI’s GPT series is a prominent example of large language models.
    
    Related Keywords: Natural Language Processing (NLP), Deep Learning, Text Generation
    
    FAISS (Facebook AI Similarity Search)
    
    Definition: FAISS is a high-speed similarity search library developed by Facebook, designed to efficiently search for similar vectors in large vector datasets.
    
    Example: FAISS can be used to quickly find similar images among millions of image vectors.
    
    Related Keywords: Vector Search, Machine Learning, Database Optimization
    
    Open Source
    
    Definition: Open source refers to software whose source code is openly available for anyone to use, modify, and distribute freely. It plays a significant role in fostering collaboration and innovation.
    
    Example: The Linux operating system is a notable open-source project.
    
    Related Keywords: Software Development, Community, Technical Collaboration
    
    Structured Data
    
    Definition: Structured data is data that is organized according to a predefined format or schema, making it easy to search and analyze in systems like databases and spreadsheets.
    
    Example: A customer information table stored in a relational database is an example of structured data.
    
    Related Keywords: Database, Data Analysis, Data Modeling
    
    Parser
    
    Definition: A parser is a tool that analyzes input data (e.g., strings, files) and converts it into a structured format. It is commonly used in syntax analysis for programming languages or processing file data.
    
    Example: Parsing an HTML document to create the DOM structure of a web page is an example of parsing.
    
    Related Keywords: Syntax Analysis, Compiler, Data Processing
    
    TF-IDF (Term Frequency-Inverse Document Frequency)
    
    Definition: TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents. It considers both the frequency of the word in the document and its rarity across all documents.
    
    Example: Words that appear frequently in a document but rarely in the corpus will have a high TF-IDF value.
    
    Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining
    
    Deep Learning
    
    Definition: Deep learning is a subset of machine learning that uses artificial neural networks to solve complex problems. It focuses on learning high-level representations from data.
    
    Example: Deep learning models are applied in tasks like image recognition, speech recognition, and natural language processing.
    
    Related Keywords: Artificial Neural Networks, Machine Learning, Data Analysis
    
    Schema
    
    Definition: A schema defines the structure of a database or file, providing a blueprint for how data is stored and organized.
    
    Example: A relational database table schema specifies column names, data types, and key constraints.
    
    Related Keywords: Database, Data Modeling, Data Management
    
    DataFrame
    
    Definition: A DataFrame is a tabular data structure composed of rows and columns, commonly used for data analysis and processing.
    
    Example: In the pandas library, a DataFrame can contain columns of various data types, making data manipulation and analysis straightforward.
    
    Related Keywords: Data Analysis, pandas, Data Processing
    
    Attention Mechanism
    
    Definition: The Attention mechanism is a deep learning technique that allows models to focus on the most relevant parts of input data. It is mainly used in sequence data, such as text or time series.
    
    Example: In translation models, the Attention mechanism helps the model focus on the crucial parts of the input sentence to produce accurate translations.
    
    Related Keywords: Deep Learning, Natural Language Processing (NLP), Sequence Modeling
    
    Pandas
    
    Definition: Pandas is a Python library that provides tools for data analysis and manipulation. It enables efficient execution of data-related tasks.
    
    Example: Using pandas, you can read a CSV file, clean the data, and perform various analyses.
    
    Related Keywords: Data Analysis, Python, Data Processing
    
    GPT (Generative Pretrained Transformer)
    
    Definition: GPT is a generative language model pretrained on large datasets, designed for various text-based tasks. It generates natural language based on the provided input.
    
    Example: A chatbot that generates detailed answers to user queries can leverage the GPT model.
    
    Related Keywords: Natural Language Processing (NLP), Text Generation, Deep Learning
    
    InstructGPT
    
    Definition: InstructGPT is a GPT model optimized to perform specific tasks based on user instructions. It is designed to generate more accurate and contextually relevant results.
    
    Example: If a user provides an instruction like "draft an email," InstructGPT generates an email based on the provided context.
    
    Related Keywords: Artificial Intelligence (AI), Natural Language Understanding (NLU), Instruction-Based Processing
    
    Keyword Search
    
    Definition: Keyword search is the process of finding information based on user-entered keywords. It is the fundamental search method used in most search engines and database systems.
    
    Example: When a user searches for "coffee shop Seoul," the search returns a list of relevant coffee shops in Seoul.
    
    Related Keywords: Search Engine, Data Search, Information Retrieval
    
    Page Rank
    
    Definition: Page Rank is an algorithm that evaluates the importance of web pages, primarily used to determine the ranking of search engine results. It analyzes the link structure between web pages to assess their importance.
    
    Example: Google’s search engine uses the Page Rank algorithm to determine the order of its search results.
    
    Related Keywords: Search Engine Optimization (SEO), Web Analytics, Link Analysis
    
    Data Mining
    
    Definition: Data mining is the process of extracting useful information from large datasets by utilizing techniques such as statistics, machine learning, and pattern recognition.
    
    Example: Analyzing customer purchase data to develop sales strategies in retail is a common example of data mining.
    
    Related Keywords: Big Data, Pattern Recognition, Predictive Analytics
    
    Multimodal
    
    Definition: Multimodal refers to technologies that combine and process multiple types of data modalities, such as text, images, and audio. It is used to extract or predict richer and more accurate information through the interaction of different data formats.
    
    Example: A system that analyzes both images and descriptive text to perform more accurate image 
    
    
    
    classification is an example of multimodal technology.
    
    Related Keywords: Data Fusion, Artificial Intelligence (AI), Deep Learning' metadata={'source': 'data/sample-word-document_eng.docx'}

mode: elements

The document is divided into individual elements, such as Title and NarrativeText. Each element is returned as a separate Document object, allowing for more detailed analysis or processing of the document's structure.

from langchain_community.document_loaders import UnstructuredWordDocumentLoader

# Initialize the document loader with "elements" mode
loader = UnstructuredWordDocumentLoader(
    "data/sample-word-document_eng.docx", mode="elements"
)

# Load the document
docs = loader.load()

# Print the number of documents
print(
    f"Document Count: {len(docs)}\n"
)  # Using "elements" mode, each element is converted into a separate Document object

# Print the type of the loader
print(f"Type of loader: {type(loader)}\n")

# Print the metadata of the first document element
print(f"Document Metadata: {docs[0].metadata}\n")

# Print the content of the first document element
print("Document Content")
print(docs[0])

Document Count: 123
    
    Type of loader: 
    
    Document Metadata: {'source': 'data/sample-word-document_eng.docx', 'category_depth': 0, 'file_directory': 'data', 'filename': 'sample-word-document_eng.docx', 'last_modified': '2024-12-29T15:35:49', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': 'ae1fc619a1205c0a9c3f6876535ffc46'}
    
    Document Content
    page_content='Semantic Search' metadata={'source': 'data/sample-word-document_eng.docx', 'category_depth': 0, 'file_directory': 'data', 'filename': 'sample-word-document_eng.docx', 'last_modified': '2024-12-29T15:35:49', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': 'ae1fc619a1205c0a9c3f6876535ffc46'}

# Print the content of the second document
print(docs[1].page_content)

Definition: Semantic search is a search methodology that goes beyond simple keyword matching to understand the meaning of a user's query and return relevant results.

Efficient Document Loader Configuration with Various Parameter Combinations

By combining various parameters, you can configure a document loader that fits your specific needs efficiently. Adjusting settings such as mode , strategy , and include_page_breaks allows for tailored handling of different document structures and processing requirements.

from langchain_community.document_loaders import UnstructuredWordDocumentLoader

# Initialize the document loader with specific parameters
loader = UnstructuredWordDocumentLoader(
    "data/sample-word-document_eng.docx",
    strategy="fast",  # Prioritize fast processing
    include_page_breaks=True,  # Include page breaks as PageBreak elements
    infer_table_structure=True,  # Infer table structures and include in HTML format
    starting_page_number=1,  # Start page numbering from 1
)

# Load the document
docs = loader.load()

# Print the number of documents
print(f"Document Count: {len(docs)}\n")

# Print the type of the loader
print(f"Type of loader: {type(loader)}\n")

# Print the metadata of the first document
print(f"Document Metadata: {docs[0].metadata}\n")

# Print the content of the first document
print("Document Content")
print(docs[0])

Document Count: 1
    
    Type of loader: 
    
    Document Metadata: {'source': 'data/sample-word-document_eng.docx'}
    
    Document Content
    page_content='Semantic Search
    
    Definition: Semantic search is a search methodology that goes beyond simple keyword matching to understand the meaning of a user's query and return relevant results.
    
    Example: When a user searches for "planets in the solar system," the search returns information about planets like "Jupiter" and "Mars."
    
    Related Keywords: Natural Language Processing (NLP), Search Algorithms, Data Mining
    
    Embedding
    
    Definition: Embedding refers to the process of transforming textual data, such as words or sentences, into low-dimensional continuous vectors. This allows computers to understand and process text effectively.
    
    Example: The word "apple" might be represented as a vector like [0.65, -0.23, 0.17].
    
    Related Keywords: Natural Language Processing (NLP), Vectorization, Deep Learning
    
    Token
    
    Definition: A token is a smaller unit obtained by splitting a piece of text. This unit can be a word, a sentence, or a phrase.
    
    Example: The sentence "I go to school" can be tokenized into "I," "go," and "to school."
    
    Related Keywords: Tokenization, Natural Language Processing (NLP), Parsing
    
    Tokenizer
    
    Definition: A tokenizer is a tool that splits textual data into tokens. It is used in natural language processing to preprocess data.
    
    Example: The sentence "I love programming." is split into ["I", "love", "programming", "."].
    
    Related Keywords: Tokenization, Natural Language Processing (NLP), Parsing
    
    VectorStore
    
    Definition: A VectorStore is a system for storing data in vector format. It is used for tasks like search, classification, and other data analysis operations.
    
    Example: Storing word embedding vectors in a database for fast retrieval.
    
    Related Keywords: Embedding, Database, Vectorization
    
    SQL
    
    
    
    Definition: SQL (Structured Query Language) is a programming language used to manage data in databases. It allows operations like querying, updating, inserting, and deleting data.
    
    Example: SELECT * FROM users WHERE age > 18; retrieves information about users older than 18.
    
    Related Keywords: Database, Query, Data Management
    
    CSV
    
    Definition: CSV (Comma-Separated Values) is a file format used to store data, where each value is separated by a comma. It is commonly used for saving and exchanging tabular data.
    
    Example: A CSV file with headers "Name," "Age," and "Occupation" might include data like "Hong Gil-dong, 30, Developer."
    
    Related Keywords: Data Format, File Handling, Data Exchange
    
    JSON
    
    Definition: JSON (JavaScript Object Notation) is a lightweight data exchange format that uses human-readable text to represent data objects. It is widely used for data communication between systems.
    
    Example: {"Name": "Hong Gil-dong", "Age": 30, "Occupation": "Developer"} is an example of JSON data.
    
    Related Keywords: Data Exchange, Web Development, API
    
    Transformer
    
    Definition: A Transformer is a type of deep learning model used in natural language processing for tasks like translation, summarization, and text generation. It is based on the Attention mechanism.
    
    Example: Google Translate uses Transformer models to perform translations between various languages.
    
    Related Keywords: Deep Learning, Natural Language Processing (NLP), Attention
    
    HuggingFace
    
    Definition: HuggingFace is a library offering a variety of pre-trained models and tools for natural language processing, enabling researchers and developers to easily perform NLP tasks.
    
    Example: HuggingFace's Transformers library can be used for tasks such as sentiment analysis and text generation.
    
    Related Keywords: Natural Language Processing (NLP), Deep Learning, Library
    
    Digital Transformation
    
    Definition: Digital transformation is the process of leveraging technology to innovate and enhance a company’s services, culture, and operations. It focuses on improving business models and increasing competitiveness through digital technologies.
    
    Example: A company adopting cloud computing to revolutionize data storage and processing is an example of digital transformation.
    
    Related Keywords: Innovation, Technology, Business Model
    
    Crawling
    
    Definition: Crawling is the process of automatically visiting web pages to collect data. It is commonly used for search engine optimization and data analysis.
    
    Example: Google’s search engine crawls websites on the internet to collect and index content.
    
    Related Keywords: Data Collection, Web Scraping, Search Engine
    
    Word2Vec
    
    Definition: Word2Vec is a natural language processing technique that maps words to a vector space, representing semantic relationships between words. It generates vectors based on the contextual similarity of words.
    
    Example: In a Word2Vec model, "king" and "queen" are represented as vectors that are close to each other in the vector space.
    
    Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity
    
    LLM (Large Language Model)
    
    Definition: LLM refers to large-scale language models trained on extensive text datasets. These models are used for a variety of natural language understanding and generation tasks.
    
    Example: OpenAI’s GPT series is a prominent example of large language models.
    
    Related Keywords: Natural Language Processing (NLP), Deep Learning, Text Generation
    
    FAISS (Facebook AI Similarity Search)
    
    Definition: FAISS is a high-speed similarity search library developed by Facebook, designed to efficiently search for similar vectors in large vector datasets.
    
    Example: FAISS can be used to quickly find similar images among millions of image vectors.
    
    Related Keywords: Vector Search, Machine Learning, Database Optimization
    
    Open Source
    
    Definition: Open source refers to software whose source code is openly available for anyone to use, modify, and distribute freely. It plays a significant role in fostering collaboration and innovation.
    
    Example: The Linux operating system is a notable open-source project.
    
    Related Keywords: Software Development, Community, Technical Collaboration
    
    Structured Data
    
    Definition: Structured data is data that is organized according to a predefined format or schema, making it easy to search and analyze in systems like databases and spreadsheets.
    
    Example: A customer information table stored in a relational database is an example of structured data.
    
    Related Keywords: Database, Data Analysis, Data Modeling
    
    Parser
    
    Definition: A parser is a tool that analyzes input data (e.g., strings, files) and converts it into a structured format. It is commonly used in syntax analysis for programming languages or processing file data.
    
    Example: Parsing an HTML document to create the DOM structure of a web page is an example of parsing.
    
    Related Keywords: Syntax Analysis, Compiler, Data Processing
    
    TF-IDF (Term Frequency-Inverse Document Frequency)
    
    Definition: TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents. It considers both the frequency of the word in the document and its rarity across all documents.
    
    Example: Words that appear frequently in a document but rarely in the corpus will have a high TF-IDF value.
    
    Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining
    
    Deep Learning
    
    Definition: Deep learning is a subset of machine learning that uses artificial neural networks to solve complex problems. It focuses on learning high-level representations from data.
    
    Example: Deep learning models are applied in tasks like image recognition, speech recognition, and natural language processing.
    
    Related Keywords: Artificial Neural Networks, Machine Learning, Data Analysis
    
    Schema
    
    Definition: A schema defines the structure of a database or file, providing a blueprint for how data is stored and organized.
    
    Example: A relational database table schema specifies column names, data types, and key constraints.
    
    Related Keywords: Database, Data Modeling, Data Management
    
    DataFrame
    
    Definition: A DataFrame is a tabular data structure composed of rows and columns, commonly used for data analysis and processing.
    
    Example: In the pandas library, a DataFrame can contain columns of various data types, making data manipulation and analysis straightforward.
    
    Related Keywords: Data Analysis, pandas, Data Processing
    
    Attention Mechanism
    
    Definition: The Attention mechanism is a deep learning technique that allows models to focus on the most relevant parts of input data. It is mainly used in sequence data, such as text or time series.
    
    Example: In translation models, the Attention mechanism helps the model focus on the crucial parts of the input sentence to produce accurate translations.
    
    Related Keywords: Deep Learning, Natural Language Processing (NLP), Sequence Modeling
    
    Pandas
    
    Definition: Pandas is a Python library that provides tools for data analysis and manipulation. It enables efficient execution of data-related tasks.
    
    Example: Using pandas, you can read a CSV file, clean the data, and perform various analyses.
    
    Related Keywords: Data Analysis, Python, Data Processing
    
    GPT (Generative Pretrained Transformer)
    
    Definition: GPT is a generative language model pretrained on large datasets, designed for various text-based tasks. It generates natural language based on the provided input.
    
    Example: A chatbot that generates detailed answers to user queries can leverage the GPT model.
    
    Related Keywords: Natural Language Processing (NLP), Text Generation, Deep Learning
    
    InstructGPT
    
    Definition: InstructGPT is a GPT model optimized to perform specific tasks based on user instructions. It is designed to generate more accurate and contextually relevant results.
    
    Example: If a user provides an instruction like "draft an email," InstructGPT generates an email based on the provided context.
    
    Related Keywords: Artificial Intelligence (AI), Natural Language Understanding (NLU), Instruction-Based Processing
    
    Keyword Search
    
    Definition: Keyword search is the process of finding information based on user-entered keywords. It is the fundamental search method used in most search engines and database systems.
    
    Example: When a user searches for "coffee shop Seoul," the search returns a list of relevant coffee shops in Seoul.
    
    Related Keywords: Search Engine, Data Search, Information Retrieval
    
    Page Rank
    
    Definition: Page Rank is an algorithm that evaluates the importance of web pages, primarily used to determine the ranking of search engine results. It analyzes the link structure between web pages to assess their importance.
    
    Example: Google’s search engine uses the Page Rank algorithm to determine the order of its search results.
    
    Related Keywords: Search Engine Optimization (SEO), Web Analytics, Link Analysis
    
    Data Mining
    
    Definition: Data mining is the process of extracting useful information from large datasets by utilizing techniques such as statistics, machine learning, and pattern recognition.
    
    Example: Analyzing customer purchase data to develop sales strategies in retail is a common example of data mining.
    
    Related Keywords: Big Data, Pattern Recognition, Predictive Analytics
    
    Multimodal
    
    Definition: Multimodal refers to technologies that combine and process multiple types of data modalities, such as text, images, and audio. It is used to extract or predict richer and more accurate information through the interaction of different data formats.
    
    Example: A system that analyzes both images and descriptive text to perform more accurate image 
    
    
    
    classification is an example of multimodal technology.
    
    Related Keywords: Data Fusion, Artificial Intelligence (AI), Deep Learning' metadata={'source': 'data/sample-word-document_eng.docx'}

PreviousExcel File Loading in LangChain NextMicrosoft PowerPoint

Last updated 1 month ago