Excel File Loading in LangChain

Author: Hwayoung Cha
Peer Review :
Proofread : Youngjun cho
This is a part of LangChain Open Tutorial

Overview

This tutorial covers the process of loading and handling Microsoft Excel files in LangChain .

It focuses on two primary methods: UnstructuredExcelLoader for raw text extraction and DataFrameLoader for structured data processing.

The guide aims to help developers effectively integrate Excel data into their LangChain projects, covering both basic and advanced usage scenarios.

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.
You can checkout the langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial

# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_community",
        "unstructured",
        "openpyxl"
    ],
    verbose=False,
    upgrade=False,
)

`UnstructuredExcelLoader`

UnstructuredExcelLoader is used to load Microsoft Excel files.

This loader works with both .xlsx and .xls files.

When the loader is used in mode="elements" , an HTML representation of the Excel file is provided under the text_as_html key in the document metadata.

# install
# %pip install -qU langchain-community unstructured openpyxl

import sys
from langchain_community.document_loaders import UnstructuredExcelLoader

# Set recursion limit
sys.setrecursionlimit(10**6)    

# Create UnstructuredExcelLoader 
loader = UnstructuredExcelLoader("./data/titanic.xlsx", mode="elements")

# Load a document
docs = loader.load()

# Print the number of documents
print(len(docs))

This confirms that one document has been loaded.

The page_content contains the data from each row, while the text_as_html in the metadata stores the data in HTML format.

# Print the document
print(docs[0].page_content[:200])

# Print the text_as_html of metadata
print(docs[0].metadata["text_as_html"][:1000])

`DataFrameLoader`

Similar to CSV files, we can load Excel files by using the read_excel() function to create a pandas.DataFrame, and then load it.

import pandas as pd

# read the Excel file
df = pd.read_excel("./data/titanic.xlsx")

from langchain_community.document_loaders import DataFrameLoader

# Set up DataFrame loader, specifying the page content column
loader = DataFrameLoader(df, page_content_column="Name")

# Load the document
docs = loader.load()

# Print the data
print(docs[0].page_content)

# Print the metadata
print(docs[0].metadata)

PreviousCSV Loader NextMicrosoft Word(doc, docx) With Langchain

Last updated 3 months ago

Overview

Table of Contents

Environment Setup

UnstructuredExcelLoader

DataFrameLoader

`UnstructuredExcelLoader`

`DataFrameLoader`