Document & Document Loader


Overview

This tutorial covers the fundamental methods for loading Documents.

By completing this tutorial, you will learn how to load Documents and check their content and associated metadata.

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials.

  • You can check out the langchain-opentutorial for more details.

Alternatively, you can set API keys such as OPENAI_API_KEY in a .env file and load them from there.

[Note] This is not necessary if you've already set the required API keys in previous steps.

Document

Document is a class for storing a piece of text and its associated metadata.

  • page_content (Required): Stores a piece of text as a string.

  • metadata (Optional): Stores metadata related to page_content as a dictionary.

A Document created without metadata has an empty metadata dictionary. Let's add some values.
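The snippet below is a minimal sketch of creating a Document, checking its two fields, and then adding metadata values. It imports Document from langchain_core.documents when LangChain is installed. The dataclass fallback is only an assumption added so the sketch also runs without LangChain, and the sample text, file path, and page number are made up for illustration.

```python
from dataclasses import dataclass, field

try:
    # Real class, used when LangChain is installed.
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two fields, so the sketch still runs without LangChain.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

document = Document(page_content="Hello, welcome to the Document tutorial!")

print(document.page_content)
print(document.metadata)  # empty dictionary, because no metadata was passed

# metadata is a dictionary, so values can be added by key.
document.metadata["source"] = "./example-file.pdf"  # illustrative value
document.metadata["page"] = 0                        # illustrative value
print(document.metadata)
```

Metadata can also be passed at construction time through the metadata argument instead of being added afterwards.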

Document Loader

A Document Loader is a class that loads Documents from various sources.

Listed below are some examples of Document Loaders.

  • PyPDFLoader: Loads PDF files

  • CSVLoader: Loads CSV files

  • UnstructuredHTMLLoader: Loads HTML files

  • JSONLoader: Loads JSON files

  • TextLoader: Loads text files

  • DirectoryLoader: Loads documents from a directory

Now, let's learn how to load Documents.

load()

  • Loads Documents and returns them as a list[Document].
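As a rough illustration of the load() contract, here is a standard-library sketch. Doc and SketchTextLoader are hypothetical stand-ins, not LangChain classes. They only mimic the idea of reading a source eagerly and returning every Document in one list. The temporary file and its contents are made up for the example.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SketchTextLoader:
    """Hypothetical loader; real loaders such as TextLoader do more work."""

    def __init__(self, file_path):
        self.file_path = Path(file_path)

    def load(self):
        # Read the whole file eagerly and wrap it in a single Doc.
        text = self.file_path.read_text(encoding="utf-8")
        return [Doc(page_content=text, metadata={"source": str(self.file_path)})]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "note.txt"
    path.write_text("first line\nsecond line\n", encoding="utf-8")
    docs = SketchTextLoader(path).load()

print(type(docs).__name__, len(docs))
print(docs[0].metadata)
```

Because everything is returned at once, the result can be indexed and measured with len() like any other list.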

aload()

  • Asynchronously loads Documents and returns them as a list[Document].
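To show why an asynchronous variant is useful, the sketch below awaits two loaders concurrently. Doc and SketchLoader are hypothetical stand-ins, and running the blocking load() in a worker thread is an assumption made for this example, not a statement about how any particular LangChain loader implements aload().

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SketchLoader:
    """Hypothetical loader over in-memory strings."""

    def __init__(self, texts):
        self.texts = texts

    def load(self):
        return [Doc(text, {"index": i}) for i, text in enumerate(self.texts)]

    async def aload(self):
        # Run the blocking load() in a worker thread so the event loop stays free.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.load)


async def main():
    first = SketchLoader(["alpha", "beta"])
    second = SketchLoader(["gamma"])
    # Both loads are awaited together; gather keeps results in argument order.
    return await asyncio.gather(first.aload(), second.aload())


results = asyncio.run(main())
print([len(docs) for docs in results])
```

In a notebook, where an event loop is already running, you would typically write `await loader.aload()` directly instead of calling asyncio.run().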

load_and_split()

  • Loads Documents, automatically splits them into chunks using a TextSplitter, and returns the chunks as a list[Document].
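The load-then-split idea can be sketched with the standard library alone. Doc, SketchLoader, split_text, and the chunk_size parameter are all hypothetical names. The fixed-width splitter here is a deliberately naive substitute for a real TextSplitter, which is more careful about where it cuts.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_text(text, chunk_size):
    """Naive fixed-width splitter, used only for illustration."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


class SketchLoader:
    """Hypothetical loader over a single in-memory string."""

    def __init__(self, text, source):
        self.text = text
        self.source = source

    def load(self):
        return [Doc(self.text, {"source": self.source})]

    def load_and_split(self, chunk_size=20):
        chunks = []
        for doc in self.load():
            for piece in split_text(doc.page_content, chunk_size):
                # Each chunk gets its own copy of the parent's metadata.
                chunks.append(Doc(piece, dict(doc.metadata)))
        return chunks


loader = SketchLoader("a" * 50, source="memory")
chunks = loader.load_and_split(chunk_size=20)
print([len(chunk.page_content) for chunk in chunks])  # [20, 20, 10]
```

Note that every chunk is itself a Document, so downstream code can treat split and unsplit results the same way.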

lazy_load()

  • Loads Documents sequentially and returns them as an Iterator[Document].

This method operates as a generator: a special type of iterator that produces values on the fly, without storing them all in memory at once.
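The generator behaviour can be seen with a small standard-library sketch. The lazy_load function below is a hypothetical stand-in for a loader method and yields plain dictionaries rather than real Documents. The produced list exists only to record when each item is actually built.

```python
import inspect

produced = []  # records the moment each item is built


def lazy_load(lines):
    """Hypothetical stand-in for a loader's lazy_load(): yields one record at a time."""
    for number, line in enumerate(lines, start=1):
        produced.append(number)
        yield {"page_content": line, "metadata": {"line": number}}


docs = lazy_load(["one", "two", "three"])

print(inspect.isgenerator(docs))  # True
print(produced)                   # [] because nothing has been built yet

first = next(docs)                # builds only the first record
print(produced)                   # [1]

rest = list(docs)                 # consumes the remaining records
print(len(rest))                  # 2
```

This is why lazy loading suits large sources: you can stop early, or process each Document and discard it, without ever holding the full list.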

alazy_load()

  • Asynchronously loads Documents sequentially and returns them as an AsyncIterator[Document].

This method operates as an async_generator: a special type of asynchronous iterator that produces values on the fly, without storing them all in memory at once.
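An async generator is consumed with `async for` rather than a plain `for`. The sketch below uses only the standard library. The alazy_load function is a hypothetical stand-in that yields plain dictionaries, and the asyncio.sleep(0) call merely stands in for real asynchronous I/O.

```python
import asyncio
import inspect


async def alazy_load(lines):
    """Hypothetical stand-in for a loader's alazy_load()."""
    for number, line in enumerate(lines, start=1):
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield {"page_content": line, "metadata": {"line": number}}


async def main():
    stream = alazy_load(["one", "two"])
    is_async_gen = inspect.isasyncgen(stream)
    # Items arrive one at a time; other tasks may run between them.
    collected = [doc async for doc in stream]
    return is_async_gen, collected


is_async_gen, collected = asyncio.run(main())
print(is_async_gen)
print([doc["page_content"] for doc in collected])
```

Collecting everything into a list is done here only to inspect the result. In practice each item would usually be handled inside the `async for` loop.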
