Document & Document Loader

  • Author: Jaemin Hong

  • Peer Review: Taylor(Jihyun Kim), Yejin Park

  • Proofread: Q0211

  • This is a part of LangChain Open Tutorial

Overview

This tutorial covers the fundamental methods for loading Documents.

By completing this tutorial, you will learn how to load Documents and check their content and associated metadata.

Table of Contents

  • Overview

  • Environment Setup

  • Document

  • Document Loader

References

  • Document

  • Load Documents

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions, and utilities for tutorials.

  • You can check out the langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain_core",
        "langchain_community",
        "langchain_text_splitters",
        "pypdf",
    ],
    verbose=False,
    upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "01-DocumentLoader",
    }
)
Environment variables have been set successfully.

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)
True

Document

A class for storing a piece of text and its associated metadata.

  • page_content (Required): Stores a piece of text as a string.

  • metadata (Optional): Stores metadata related to page_content as a dictionary.

from langchain_core.documents import Document

document = Document(page_content="Hello, welcome to LangChain Open Tutorial!")

# Check the attributes using __dict__
document.__dict__
{'id': None,
     'metadata': {},
     'page_content': 'Hello, welcome to LangChain Open Tutorial!',
     'type': 'Document'}

The metadata is empty. Let's add some values.

# Add metadata
document.metadata["source"] = "./example-file.pdf"
document.metadata["page"] = 0

# Check metadata
document.metadata
{'source': './example-file.pdf', 'page': 0}
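
Metadata can also be supplied when a Document is created. The following is a minimal sketch reusing the same hypothetical source values as above:

# Create a Document with metadata supplied at construction time
from langchain_core.documents import Document

doc_with_metadata = Document(
    page_content="Hello, welcome to LangChain Open Tutorial!",
    metadata={"source": "./example-file.pdf", "page": 0},  # hypothetical example values
)

# The metadata dictionary is populated immediately
doc_with_metadata.metadata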

Document Loader

A Document Loader is a class that loads Documents from various sources.

Listed below are some examples of Document Loaders.

  • PyPDFLoader: Loads PDF files

  • CSVLoader: Loads CSV files

  • UnstructuredHTMLLoader: Loads HTML files

  • JSONLoader: Loads JSON files

  • TextLoader: Loads text files

  • DirectoryLoader: Loads documents from a directory
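
All of these loaders share the same loading interface, so switching sources is mostly a matter of swapping the loader class. The snippet below is a minimal sketch, not part of the original tutorial; it assumes a ./data directory containing plain-text (.txt) files:

from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load a single text file (hypothetical path)
text_loader = TextLoader("./data/example.txt", encoding="utf-8")

# Load every .txt file in a directory, using TextLoader for each file
dir_loader = DirectoryLoader("./data", glob="*.txt", loader_cls=TextLoader)

# Both loaders expose the same load() method used throughout this tutorial
# docs = dir_loader.load()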

Now, let's learn how to load Documents.

# Example file path
FILE_PATH = "./data/01-document-loader-sample.pdf"
from langchain_community.document_loaders import PyPDFLoader

# Set up the loader
loader = PyPDFLoader(FILE_PATH)

load()

  • Loads Documents and returns them as a list[Document].

# Load Documents
docs = loader.load()
# Check the number of loaded Documents
len(docs)
48
# Check Documents
docs[0:10]
[Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 0}, page_content=' \n \n \nOctober  2016 \n \n \n \n \n \n \n \n \n \nTHE NATIONAL  \nARTIFICIAL INTELLIGENCE \nRESEARCH AND DEVELOPMENT \nSTRATEGIC PLAN  \nNational Science and Technology Council  \n \nNetworking and Information Technology \nResearch and Development Subcommittee  \n '),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 1}, page_content=' ii  \n '),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 2}, page_content=' \n  \n iii About the National Science and Technology Council  \nThe National Science and Technology Council (NSTC) is the principal means by which the Executive \nBranch coordinates science and technology policy across the diverse entities that make up the Federal \nresearch and development (R&D) enterprise . One of the NSTC’s primary objectives is establishing clear \nnational goal s for Federal science and technology investments . The NSTC prepares R&D packages aimed \nat accomplishing multiple national goals . The NSTC’s work is organized under five committees: \nEnvironment, Natural Resources, and Sustainability; Homeland and National S ecurity; Science, \nTechnology, Engineering, and Mathematics (STEM) Education; Science; and Technology . Each of these \ncommittees oversees subcommittees and working groups that are focused on different aspects of \nscience and technology . More information is available at  www.whitehouse.gov/ostp/nstc . \nAbout the Office of Science and Technology Policy  \nThe Office of Science and Technology Policy (OSTP) was established by the National Science and \nTechnology Policy, Organization, and Priorities Act of 1976 . The mission of OSTP is threefold; first, to \nprovide the President and his senior staff with accurate, relevant, and timely scientific and technical advice \non all matters of consequence; second, to ensure th at the policies of the Executive Branch are informed by \nsound science; and third, to ensure that the scientific and technical work of the Executive Branch is \nproperly coordinated so as to provide the greatest benefit to society.  The Director of OSTP also s erves as \nAssistant to the President for Science and Technology and manages the NSTC . More information is \navailable at www.whitehouse.gov/ostp . \nAbout the Subcommittee on Networking and Information Technology  Research and \nDevelopment  \nThe Subcommittee on Networking and Information Technology Research and Development (NITRD) is a \nbody under the Committee on Technology (CoT ) of the National Science and Technology Council (NSTC). \nThe NITRD Subcommittee coordinates multiagency research and development programs to help assure \ncontinued U.S. leadership in networking and information technology, satisfy the needs of the Federal \nGovernment for advanced networking and information technology, and accelerate development and \ndeployment of advanced networking and information technology. It also implements relevant provisions \nof the High -Performance Computing Act of 1991 (P.L. 102 -194), a s amended by the Next Generation \nInternet Research Act of 1998 (P. L. 105 -305), and the America Creating Opportunities to Meaningfully \nPromote Excellence in Technology, Education and Science (COMPETES) Act of 2007 (P.L. 110 -69). For \nmore information, see www.nitrd.gov . \nAcknowledgments  \nThis document was developed through the contributions of the members and staff of the NITRD Task \nForce on Artificial Intelligence. 
A special thanks and appreciation to additional contribut ors who helped \nwrite, edit, and review the document: Chaitan Baru (NSF), Eric Daimler (Presidential Innovation Fellow), \nRonald Ferguson (DoD), Nancy Forbes (NITRD), Eric Harder (DHS), Erin Kenneally (DHS), Dai Kim (DoD), \nTatiana Korelsky (NSF), David Kuehn  (DOT), Terence Langendoen (NSF), Peter Lyster (NITRD), KC Morris \n(NIST), Hector Munoz -Avila (NSF), Thomas Rindflesch (NIH), Craig Schlenoff (NIST), Donald Sofge (NRL) , \nand Sylvia Spengler (NSF).  \n  '),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 3}, page_content=' \n  \n iv Copyright Information  \nThis is a work of the U.S. Government and is in the public domain. It may be freely distributed, copied, \nand translated; acknowledgment of publication by the Office of Science and Technology Policy  is \nappreciated. Any translation should include a disclaime r that the accuracy of the translation is the \nresponsibility of the translator and not OSTP . It is requested that a copy of any translation be sent to \nOSTP . This work is available for worldwide use and reuse and under the Creative Commons CC0 1.0 \nUniversal  license.  \n  '),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 4}, page_content=''),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 5}, page_content=' \n  \n vi  \n  '),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 6}, page_content=' \n  \n vii National Science and Technology Council  \nChair  \nJohn P. Holdren  \nAssistant to the President for Science and \nTechnology and Director, Office of Science and \nTechnology PolicyStaff  \nAfua Bruce  \nExecutive  Director  \nOffice of Science and Technology  Policy  \nSubcommittee on  \nMachine Learning and Artificial Intelligence  \nCo-Chair  \nEd Felten  \nDeputy U.S. Chief Technology Officer  \nOffice of Science and Technology Policy  \n Co-Chair  \nMichael Garris  \nSenior Scientist  \nNational Institute of Standards and Technology  \nU.S. Department of Commerce  \nSubcommittee on  \nNetworking and Information Technology Research and Development  \nCo-Chair  \nBryan Biegel  \nDirector, National C oordination Office for  \nNetworking and Information Technology  \nResearch and Development  Co-Chair  \nJames Kurose  \nAssistant Director, Computer and Information \nScience and Engineering  \nNational Science Foundation  \nNetworking and Information Technology Research and Development  \nTask Force on Artificial Intelligence  \n \nCo-Chair  \nLynne Parker  \nDivision Director  \nInformation and Intelligent System s \nNational Science Foundation  Co-Chair  \nJason Matheny  \nDirector  \nIntelligence Advanced Research Projects Activity   \n \nMembers   \nMilton Corn  \nNational Institutes of Health   \nNikunj Oza  \nNational Aeronautics and Space Administration  \nWilliam Ford  \nNational Institute of Justice  Robinson Pino  \nDepartment of Energy  \nMichael Garris  \nNational Institute of Standards  and Technology  Gregory Shannon  \nOffice of Science and Technology Policy  \nSteven Knox  \nNational Security Agency  Scott Tousley  \nDepartment of Homeland Security  '),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 7}, page_content=' \nviii \n John Launchbury  \nDefense Advanced Research Projects Agency  Faisal D’Souza  \nTechnical Coordinator  \nNational Coordination Office for Networking and \nInformation Technology Research and Development  Richard Linderman  \nOffice of the Secretary of Defense  \n '),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 8}, page_content='NATIONAL ARTIFICIAL INTELLIGENCE RESEARCH AND DEVELOPMENT STRATEGIC PLAN  \n \n 1 Contents  \nAbout the National Science and Technology Council  ................................ ................................ ..........................  iii \nAbout the Office of Science and Technology Policy  ................................ ................................ ............................  iii \nAbout the Subcommittee on Networking and Information Technology Research and Development  ................  iii \nAcknowledgments  ................................ ................................ ................................ ................................ ...............  iii \nCopyright Information  ................................ ................................ ................................ ................................ .......... iv \nNational Science and Technology Council  ................................ ................................ ................................ ...........  vii \nSubcommittee on Machine Learning and Artificial Intelligence  ................................ ................................ ..........  vii \nSubcommittee on Networking and Information Technology Research and Development  ................................ . vii \nTask Force on  Artificial Intelligence  ................................ ................................ ................................ .....................  vii \nExecutive Summary  ................................ ................................ ................................ ................................ ...................  3 \nIntroduction  ................................ ................................ ................................ ................................ ...............................  5 \nPurpose of the National AI R&D Strategic Plan  ................................ ................................ ................................  5 \nDesired Outcome  ................................ ................................ ................................ ................................ .............  7 \nA Vision for Advancing our National Priorities with AI  ................................ ................................ ....................  8 \nCurrent State of AI ................................ ................................ ................................ ................................ ..........  12 \nR&D Strategy  ................................ ................................ ................................ ................................ ...........................  15 \nStrategy 1: Make  Long-Term Investments in AI Research  ................................ ................................ .............  16 \nStrategy 2: Develop Effective Methods for Human -AI Collaboration   ................................ ...........................  22 \nStrategy 3 : Understand and Address the Ethical, Le gal, and Societal Implications of AI  ...............................  26 \nStrategy 4 : Ensure the Safety and Security of AI Systems  ................................ ................................ ..............  27 \nStrategy 5: Develop Shared Public Datasets and Environments for AI Training and Testing  .........................  30 \nStrategy 6: Measure and Evaluate AI Technologies through Standards and Benchmarks .............................  
32 \nStrategy 7 : Better Understand the National AI R&D Workforce Needs  ................................ .........................  35 \nRecommendations  ................................ ................................ ................................ ................................ ...................  37 \nAcronyms  ................................ ................................ ................................ ................................ ................................ . 39 '),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 9}, page_content='NATIONAL ARTIFICIAL INTELLIGENCE RESEARCH AND DEVELOPMENT STRATEGIC PLAN  \n \n 2   ')]

aload()

  • Asynchronously loads Documents and returns them as a list[Document].

# Load Documents asynchronously
docs = await loader.aload()
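
The await expression above runs as-is in a notebook or any other async context. In a plain Python script there is no running event loop, so the coroutine has to be driven explicitly; a minimal sketch (assuming the same sample PDF path) looks like this:

import asyncio

from langchain_community.document_loaders import PyPDFLoader


async def load_documents():
    loader = PyPDFLoader("./data/01-document-loader-sample.pdf")
    # aload() awaits the asynchronous loading and returns a list[Document]
    return await loader.aload()


# asyncio.run() creates an event loop and runs the coroutine to completion
docs = asyncio.run(load_documents())
print(len(docs))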

load_and_split()

  • Loads Documents, automatically splits them into chunks using a TextSplitter, and returns them as a list[Document].

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Set up the TextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=128, chunk_overlap=0)

# Split Documents into chunks
docs = loader.load_and_split(text_splitter=text_splitter)
# Check the number of loaded Documents
len(docs)
1441
# Check Documents
docs[0:10]
[Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 0}, page_content='October  2016 \n \n \n \n \n \n \n \n \n \nTHE NATIONAL  \nARTIFICIAL INTELLIGENCE \nRESEARCH AND DEVELOPMENT \nSTRATEGIC PLAN'),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 0}, page_content='National Science and Technology Council  \n \nNetworking and Information Technology \nResearch and Development Subcommittee'),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 1}, page_content='ii'),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 2}, page_content='iii About the National Science and Technology Council'),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 2}, page_content='The National Science and Technology Council (NSTC) is the principal means by which the Executive'),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 2}, page_content='Branch coordinates science and technology policy across the diverse entities that make up the Federal'),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 2}, page_content='research and development (R&D) enterprise . One of the NSTC’s primary objectives is establishing clear'),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 2}, page_content='national goal s for Federal science and technology investments . The NSTC prepares R&D packages aimed'),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 2}, page_content='at accomplishing multiple national goals . The NSTC’s work is organized under five committees:'),
     Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 2}, page_content='Environment, Natural Resources, and Sustainability; Homeland and National S ecurity; Science,')]

lazy_load()

  • Loads Documents sequentially and returns them as an Iterator[Document].

loader.lazy_load()

This method returns a generator, a special type of iterator that produces values on the fly rather than storing them all in memory at once.

# Load Documents sequentially
docs = loader.lazy_load()
for doc in docs:
    print(doc.metadata)
    break  # Used to limit the output length
{'source': './data/01-document-loader-sample.pdf', 'page': 0}
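
Because lazy_load() yields one Document at a time, it is useful when a file is too large to hold in memory comfortably. As a small sketch, the loop below streams the same sample PDF and keeps only the page numbers whose text mentions "Strategy":

# Stream pages one at a time; only the matching page numbers are kept in memory
matching_pages = []
for doc in loader.lazy_load():
    if "Strategy" in doc.page_content:
        matching_pages.append(doc.metadata["page"])

print(matching_pages)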

alazy_load()

  • Asynchronously loads Documents sequentially and returns them as an AsyncIterator[Document].

loader.alazy_load()

This method returns an async generator, a special type of asynchronous iterator that produces values on the fly rather than storing them all in memory at once.

# Load Documents asynchronously and sequentially
docs = loader.alazy_load()
async for doc in docs:
    print(doc.metadata)
    break  # Used to limit the output length
{'source': './data/01-document-loader-sample.pdf', 'page': 0}
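
If you do want every Document from the asynchronous iterator, an async comprehension collects them into a list. This is a small sketch that assumes it runs in an async context (for example, a notebook cell):

# Collect every Document produced by the async iterator into a list
docs = [doc async for doc in loader.alazy_load()]
len(docs)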
