Arxiv Loader

Overview

arXiv is an open-access archive for 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

For the full list of parameters, see the ArxivLoader API documentation.

To use the Arxiv document loader, you need to install the arxiv, PyMuPDF, and langchain-community integration packages.

PyMuPDF converts PDF files downloaded from arxiv.org into plain text.

Table of Contents

  • Overview
  • Environment Setup
  • Arxiv Loader Instantiation
  • Load
  • Lazy Load
  • Asynchronous Load
  • Use Summaries of Articles as Docs


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.

  • You can check out langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain-community",
        "arxiv",
        "pymupdf",
    ],
    verbose=False,
    upgrade=False,
)

Arxiv Loader Instantiation

You can create an ArxivLoader instance to load documents from arxiv.org.

Initialize it with a search query to find documents on arxiv.org. It supports all arguments of ArxivAPIWrapper.

from langchain_community.document_loaders import ArxivLoader

# Enter the research topic you want to search for in the query parameter
loader = ArxivLoader(
    query="Chain of thought",
    load_max_docs=2,  # max number of documents
    load_all_available_meta=True,  # load all available metadata
)
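
Since ArxivLoader forwards its keyword arguments to ArxivAPIWrapper, you can also pass, for example, doc_content_chars_max to limit how much text is kept per document. A minimal sketch (the query and the 1,000-character limit are illustrative):

loader_truncated = ArxivLoader(
    query="Chain of thought",
    load_max_docs=2,  # max number of documents
    doc_content_chars_max=1000,  # keep at most 1,000 characters of text per document
)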

Load

Use the load method to load documents from arxiv.org with an ArxivLoader instance.

# Print the first document's content and metadata
docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Contrastive Chain-of-Thought Prompting
    Yew Ken Chia∗1,
    Guizhen Chen∗1, 2
    Luu Anh Tuan2
    Soujanya Pori
    {'Published': '2023-11-15', 'Title': 'Contrastive Chain-of-Thought Prompting', 'Authors': 'Yew Ken Chia, Guizhen Chen, Luu Anh Tuan, Soujanya Poria, Lidong Bing', 'Summary': 'Despite the success of chain of thought in enhancing language model\nreasoning, the underlying process remains less well understood. Although\nlogically sound reasoning appears inherently crucial for chain of thought,\nprior studies surprisingly reveal minimal impact when using invalid\ndemonstrations instead. Furthermore, the conventional chain of thought does not\ninform language models on what mistakes to avoid, which potentially leads to\nmore errors. Hence, inspired by how humans can learn from both positive and\nnegative examples, we propose contrastive chain of thought to enhance language\nmodel reasoning. Compared to the conventional chain of thought, our approach\nprovides both valid and invalid reasoning demonstrations, to guide the model to\nreason step-by-step while reducing reasoning mistakes. To improve\ngeneralization, we introduce an automatic method to construct contrastive\ndemonstrations. Our experiments on reasoning benchmarks demonstrate that\ncontrastive chain of thought can serve as a general enhancement of\nchain-of-thought prompting.', 'entry_id': 'http://arxiv.org/abs/2311.09277v1', 'published_first_time': '2023-11-15', 'comment': None, 'journal_ref': None, 'doi': None, 'primary_category': 'cs.CL', 'categories': ['cs.CL'], 'links': ['http://arxiv.org/abs/2311.09277v1', 'http://arxiv.org/pdf/2311.09277v1']}
  • If load_all_available_meta is False, only the default metadata fields (Published, Title, Authors, Summary) are included rather than the complete metadata shown above; see the sketch below.
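
For comparison, here is a minimal sketch of the default behavior; the expected key list reflects the default metadata fields and is worth verifying on your version:

loader_basic = ArxivLoader(
    query="Chain of thought",
    load_max_docs=2,  # load_all_available_meta defaults to False
)
docs_basic = loader_basic.load()
# with partial metadata, only the default fields are present
print(sorted(docs_basic[0].metadata.keys()))
# expected: ['Authors', 'Published', 'Summary', 'Title']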

Lazy Load

When loading a large number of documents, if your downstream tasks can operate on a subset of the loaded documents, you can lazy_load documents one at a time to minimize memory usage.

docs = []
docs_lazy = loader.lazy_load()

# append docs to the docs list
# async variant: async for doc in loader.alazy_load()

for doc in docs_lazy:
    docs.append(doc)

print(docs[0].page_content[:100])
print(docs[0].metadata)
Contrastive Chain-of-Thought Prompting
    Yew Ken Chia∗1,
    Guizhen Chen∗1, 2
    Luu Anh Tuan2
    Soujanya Pori
    (metadata identical to the load output above)
len(docs)
3
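
A common pattern is to process each document as it is yielded instead of collecting them all first, so only one full text is held in memory at a time. A minimal sketch, with a character count standing in for real downstream work:

total_chars = 0
for doc in loader.lazy_load():
    total_chars += len(doc.page_content)  # process one document at a time
print(total_chars)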

Asynchronous Load

Use the aload method to load documents from arxiv.org asynchronously.

docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Contrastive Chain-of-Thought Prompting
    Yew Ken Chia∗1,
    Guizhen Chen∗1, 2
    Luu Anh Tuan2
    Soujanya Pori
    (metadata identical to the load output above)
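
Asynchronous loading pays off when several queries are fetched concurrently. A minimal sketch using asyncio.gather (the two queries are placeholders; top-level await works in a notebook, as in the cell above):

import asyncio

queries = ["Chain of thought", "Retrieval augmented generation"]
loaders = [ArxivLoader(query=q, load_max_docs=1) for q in queries]

# fetch both result sets concurrently
results = await asyncio.gather(*(ldr.aload() for ldr in loaders))
for docs in results:
    print(docs[0].metadata["Title"])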

Use Summaries of Articles as Docs

Use the get_summaries_as_docs method to fetch article summaries (abstracts) as Document objects without downloading the full PDFs.

from langchain_community.document_loaders import ArxivLoader

loader = ArxivLoader(
    query="reasoning"
)

docs = loader.get_summaries_as_docs()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Large language models (LLMs) have demonstrated impressive reasoning
    abilities, but they still strugg
    {'Entry ID': 'http://arxiv.org/abs/2410.13080v1', 'Published': datetime.date(2024, 10, 16), 'Title': 'Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models', 'Authors': 'Linhao Luo, Zicheng Zhao, Chen Gong, Gholamreza Haffari, Shirui Pan'}
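
One way to use this: screen the abstracts first and download full texts only for the matches. A minimal sketch; the keyword is illustrative, and parsing the arXiv ID out of Entry ID to use as a query is an assumption (ArxivAPIWrapper treats identifier-shaped queries as ID lookups):

summaries = loader.get_summaries_as_docs()

# screen abstracts for a keyword before downloading any full text
relevant = [doc for doc in summaries if "knowledge graph" in doc.page_content.lower()]

for doc in relevant:
    arxiv_id = doc.metadata["Entry ID"].split("/")[-1]  # e.g. '2410.13080v1'
    full_docs = ArxivLoader(query=arxiv_id, load_max_docs=1).load()
    print(full_docs[0].metadata["Title"])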
