Arxiv Loader
Author: Sunyoung Park (architectyou)
Peer Review : ppakyeah
Proofread : JaeJun Shim
This is a part of LangChain Open Tutorial
Overview
arXiv is an open access archive for 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
To use the Arxiv document loader, install the arxiv, PyMuPDF, and langchain-community integration packages.
PyMuPDF converts the PDF files downloaded from arxiv.org into text.
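Before running the rest of the tutorial, it can help to confirm that the three packages are importable. The following is a minimal sketch using only the standard library. The import names are assumptions: PyMuPDF is assumed to be importable as fitz and langchain-community as langchain_community. The helper name missing_packages is made up for this illustration.

```python
from importlib.util import find_spec

# Assumed mapping of pip package name -> top-level import name.
REQUIRED_MODULES = {
    "arxiv": "arxiv",
    "PyMuPDF": "fitz",
    "langchain-community": "langchain_community",
}


def missing_packages(required=None):
    """Return the pip package names whose import module cannot be found."""
    required = REQUIRED_MODULES if required is None else required
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported as missing, the installation cell in the Environment Setup section should cover it.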
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package
package.install(
    [
        "langchain-community",
        "arxiv",
        "pymupdf",
    ],
    verbose=False,
    upgrade=False,
)
Arxiv-Loader-Instantiate
Create an ArxivLoader instance to load documents from arxiv.org.
Initialize it with a search query to find documents on arxiv.org.
The loader supports all arguments of ArxivAPIWrapper.
from langchain_community.document_loaders import ArxivLoader

# Enter the research topic you want to search for in the query parameter
loader = ArxivLoader(
    query="Chain of thought",
    load_max_docs=2,  # max number of documents
    load_all_available_meta=True,  # load all available metadata
)
Load
Use the load method to load documents from arxiv.org with the ArxivLoader instance.
# Print the first document's content and metadata
docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Contrastive Chain-of-Thought Prompting
Yew Ken Chia∗1,
Guizhen Chen∗1, 2
Luu Anh Tuan2
Soujanya Pori
{'Published': '2023-11-15', 'Title': 'Contrastive Chain-of-Thought Prompting', 'Authors': 'Yew Ken Chia, Guizhen Chen, Luu Anh Tuan, Soujanya Poria, Lidong Bing', 'Summary': 'Despite the success of chain of thought in enhancing language model\nreasoning, the underlying process remains less well understood. Although\nlogically sound reasoning appears inherently crucial for chain of thought,\nprior studies surprisingly reveal minimal impact when using invalid\ndemonstrations instead. Furthermore, the conventional chain of thought does not\ninform language models on what mistakes to avoid, which potentially leads to\nmore errors. Hence, inspired by how humans can learn from both positive and\nnegative examples, we propose contrastive chain of thought to enhance language\nmodel reasoning. Compared to the conventional chain of thought, our approach\nprovides both valid and invalid reasoning demonstrations, to guide the model to\nreason step-by-step while reducing reasoning mistakes. To improve\ngeneralization, we introduce an automatic method to construct contrastive\ndemonstrations. Our experiments on reasoning benchmarks demonstrate that\ncontrastive chain of thought can serve as a general enhancement of\nchain-of-thought prompting.', 'entry_id': 'http://arxiv.org/abs/2311.09277v1', 'published_first_time': '2023-11-15', 'comment': None, 'journal_ref': None, 'doi': None, 'primary_category': 'cs.CL', 'categories': ['cs.CL'], 'links': ['http://arxiv.org/abs/2311.09277v1', 'http://arxiv.org/pdf/2311.09277v1']}
If load_all_available_meta is False, only a subset of the metadata (such as Published, Title, Authors, and Summary) is returned instead of the complete set.
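When the full metadata is loaded, fields such as entry_id and links can be used to recover the arXiv identifier and the PDF URL. The sketch below works on a plain dict copied from the output above, so it needs no network access. The helper name arxiv_id_and_pdf is hypothetical, and it assumes the entry_id and links keys keep the shape shown in that output.

```python
def arxiv_id_and_pdf(metadata):
    """Return (arxiv_id, pdf_url); each part is None when its key is missing."""
    entry_id = metadata.get("entry_id")
    # Keep everything after "/abs/" so old-style IDs with a slash survive.
    arxiv_id = entry_id.split("/abs/", 1)[-1] if entry_id else None
    pdf_url = next(
        (link for link in metadata.get("links", []) if "/pdf/" in link),
        None,
    )
    return arxiv_id, pdf_url


# Trimmed copy of the metadata printed above
sample = {
    "Title": "Contrastive Chain-of-Thought Prompting",
    "entry_id": "http://arxiv.org/abs/2311.09277v1",
    "links": [
        "http://arxiv.org/abs/2311.09277v1",
        "http://arxiv.org/pdf/2311.09277v1",
    ],
}

print(arxiv_id_and_pdf(sample))
# ('2311.09277v1', 'http://arxiv.org/pdf/2311.09277v1')
```

With partial metadata, where these keys are absent, the helper returns (None, None) rather than raising.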
Lazy Load
When loading a large number of documents, if downstream tasks can run on a subset of them, you can use lazy_load to load documents one at a time and minimize memory usage.
docs = []
docs_lazy = loader.lazy_load()

# Append each document to the docs list as it is yielded
# async variant: async for doc in loader.alazy_load(): ...
for doc in docs_lazy:
    docs.append(doc)

print(docs[0].page_content[:100])
print(docs[0].metadata)
Contrastive Chain-of-Thought Prompting
Yew Ken Chia∗1,
Guizhen Chen∗1, 2
Luu Anh Tuan2
Soujanya Pori
{'Published': '2023-11-15', 'Title': 'Contrastive Chain-of-Thought Prompting', 'Authors': 'Yew Ken Chia, Guizhen Chen, Luu Anh Tuan, Soujanya Poria, Lidong Bing', 'Summary': 'Despite the success of chain of thought in enhancing language model\nreasoning, the underlying process remains less well understood. Although\nlogically sound reasoning appears inherently crucial for chain of thought,\nprior studies surprisingly reveal minimal impact when using invalid\ndemonstrations instead. Furthermore, the conventional chain of thought does not\ninform language models on what mistakes to avoid, which potentially leads to\nmore errors. Hence, inspired by how humans can learn from both positive and\nnegative examples, we propose contrastive chain of thought to enhance language\nmodel reasoning. Compared to the conventional chain of thought, our approach\nprovides both valid and invalid reasoning demonstrations, to guide the model to\nreason step-by-step while reducing reasoning mistakes. To improve\ngeneralization, we introduce an automatic method to construct contrastive\ndemonstrations. Our experiments on reasoning benchmarks demonstrate that\ncontrastive chain of thought can serve as a general enhancement of\nchain-of-thought prompting.', 'entry_id': 'http://arxiv.org/abs/2311.09277v1', 'published_first_time': '2023-11-15', 'comment': None, 'journal_ref': None, 'doi': None, 'primary_category': 'cs.CL', 'categories': ['cs.CL'], 'links': ['http://arxiv.org/abs/2311.09277v1', 'http://arxiv.org/pdf/2311.09277v1']}
len(docs)
3
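The memory benefit of lazy loading comes from stopping early, since an iterator only produces the items that are actually requested. The sketch below illustrates this with itertools.islice. The generator fake_lazy_load is a hypothetical stand-in for loader.lazy_load() so the example runs without network access, and take is an illustrative helper, not part of the loader.

```python
from itertools import islice


def take(doc_iter, n):
    """Consume at most n items from an iterator and return them as a list."""
    return list(islice(doc_iter, n))


def fake_lazy_load(total, produced):
    """Hypothetical stand-in for loader.lazy_load(); records each item it yields."""
    for i in range(total):
        produced.append(i)
        yield f"doc-{i}"


produced = []
first_two = take(fake_lazy_load(100, produced), 2)

print(first_two)      # ['doc-0', 'doc-1']
print(len(produced))  # 2
```

Only two of the hundred items are ever produced. Under the same assumption, take(loader.lazy_load(), 2) would stop iterating after the first two documents.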
Asynchronous Load
Use the aload method to load documents from arxiv.org asynchronously.
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Contrastive Chain-of-Thought Prompting
Yew Ken Chia∗1,
Guizhen Chen∗1, 2
Luu Anh Tuan2
Soujanya Pori
{'Published': '2023-11-15', 'Title': 'Contrastive Chain-of-Thought Prompting', 'Authors': 'Yew Ken Chia, Guizhen Chen, Luu Anh Tuan, Soujanya Poria, Lidong Bing', 'Summary': 'Despite the success of chain of thought in enhancing language model\nreasoning, the underlying process remains less well understood. Although\nlogically sound reasoning appears inherently crucial for chain of thought,\nprior studies surprisingly reveal minimal impact when using invalid\ndemonstrations instead. Furthermore, the conventional chain of thought does not\ninform language models on what mistakes to avoid, which potentially leads to\nmore errors. Hence, inspired by how humans can learn from both positive and\nnegative examples, we propose contrastive chain of thought to enhance language\nmodel reasoning. Compared to the conventional chain of thought, our approach\nprovides both valid and invalid reasoning demonstrations, to guide the model to\nreason step-by-step while reducing reasoning mistakes. To improve\ngeneralization, we introduce an automatic method to construct contrastive\ndemonstrations. Our experiments on reasoning benchmarks demonstrate that\ncontrastive chain of thought can serve as a general enhancement of\nchain-of-thought prompting.', 'entry_id': 'http://arxiv.org/abs/2311.09277v1', 'published_first_time': '2023-11-15', 'comment': None, 'journal_ref': None, 'doi': None, 'primary_category': 'cs.CL', 'categories': ['cs.CL'], 'links': ['http://arxiv.org/abs/2311.09277v1', 'http://arxiv.org/pdf/2311.09277v1']}
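A bare await like the one above works in a notebook cell, but in a regular Python script it must sit inside a coroutine that is started with asyncio.run. The sketch below shows that pattern. FakeLoader is a hypothetical stand-in that only mimics the aload method name so the example runs offline. In real use you would pass the ArxivLoader instance instead.

```python
import asyncio


class FakeLoader:
    """Hypothetical stand-in exposing the same async method name as the loader."""

    async def aload(self):
        await asyncio.sleep(0)  # placeholder for the network and PDF work
        return ["doc-0", "doc-1"]


async def main(loader):
    """Await the loader inside a coroutine, as a script requires."""
    return await loader.aload()


docs = asyncio.run(main(FakeLoader()))
print(len(docs))  # 2
```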
Use Summaries of Articles as Docs
Use the get_summaries_as_docs method to load only the article summaries as documents.
from langchain_community.document_loaders import ArxivLoader
loader = ArxivLoader(
    query="reasoning"
)
docs = loader.get_summaries_as_docs()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Large language models (LLMs) have demonstrated impressive reasoning
abilities, but they still strugg
{'Entry ID': 'http://arxiv.org/abs/2410.13080v1', 'Published': datetime.date(2024, 10, 16), 'Title': 'Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models', 'Authors': 'Linhao Luo, Zicheng Zhao, Chen Gong, Gholamreza Haffari, Shirui Pan'}
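Summary documents are convenient for building a quick reading list. The sketch below formats one entry from a metadata dict and a summary string. It assumes the metadata keys shown in the output above (Title, Authors, Published), and format_summary is a hypothetical helper name. The sample values are copied from that output.

```python
import datetime


def format_summary(metadata, summary, width=80):
    """Build a three-line text entry from a summary document's metadata."""
    published = metadata.get("Published")
    if isinstance(published, datetime.date):
        published = published.isoformat()
    # Collapse line breaks in the abstract and keep only the first `width` chars.
    first_line = " ".join(summary.split())[:width]
    return (
        f"{metadata.get('Title')} ({published})\n"
        f"  {metadata.get('Authors')}\n"
        f"  {first_line}"
    )


# Values copied from the output above
sample_metadata = {
    "Entry ID": "http://arxiv.org/abs/2410.13080v1",
    "Published": datetime.date(2024, 10, 16),
    "Title": "Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models",
    "Authors": "Linhao Luo, Zicheng Zhao, Chen Gong, Gholamreza Haffari, Shirui Pan",
}
sample_summary = (
    "Large language models (LLMs) have demonstrated impressive reasoning\n"
    "abilities, but they still strugg"
)

listing = format_summary(sample_metadata, sample_summary)
print(listing)
```

With real documents, the same helper could be called as format_summary(doc.metadata, doc.page_content) for each doc returned by get_summaries_as_docs.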