The UpstageDocumentParseLoader is a robust document analysis tool designed by Upstage that seamlessly integrates with the LangChain framework as a document loader. It specializes in transforming documents into structured HTML by analyzing their layout and content.
Key Features:
Comprehensive Layout Analysis: Analyzes and identifies structural elements such as headings, paragraphs, tables, and images across various document formats (e.g., PDFs, images).
Automated Structural Recognition: Automatically detects and serializes document elements in reading order for accurate conversion to HTML.
Optional OCR Support: Includes optical character recognition for handling scanned or image-based documents. The OCR mode supports:
force: Extracts text from images using OCR.
auto: Extracts text from PDFs (throws an error if the input is not in PDF format).
By recognizing and preserving the relationships between document elements, the UpstageDocumentParseLoader enables precise and context-aware document analysis.
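As a minimal sketch of the two OCR modes, the snippet below constructs one loader per mode. It assumes the UPSTAGE_API_KEY environment variable is set, and the file names (report.pdf, scanned_report.png) are placeholders, not files shipped with this tutorial:

```python
from langchain_upstage import UpstageDocumentParseLoader

# "auto": extract text directly from a PDF (raises an error for non-PDF input).
pdf_loader = UpstageDocumentParseLoader("report.pdf", ocr="auto")

# "force": run OCR, e.g. for a scanned or image-based document.
image_loader = UpstageDocumentParseLoader("scanned_report.png", ocr="force")

docs = image_loader.load()
print(docs[0].page_content[:200])
```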
Migration from Layout Analysis : Upstage has launched Document Parse to replace Layout Analysis! Document Parse now supports a wider range of document types, markdown output, chart detection, equation recognition, and additional features planned for upcoming releases. The last version of Layout Analysis, layout-analysis-0.4.0, will be officially discontinued by November 10, 2024.
use_ocr → ocr
The use_ocr option has been replaced with ocr. Instead of True/False, it now accepts force or auto for more precise control.
output_type → output_format
The output_type option has been renamed to output_format and specifies the format of the output.
exclude → base64_encoding
The exclude option has been replaced with base64_encoding. While exclude was used to exclude specific elements from the output, base64_encoding specifies whether to encode elements of certain categories in Base64.
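To make the renamed options concrete, here is a hedged before/after sketch. The legacy call is shown only in comments for comparison, the file name is a placeholder, and note that base64_encoding is not a drop-in semantic replacement for exclude (it selects categories to encode rather than elements to drop):

```python
from langchain_upstage import UpstageDocumentParseLoader

# Before (Layout Analysis era, deprecated), roughly:
#   loader = UpstageLayoutAnalysisLoader(
#       "sample.pdf", use_ocr=True, output_type="html", exclude=["chart"]
#   )

# After (Document Parse), with the renamed options:
loader = UpstageDocumentParseLoader(
    "sample.pdf",                # placeholder path
    ocr="force",                 # was use_ocr=True/False; now "force" or "auto"
    output_format="html",        # was output_type
    base64_encoding=["chart"],   # was exclude; now lists categories to Base64-encode
)
```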
Download the PDF file from the link below, create a data folder in the current directory, and save the file into that folder; the following code performs these steps automatically.
```python
# Download and save sample PDF file to ./data directory
import os

import requests


def download_pdf(url, save_path):
    """
    Downloads a PDF file from the given URL and saves it to the specified path.

    Args:
        url (str): The URL of the PDF file to download.
        save_path (str): The full path (including file name) where the file will be saved.
    """
    try:
        # Ensure the directory exists
        os.makedirs(os.path.dirname(save_path), exist_ok=True)

        # Download the file
        response = requests.get(url, stream=True)
        response.raise_for_status()  # Raise an error for bad status codes

        # Save the file to the specified path
        with open(save_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

        print(f"PDF downloaded and saved to: {save_path}")
    except Exception as e:
        print(f"An error occurred while downloading the file: {e}")


# Configuration for the PDF file
pdf_url = "https://arxiv.org/pdf/2407.21059"
file_path = "./data/2407.21059.pdf"

# Download the PDF
download_pdf(pdf_url, file_path)
```
PDF downloaded and saved to: ./data/2407.21059.pdf
```python
# Set file path
FILE_PATH = "data/2407.21059.pdf"  # modify to your file path
```
```python
from langchain_upstage import UpstageDocumentParseLoader

# Configure the document loader
loader = UpstageDocumentParseLoader(
    FILE_PATH,
    output_format="html",
    split="page",
    ocr="auto",
    coordinates=True,
    base64_encoding=["chart"],
)

# Load the document
docs = loader.load()

# Print the results
for doc in docs[:2]:
    print(doc)
```
page_content='1 Modular RAG: Transforming RAG Systems into
LEGO-like Reconfigurable Frameworks
Yunfan Gao, Yun Xiong, Meng Wang, Haofen Wang Abstract—Retrieval-augmented Generation (RAG) has
markedly enhanced the capabilities of Large Language Models
(LLMs) in tackling knowledge-intensive tasks. The increasing
demands of application scenarios have driven the evolution
of RAG, leading to the integration of advanced retrievers,
LLMs and other complementary technologies, which in turn
has amplified the intricacy of RAG systems. However, the rapid
advancements are outpacing the foundational RAG paradigm,
with many methods struggling to be unified under the process
of “retrieve-then-generate”. In this context, this paper examines
the limitations of the existing RAG paradigm and introduces
the modular RAG framework. By decomposing complex RAG
systems into independent modules and specialized operators, it
facilitates a highly reconfigurable framework. Modular RAG
transcends the traditional linear architecture, embracing a
more advanced design that integrates routing, scheduling, and
fusion mechanisms. Drawing on extensive research, this paper
further identifies prevalent RAG patterns—linear, conditional,
branching, and looping—and offers a comprehensive analysis
of their respective implementation nuances. Modular RAG
presents innovative opportunities for the conceptualization
and deployment of RAG systems. Finally, the paper explores
the potential emergence of new operators and paradigms,
establishing a solid theoretical foundation and a practical
roadmap for the continued evolution and practical deployment
of RAG technologies.
Index Terms—Retrieval-augmented generation, large language
model, modular system, information retrieval I. INTRODUCTION
2024
Jul
26
[cs.CL]
arXiv:2407.21059v1
L remarkable capabilities, yet they still face numerous
ARGE Language Models (LLMs) have demonstrated
challenges, such as hallucination and the lag in information up-
dates [1]. Retrieval-augmented Generation (RAG), by access-
ing external knowledge bases, provides LLMs with important
contextual information, significantly enhancing their perfor-
mance on knowledge-intensive tasks [2]. Currently, RAG, as
an enhancement method, has been widely applied in various
practical application scenarios, including knowledge question
answering, recommendation systems, customer service, and
personal assistants. [3]–[6]
During the nascent stages of RAG , its core framework is
constituted by indexing, retrieval, and generation, a paradigm
referred to as Naive RAG [7]. However, as the complexity
of tasks and the demands of applications have escalated, the Yunfan Gao is with Shanghai Research Institute for Intelligent Autonomous
Systems, Tongji University, Shanghai, 201210, China.
Yun Xiong is with Shanghai Key Laboratory of Data Science, School of
Computer Science, Fudan University, Shanghai, 200438, China.
Meng Wang and Haofen Wang are with College of Design and Innovation,
Tongji University, Shanghai, 20092, China. (Corresponding author: Haofen
Wang. E-mail: carter.whfcarter@gmail.com)
limitations of Naive RAG have become increasingly apparent.
As depicted in Figure 1, it predominantly hinges on the
straightforward similarity of chunks, result in poor perfor-
mance when confronted with complex queries and chunks with
substantial variability. The primary challenges of Naive RAG
include: 1) Shallow Understanding of Queries. The semantic
similarity between a query and document chunk is not always
highly consistent. Relying solely on similarity calculations
for retrieval lacks an in-depth exploration of the relationship
between the query and the document [8]. 2) Retrieval Re-
dundancy and Noise. Feeding all retrieved chunks directly
into LLMs is not always beneficial. Research indicates that
an excess of redundant and noisy information may interfere
with the LLM’s identification of key information, thereby
increasing the risk of generating erroneous and hallucinated
responses. [9]
To overcome the aforementioned limitations, Advanced
RAG paradigm focuses on optimizing the retrieval phase,
aiming to enhance retrieval efficiency and strengthen the
utilization of retrieved chunks. As shown in Figure 1 ,typical
strategies involve pre-retrieval processing and post-retrieval
processing. For instance, query rewriting is used to make
the queries more clear and specific, thereby increasing the
accuracy of retrieval [10], and the reranking of retrieval results
is employed to enhance the LLM’s ability to identify and
utilize key information [11].
Despite the improvements in the practicality of Advanced
RAG, there remains a gap between its capabilities and real-
world application requirements. On one hand, as RAG tech-
nology advances, user expectations rise, demands continue to
evolve, and application settings become more complex. For
instance, the integration of heterogeneous data and the new
demands for system transparency, control, and maintainability.
On the other hand, the growth in application demands has
further propelled the evolution of RAG technology.
As shown in Figure 2, to achieve more accurate and efficient
task execution, modern RAG systems are progressively inte-
grating more sophisticated function, such as organizing more
refined index base in the form of knowledge graphs, integrat-
ing structured data through query construction methods, and
employing fine-tuning techniques to enable encoders to better
adapt to domain-specific documents.
In terms of process design, the current RAG system has
surpassed the traditional linear retrieval-generation paradigm.
Researchers use iterative retrieval [12] to obtain richer con-
text, recursive retrieval [13] to handle complex queries, and
adaptive retrieval [14] to provide overall autonomy and flex-
ibility. This flexibility in the process significantly enhances' metadata={'page': 1, 'base64_encodings': [], 'coordinates': [[{'x': 0.9137, 'y': 0.0321}, {'x': 0.9206, 'y': 0.0321}, {'x': 0.9206, 'y': 0.0418}, {'x': 0.9137, 'y': 0.0418}], [{'x': 0.1037, 'y': 0.0715}, {'x': 0.8961, 'y': 0.0715}, {'x': 0.8961, 'y': 0.1385}, {'x': 0.1037, 'y': 0.1385}], [{'x': 0.301, 'y': 0.149}, {'x': 0.6988, 'y': 0.149}, {'x': 0.6988, 'y': 0.1673}, {'x': 0.301, 'y': 0.1673}], [{'x': 0.0785, 'y': 0.2203}, {'x': 0.4943, 'y': 0.2203}, {'x': 0.4943, 'y': 0.5498}, {'x': 0.0785, 'y': 0.5498}], [{'x': 0.0785, 'y': 0.5566}, {'x': 0.4926, 'y': 0.5566}, {'x': 0.4926, 'y': 0.5837}, {'x': 0.0785, 'y': 0.5837}], [{'x': 0.2176, 'y': 0.6044}, {'x': 0.3518, 'y': 0.6044}, {'x': 0.3518, 'y': 0.6205}, {'x': 0.2176, 'y': 0.6205}], [{'x': 0.0254, 'y': 0.2747}, {'x': 0.0612, 'y': 0.2747}, {'x': 0.0612, 'y': 0.7086}, {'x': 0.0254, 'y': 0.7086}], [{'x': 0.0764, 'y': 0.625}, {'x': 0.4947, 'y': 0.625}, {'x': 0.4947, 'y': 0.7904}, {'x': 0.0764, 'y': 0.7904}], [{'x': 0.0774, 'y': 0.7923}, {'x': 0.4942, 'y': 0.7923}, {'x': 0.4942, 'y': 0.8539}, {'x': 0.0774, 'y': 0.8539}], [{'x': 0.0773, 'y': 0.8701}, {'x': 0.4946, 'y': 0.8701}, {'x': 0.4946, 'y': 0.9447}, {'x': 0.0773, 'y': 0.9447}], [{'x': 0.5068, 'y': 0.221}, {'x': 0.9234, 'y': 0.221}, {'x': 0.9234, 'y': 0.4605}, {'x': 0.5068, 'y': 0.4605}], [{'x': 0.5074, 'y': 0.4636}, {'x': 0.9243, 'y': 0.4636}, {'x': 0.9243, 'y': 0.6131}, {'x': 0.5074, 'y': 0.6131}], [{'x': 0.5067, 'y': 0.6145}, {'x': 0.9234, 'y': 0.6145}, {'x': 0.9234, 'y': 0.7483}, {'x': 0.5067, 'y': 0.7483}], [{'x': 0.5071, 'y': 0.7504}, {'x': 0.9236, 'y': 0.7504}, {'x': 0.9236, 'y': 0.8538}, {'x': 0.5071, 'y': 0.8538}], [{'x': 0.5073, 'y': 0.8553}, {'x': 0.9247, 'y': 0.8553}, {'x': 0.9247, 'y': 0.9466}, {'x': 0.5073, 'y': 0.9466}]]}
page_content='2
the expressive power and adaptability of RAG systems, en-
abling them to better adapt to various application scenarios.
However, this also makes the orchestration and scheduling of
workflows more complex, posing greater challenges to system
design. Specifically, RAG currently faces the following new
challenges:
Complex data sources integration. RAG are no longer
confined to a single type of unstructured text data source but
have expanded to include various data types, such as semi-
structured data like tables and structured data like knowledge
graphs [15]. Access to heterogeneous data from multiple
sources can provide the system with a richer knowledge
background, and more reliable knowledge verification capa-
bilities [16].
Fig. 2. Case of current Modular RAG.The system integrates diverse data
and more functional components. The process is no longer confined to linear
but is controlled by multiple control components for retrieval and generation,
making the entire system more flexible and complex.
New demands for system interpretability, controllability,' metadata={'page': 2, 'base64_encodings': [], 'coordinates': [[{'x': 0.9113, 'y': 0.0313}, {'x': 0.9227, 'y': 0.0313}, {'x': 0.9227, 'y': 0.0417}, {'x': 0.9113, 'y': 0.0417}], [{'x': 0.0777, 'y': 0.0649}, {'x': 0.9163, 'y': 0.0649}, {'x': 0.9163, 'y': 0.7149}, {'x': 0.0777, 'y': 0.7149}], [{'x': 0.0775, 'y': 0.7174}, {'x': 0.4936, 'y': 0.7174}, {'x': 0.4936, 'y': 0.8062}, {'x': 0.0775, 'y': 0.8062}], [{'x': 0.0762, 'y': 0.8089}, {'x': 0.4923, 'y': 0.8089}, {'x': 0.4923, 'y': 0.9293}, {'x': 0.0762, 'y': 0.9293}], [{'x': 0.5077, 'y': 0.8791}, {'x': 0.9248, 'y': 0.8791}, {'x': 0.9248, 'y': 0.926}, {'x': 0.5077, 'y': 0.926}], [{'x': 0.0911, 'y': 0.9311}, {'x': 0.4948, 'y': 0.9311}, {'x': 0.4948, 'y': 0.9462}, {'x': 0.0911, 'y': 0.9462}]]}
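Each loaded Document carries the parsed content in page_content and the page number, element coordinates, and any Base64-encoded elements in metadata, as shown in the output above. A minimal sketch of inspecting these fields, continuing from the `docs` variable loaded earlier (the metadata keys follow the printed output):

```python
# Inspect the structure of the loaded documents
for doc in docs[:2]:
    meta = doc.metadata
    print(f"Page {meta['page']}:")
    print(f"  elements with coordinates : {len(meta.get('coordinates', []))}")
    print(f"  base64-encoded elements   : {len(meta.get('base64_encodings', []))}")
    print(f"  content preview           : {doc.page_content[:80]}...")
```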