MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval
This repository contains the MoLoRAG model, a logic-aware retrieval framework for multi-modal, multi-page document understanding, as presented in the paper MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval.
MoLoRAG approaches Document Question Answering (DocQA) by constructing a page graph that captures contextual and logical relationships between pages. A lightweight vision-language model (VLM) traverses the graph to retrieve relevant pages, combining semantic and logical relevance for more accurate retrieval. The top-K retrieved pages are then passed to any Large Vision-Language Model (LVLM) for question answering. The framework offers both a training-free version for easy deployment and a fine-tuned version for stronger logical relevance checking.
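The retrieval idea described above can be illustrated with a small, self-contained sketch. This is not the repository's implementation: the function name `retrieve_pages`, the `seeds` and `max_hops` parameters, the equal-weight averaging of the two scores, and the `logical_score` callable (a stand-in for the lightweight VLM's relevance judgement) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def retrieve_pages(
    graph: Dict[int, List[int]],           # page graph: page -> neighbouring pages
    semantic: Dict[int, float],            # semantic relevance of each page to the question
    logical_score: Callable[[int], float], # hypothetical stand-in for the VLM's logical check
    top_k: int = 3,
    seeds: int = 2,
    max_hops: int = 2,
) -> List[int]:
    """Traverse the page graph and return the top_k pages by combined score."""
    # Start from the pages with the highest semantic relevance.
    frontier = sorted(semantic, key=semantic.get, reverse=True)[:seeds]
    combined: Dict[int, float] = {}

    for _ in range(max_hops + 1):
        next_frontier: List[int] = []
        for page in frontier:
            if page in combined:
                continue  # already scored on an earlier visit
            # Assumption: combine the two signals with a plain average.
            combined[page] = 0.5 * (semantic[page] + logical_score(page))
            # Expand to neighbours that have not been scored yet.
            next_frontier.extend(n for n in graph.get(page, []) if n not in combined)
        frontier = next_frontier
        if not frontier:
            break

    return sorted(combined, key=combined.get, reverse=True)[:top_k]


if __name__ == "__main__":
    # Toy four-page document laid out as a chain: 0 - 1 - 2 - 3.
    toy_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    toy_semantic = {0: 0.9, 1: 0.2, 2: 0.1, 3: 0.8}
    toy_logical = {0: 0.3, 1: 0.9, 2: 0.2, 3: 0.7}
    print(retrieve_pages(toy_graph, toy_semantic, toy_logical.get))
```

In the toy data, page 1 has low semantic similarity but is reached through its neighbour, page 0, and ranks in the top three because of its high logical score. Pages like this are the ones that retrieval on semantic similarity alone would miss.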
For more details, please refer to the official GitHub repository.