MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval

This repository contains the MoLoRAG model, a logic-aware retrieval framework for multi-modal, multi-page document understanding, as presented in the paper MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval.

MoLoRAG introduces a novel approach to Document Question Answering (DocQA) by constructing a page graph that captures contextual and logical relationships between pages. A lightweight VLM traverses this graph to retrieve relevant pages, combining semantic and logical relevance for more accurate retrieval. The top-K retrieved pages are then fed into an arbitrary Large Vision-Language Model (LVLM) for question answering, as sketched below. The framework offers both a training-free solution for easy deployment and a fine-tuned version for enhanced logical relevance checking.
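
The following is a minimal sketch of this retrieval flow, not the official implementation: all names (`Page`, `semantic_score`, `logical_score`, `retrieve_pages`) are hypothetical placeholders, and the scoring functions are stubs standing in for an embedding retriever and a lightweight VLM relevance judge.

```python
from dataclasses import dataclass, field
import heapq


@dataclass
class Page:
    page_id: int
    neighbors: list[int] = field(default_factory=list)  # edges in the page graph


def semantic_score(question: str, page: Page) -> float:
    """Placeholder: embedding similarity between the question and the page."""
    return 0.5  # stub; a real system would use a multi-modal retriever


def logical_score(question: str, page: Page) -> float:
    """Placeholder: a lightweight VLM judging the page's logical relevance."""
    return 0.5  # stub; a real system would prompt a small VLM


def retrieve_pages(question: str, pages: dict[int, Page],
                   seeds: list[int], k: int = 5) -> list[int]:
    """Traverse the page graph from seed pages, expanding neighbors and
    ranking visited pages by combined semantic + logical relevance."""
    scores: dict[int, float] = {}
    frontier = list(seeds)
    visited: set[int] = set()
    while frontier:
        pid = frontier.pop()
        if pid in visited:
            continue
        visited.add(pid)
        page = pages[pid]
        scores[pid] = semantic_score(question, page) + logical_score(question, page)
        frontier.extend(page.neighbors)  # follow contextual/logical edges
    return heapq.nlargest(k, scores, key=scores.get)


if __name__ == "__main__":
    # Toy 4-page document with a simple chain of page-graph edges.
    pages = {i: Page(i, neighbors=[i + 1] if i < 3 else []) for i in range(4)}
    top_k = retrieve_pages("What is reported on the summary page?", pages, seeds=[0], k=2)
    print("Pages to pass to the LVLM:", top_k)
```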

For more details, please refer to the official GitHub repository.

Model details: Safetensors checkpoint, 4B parameters, BF16 tensor type.
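
As a hedged loading sketch only: assuming the checkpoint is hosted on the Hugging Face Hub, it could be loaded with `transformers` in BF16. The repo id below is a placeholder, and the appropriate Auto class depends on the underlying VLM architecture.

```python
import torch
from transformers import AutoModel, AutoProcessor

repo_id = "<org>/MoLoRAG-4B"  # placeholder repo id, not the real one

# Load the BF16 Safetensors weights and the matching processor.
model = AutoModel.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(repo_id)
```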