MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval

This repository contains the MoLoRAG model, a logic-aware retrieval framework for multi-modal, multi-page document understanding, presented in the paper "MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval".

MoLoRAG approaches Document Question Answering (DocQA) by constructing a page graph that captures contextual and logical relationships between pages. A lightweight vision-language model (VLM) traverses this graph to retrieve relevant pages, combining semantic and logical relevance for more accurate retrieval. The top-K retrieved pages are then passed to any Large Vision-Language Model (LVLM) for question answering. The framework is available both as a training-free solution for easy deployment and as a fine-tuned version for stronger logical relevance checking.
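The retrieval flow described above can be illustrated with a minimal sketch. This is not the MoLoRAG implementation: the function name `retrieve_pages`, the beam-style traversal, the equal-weight average of the two scores, and the toy data are all assumptions made for illustration. The `logical_score` callable stands in for the lightweight VLM that would judge a page against the question.

```python
from typing import Callable, Dict, List


def retrieve_pages(
    graph: Dict[int, List[int]],           # page graph: page id -> neighbouring page ids
    semantic: Dict[int, float],            # precomputed semantic relevance per page
    logical_score: Callable[[int], float], # hypothetical stand-in for the VLM judgement
    top_k: int = 2,
    beam_width: int = 1,
    max_hops: int = 2,
) -> List[int]:
    """Hypothetical sketch: traverse the page graph and rank the visited pages."""
    # Seed the traversal with the pages that are most semantically relevant.
    frontier = sorted(semantic, key=semantic.get, reverse=True)[:beam_width]
    scores: Dict[int, float] = {}

    for _ in range(max_hops + 1):
        newly_scored = []
        for page in frontier:
            if page in scores:
                continue
            # Assumed combination rule: plain average of the two relevance scores.
            scores[page] = 0.5 * (semantic[page] + logical_score(page))
            newly_scored.append(page)

        # Expand only the neighbours of the best newly scored pages.
        newly_scored.sort(key=scores.get, reverse=True)
        frontier = [
            neighbour
            for page in newly_scored[:beam_width]
            for neighbour in graph.get(page, [])
            if neighbour not in scores
        ]
        if not frontier:
            break

    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data (made up): page 1 has low semantic relevance but is logically
# relevant and is linked to page 0 in the graph.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
semantic = {0: 0.9, 1: 0.1, 2: 0.3, 3: 0.2}
logic = {0: 0.8, 1: 0.9, 2: 0.0, 3: 0.0}

pages = retrieve_pages(graph, semantic, logic.get)
print(pages)  # [0, 1]
```

In this toy case, ranking by semantic score alone would return pages 0 and 2, while the traversal reaches page 1 through its link to page 0 and ranks it higher once the logical score is taken into account. The returned pages are what would then be handed to the LVLM together with the question.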

For more details, please refer to the official GitHub repository.

Model size: 4B params. Tensor type: BF16. Weights format: Safetensors.