---
tags:
- multimodal
- multilingual
- pdf
- embeddings
- rag
- google-cloud
- vertex-ai
- gemma
- python
datasets:
- no_dataset
license: mit
---

# Multimodal & Multilingual PDF Embedding Pipeline with Gemma and Vertex AI

This repository hosts a Python pipeline that extracts text, tables, and images from PDF documents, generates multimodal descriptions for visual content (tables and images) using **Google's Gemma model (running locally)**, and then creates multilingual text embeddings for all extracted information using **Google Cloud Vertex AI's `text-multilingual-embedding-002` model**. The generated embeddings are stored in a JSON file, ready for use in Retrieval Augmented Generation (RAG) systems or other downstream applications.

**Key Features:**

- **Multimodal Descriptions (via Gemma):** Processes tables and images from PDFs, generating rich descriptive text in French using the open-weight Gemma 3 4B-IT model, which runs locally on your machine or on a Colab GPU.
- **Multilingual Text Embeddings (via Vertex AI):** Leverages Google Cloud's `text-multilingual-embedding-002` model, which supports a wide range of languages.
- **Structured Output:** Stores embeddings and metadata (PDF source, page number, content type, links to extracted assets) in a comprehensive JSON format.

## How it Works

1. **PDF Parsing:** Utilizes `PyMuPDF` to extract text blocks and images, and `Camelot` to accurately extract tabular data (see the extraction sketch after this list).
2. **Content Separation:** Distinguishes between plain text, tables, and non-table images.
3. **Multimodal Description (for Tables & Images, using Gemma):**
   - For tables, the pipeline captures an image of the table and also uses its text representation.
   - For standalone images (e.g., graphs, charts), it captures the image.
   - These images (and optionally the table text) are then passed to the **Gemma 3 4B-IT model** (via the `gemma` Python library) with specific prompts to generate rich, descriptive text in French (see the description sketch below). **This step runs locally and does not incur direct API costs.**
4. **Multilingual Text Embedding (via Vertex AI):**
   - The cleaned text content (original text chunks, or generated descriptions for tables and images) is passed to the `text-multilingual-embedding-002` model via Vertex AI (see the embedding sketch below).
   - This model generates a 768-dimensional embedding vector for each piece of content. **This step connects to Google Cloud Vertex AI and will incur costs.**
5. **JSON Output:** All generated embeddings, along with rich metadata (original PDF, page, content type, links to extracted assets), are compiled into a single JSON file.
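Step 1 maps onto two libraries. The following is a minimal, self-contained sketch of that extraction, not the repository's exact code: the paths, the `lattice` table flavor, and the pixel-format handling are illustrative assumptions.

```python
# Minimal extraction sketch (step 1): text blocks and images via PyMuPDF,
# tables via Camelot. Paths and the "lattice" flavor are illustrative.
from pathlib import Path

import camelot
import fitz  # PyMuPDF

pdf_path = Path("docs/your_document.pdf")
out_dir = Path("output/extracted_graphs")
out_dir.mkdir(parents=True, exist_ok=True)

doc = fitz.open(str(pdf_path))
for page_index, page in enumerate(doc):
    # Each text block is (x0, y0, x1, y1, text, block_no, block_type)
    for block in page.get_text("blocks"):
        print(f"page {page_index + 1}: {block[4][:60]!r}")

    # Save each embedded image as a PNG
    for img_index, img in enumerate(page.get_images(full=True)):
        pix = fitz.Pixmap(doc, img[0])
        if pix.n - pix.alpha >= 4:  # convert CMYK and similar to RGB
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.save(str(out_dir / f"{pdf_path.stem}_p{page_index + 1}_img{img_index}.png"))

# Tables via Camelot (the "lattice" flavor requires Ghostscript, see Setup below)
tables = camelot.read_pdf(str(pdf_path), pages="all", flavor="lattice")
for t_index, table in enumerate(tables):
    print(f"table {t_index} on page {table.page}: shape {table.df.shape}")
```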
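Step 3's description generation follows the `gemma` library's documented chat-sampler pattern. Treat the sketch below as an assumption-laden outline rather than the repository's code: the class and checkpoint names (`gm.nn.Gemma3_4B`, `gm.ckpts.CheckpointPath.GEMMA3_4B_IT`) and the `images=` keyword follow the library's README and can change between releases; the image path and the French prompt are illustrative.

```python
# Hedged sketch of the Gemma description step (step 3). API names follow the
# `gemma` library's README and may differ between releases.
import numpy as np
from PIL import Image
from gemma import gm

model = gm.nn.Gemma3_4B()
params = gm.ckpts.load_params(gm.ckpts.CheckpointPath.GEMMA3_4B_IT)
sampler = gm.text.ChatSampler(model=model, params=params)

# Illustrative path; in the pipeline this would be an extracted table or graph image.
image = np.asarray(Image.open("output/extracted_graphs/sample_p3_img0.png").convert("RGB"))

# The prompt asks for a French description, matching the pipeline's output language.
prompt = (
    "Décris cette image en français : type de graphique, axes, "
    "tendances et valeurs remarquables. <start_of_image>"
)
description = sampler.chat(prompt, images=[image])
print(description)
```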
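Step 4 is a straightforward Vertex AI SDK call. In the sketch below, the `RETRIEVAL_DOCUMENT` task type and the sample chunks are illustrative assumptions; the call itself is what incurs GCP costs.

```python
# Embedding sketch (step 4): 768-dimensional multilingual embeddings via
# Vertex AI. This call is billed to your GCP project.
import os

import vertexai
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

vertexai.init(
    project=os.environ["GOOGLE_CLOUD_PROJECT"],
    location=os.environ.get("VERTEX_AI_LOCATION", "us-central1"),
)
model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")

chunks = [
    "Un extrait de texte de la première page...",
    "Description en français du tableau : ventes mensuelles par région...",
]
inputs = [TextEmbeddingInput(text, task_type="RETRIEVAL_DOCUMENT") for text in chunks]
for embedding in model.get_embeddings(inputs):
    print(len(embedding.values))  # 768
```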
## Requirements & Setup

This pipeline uses a combination of a local model (Gemma) and **Google Cloud Platform** services (Vertex AI).

1. **Google Cloud Project with Billing Enabled (for Text Embeddings):**
   - **CRITICAL:** The text embedding step uses Google Cloud Vertex AI and **will incur costs** on your Google Cloud Platform account. Ensure you have an [active billing account](https://cloud.google.com/billing/docs/how-to/create-billing-account) linked to your project.
   - Enable the **Vertex AI API**.
2. **Authentication for Google Cloud (for Text Embeddings):**
   - The easiest way to authenticate in a Colab environment is `google.colab.auth.authenticate_user()`.
   - For local execution, ensure your Google Cloud SDK is configured and authenticated (`gcloud auth application-default login`).
3. **Hardware Requirements (for Gemma):**
   - Running the Gemma 3 4B-IT model requires a **GPU with sufficient VRAM** (e.g., a Colab T4 or V100, or a local GPU with at least ~8-10 GB of VRAM). Without a GPU, Gemma will likely run on CPU, but significantly slower.

### Local Setup

1. **Clone the repository:**

   ```bash
   git clone https://huggingface.co/Anonymous1223334444/pdf-multimodal-multilingual-embedding-pipeline
   cd pdf-multimodal-multilingual-embedding-pipeline
   ```

2. **Install Python dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

   **System-level dependencies for Camelot/PyMuPDF (Linux/Colab):** You might need the following system packages for `PyMuPDF` and `Camelot` to function correctly.

   ```bash
   # Update package list
   sudo apt-get update
   # Install Ghostscript (required by Camelot)
   sudo apt-get install -y ghostscript
   # Install python3-tk (required by some PyMuPDF functionality)
   sudo apt-get install -y python3-tk
   # Install OpenCV (via apt, for camelot-py[cv])
   sudo apt-get install -y libopencv-dev python3-opencv
   ```

   *Note: On Windows or macOS, the installation steps for `camelot-py` differ. Refer to the [Camelot documentation](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) for details.*

3. **Set up environment variables (for Vertex AI text embeddings):**

   ```bash
   export GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
   export VERTEX_AI_LOCATION="us-central1"  # Or your preferred Vertex AI region (e.g., us-east4)
   ```

   Replace `your-gcp-project-id` and `us-central1` with your actual Google Cloud Project ID and Vertex AI region.

4. **Place your PDF files:** Create a `docs` directory in the root of the repository and place your PDF documents inside it.

   ```
   pdf-multimodal-multilingual-embedding-pipeline/
   └── docs/
       ├── your_document.pdf
       └── another_document.pdf
   ```

5. **Run the pipeline:**

   ```bash
   python run_pipeline.py
   ```

   The generated embedding file (`embeddings_statistiques_multimodal.json`) and extracted assets are saved in the `output/` directory.

### Google Colab Usage

A Colab notebook version of this pipeline is ideal for quick experimentation thanks to pre-configured environments and GPU access.

1. **Open a new Google Colab notebook.**
2. **Change the runtime to GPU:** Go to `Runtime > Change runtime type` and select `T4 GPU` or `V100 GPU`.
3. **Install system and Python dependencies:**

   ```python
   !pip uninstall -y camelot camelot-py  # Ensure a clean install
   !pip install PyMuPDF
   !apt-get update
   !apt-get install -y ghostscript python3-tk libopencv-dev python3-opencv
   !pip install camelot-py[cv] google-cloud-aiplatform tiktoken pandas beautifulsoup4 Pillow gemma jax jaxlib numpy
   ```

4. **Authenticate to Google Cloud (for Vertex AI):**

   ```python
   from google.colab import auth
   auth.authenticate_user()
   ```

5. **Set your Google Cloud Project ID and location:**

   ```python
   import os

   # Replace with your actual Google Cloud Project ID
   os.environ["GOOGLE_CLOUD_PROJECT"] = "YOUR_GCP_PROJECT_ID_HERE"
   # Set your preferred Vertex AI location (e.g., "us-central1", "us-east4")
   os.environ["VERTEX_AI_LOCATION"] = "us-central1"
   # Critical: let JAX use the full GPU memory for Gemma
   os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"
   ```

6. **Upload your PDF files:** Use the Colab file upload feature or mount Google Drive. Ensure your PDFs end up in a directory named `docs` within `/content/`.

   ```python
   # Example: upload PDFs into /content/docs
   import os
   from pathlib import Path

   from google.colab import files

   PDF_DIRECTORY = Path("/content/docs")
   PDF_DIRECTORY.mkdir(parents=True, exist_ok=True)

   uploaded = files.upload()
   for filename in uploaded.keys():
       os.rename(filename, PDF_DIRECTORY / filename)
   ```

7. **Copy the code from `src/pdf_processor.py`, `src/embedding_utils.py`, and `run_pipeline.py` into Colab cells and execute them.** Execute the `embedding_utils.py` content first, then `pdf_processor.py`, then `run_pipeline.py`, or combine them logically into your notebook.

## Output

The pipeline generates:

- `output/embeddings_statistiques_multimodal.json`: A JSON file containing all generated embeddings and their metadata.
- `output/extracted_graphs/`: Extracted images (PNG format).
- `output/extracted_tables/`: HTML representations of extracted tables.

## Example `embeddings_statistiques_multimodal.json` Entry

```json
[
  {
    "pdf_file": "sample.pdf",
    "page_number": 1,
    "chunk_id": "text_0",
    "content_type": "text",
    "text_content": "This is a chunk of text extracted from the first page of the document...",
    "embedding": [0.123, -0.456, ..., 0.789],
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  },
  {
    "pdf_file": "sample.pdf",
    "page_number": 2,
    "chunk_id": "table_0",
    "content_type": "table",
    "text_content": "Description en français du tableau: Ce tableau présente les ventes mensuelles par région. Il inclut les colonnes Mois, Région, et Ventes. La région Nord a la plus forte croissance...",
    "embedding": [-0.987, 0.654, ..., 0.321],
    "table_html_url": "/static/extracted_tables/sample_p2_table0.html",
    "image_url": "/static/extracted_graphs/sample_p2_table0.png",
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  },
  {
    "pdf_file": "sample.pdf",
    "page_number": 3,
    "chunk_id": "image_0",
    "content_type": "image",
    "text_content": "Description en français de l'image: Ce graphique est un histogramme montrant la répartition des âges dans la population. L'axe des X représente les tranches d'âge et l'axe des Y la fréquence. La majorité de la population se situe entre 25 et 40 ans.",
    "embedding": [0.456, -0.789, ..., 0.123],
    "image_url": "/static/extracted_graphs/sample_p3_img0.png",
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  }
]
```
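As a sketch of the "ready for RAG" claim, the snippet below loads this file, embeds a query with the same Vertex AI model, and ranks chunks by cosine similarity. It assumes `vertexai.init(...)` has been called as in the setup above; the query string and the `RETRIEVAL_QUERY` task type are illustrative choices, not part of the pipeline itself.

```python
# Hypothetical downstream use: cosine-similarity retrieval over the output file.
import json

import numpy as np
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

with open("output/embeddings_statistiques_multimodal.json", encoding="utf-8") as f:
    entries = json.load(f)

# L2-normalize stored vectors so a dot product equals cosine similarity
matrix = np.array([entry["embedding"] for entry in entries], dtype=np.float32)
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")
query = "Quelle région a la plus forte croissance des ventes ?"
q = np.array(model.get_embeddings([TextEmbeddingInput(query, task_type="RETRIEVAL_QUERY")])[0].values)
q /= np.linalg.norm(q)

# Print the top-3 most similar chunks
for i in np.argsort(matrix @ q)[::-1][:3]:
    entry = entries[i]
    print(entry["pdf_file"], entry["chunk_id"], entry["content_type"])
```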
## Acknowledgments

This pipeline leverages the power of:

- Google Gemma
- Google Cloud Vertex AI
- PyMuPDF
- Camelot
- Tiktoken
- Pandas
- BeautifulSoup