---
tags:
- multimodal
- multilingual
- pdf
- embeddings
- rag
- google-cloud
- vertex-ai
- gemma
- python
datasets:
- no_dataset
license: mit
---

# Multimodal & Multilingual PDF Embedding Pipeline with Gemma and Vertex AI

This repository hosts a Python pipeline that extracts text, tables, and images from PDF documents, generates multimodal descriptions for visual content (tables and images) using **Google's Gemma model (running locally)**, and then creates multilingual text embeddings for all extracted information using **Google Cloud Vertex AI's `text-multilingual-embedding-002` model**. The generated embeddings are stored in a JSON file, ready for use in Retrieval Augmented Generation (RAG) systems or other downstream applications.

**Key Features:**

- **Multimodal Descriptions (via Gemma):** Processes tables and images from PDFs, generating rich descriptive text in French using the open-weight Gemma 3 4B-IT model, which runs locally on your machine or on a Colab GPU.
- **Multilingual Text Embeddings (via Vertex AI):** Leverages Google Cloud's `text-multilingual-embedding-002` model, which supports a wide range of languages.
- **Structured Output:** Stores embeddings and metadata (PDF source, page number, content type, links to extracted assets) in a comprehensive JSON format.

## How it Works

1. **PDF Parsing:** Utilizes `PyMuPDF` to extract text blocks and images, and `Camelot` to accurately extract tabular data (see the extraction sketch after this list).
2. **Content Separation:** Distinguishes between plain text, tables, and non-table images.
3. **Multimodal Description (for Tables & Images, using Gemma):**
   - For tables, the pipeline captures an image of the table and also uses its text representation.
   - For standalone images (e.g., graphs, charts), it captures the image.
   - These images (and optionally the table text) are then passed to the **Gemma 3 4B-IT model** (via the `gemma` Python library) with specific prompts to generate rich, descriptive text in French (see the description sketch below). **This step runs locally and does not incur direct API costs.**
4. **Multilingual Text Embedding (via Vertex AI):**
   - The cleaned text content (original text chunks, or generated descriptions for tables and images) is passed to the `text-multilingual-embedding-002` model via Vertex AI (see the embedding sketch below).
   - This model generates a 768-dimensional embedding vector for each piece of content. **This step connects to Google Cloud Vertex AI and will incur costs.**
5. **JSON Output:** All generated embeddings, along with rich metadata (original PDF, page, content type, links to extracted assets), are compiled into a single JSON file.
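Step 1 maps onto two libraries. The following is a minimal, self-contained sketch of that extraction, not the repository's exact code: the paths, the `lattice` table flavor, and the pixel-format handling are illustrative assumptions.

```python
# Minimal extraction sketch (step 1): text blocks and images via PyMuPDF,
# tables via Camelot. Paths and the "lattice" flavor are illustrative.
from pathlib import Path

import camelot
import fitz  # PyMuPDF

pdf_path = Path("docs/your_document.pdf")
out_dir = Path("output/extracted_graphs")
out_dir.mkdir(parents=True, exist_ok=True)

doc = fitz.open(str(pdf_path))
for page_index, page in enumerate(doc):
    # Each text block is (x0, y0, x1, y1, text, block_no, block_type)
    for block in page.get_text("blocks"):
        print(f"page {page_index + 1}: {block[4][:60]!r}")

    # Save each embedded image as a PNG
    for img_index, img in enumerate(page.get_images(full=True)):
        pix = fitz.Pixmap(doc, img[0])
        if pix.n - pix.alpha >= 4:  # convert CMYK and similar to RGB
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.save(str(out_dir / f"{pdf_path.stem}_p{page_index + 1}_img{img_index}.png"))

# Tables via Camelot (the "lattice" flavor requires Ghostscript, see Setup below)
tables = camelot.read_pdf(str(pdf_path), pages="all", flavor="lattice")
for t_index, table in enumerate(tables):
    print(f"table {t_index} on page {table.page}: shape {table.df.shape}")
```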
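Step 3's description generation follows the `gemma` library's documented chat-sampler pattern. Treat the sketch below as an assumption-laden outline rather than the repository's code: the class and checkpoint names (`gm.nn.Gemma3_4B`, `gm.ckpts.CheckpointPath.GEMMA3_4B_IT`) and the `images=` keyword follow the library's README and can change between releases; the image path and the French prompt are illustrative.

```python
# Hedged sketch of the Gemma description step (step 3). API names follow the
# `gemma` library's README and may differ between releases.
import numpy as np
from PIL import Image
from gemma import gm

model = gm.nn.Gemma3_4B()
params = gm.ckpts.load_params(gm.ckpts.CheckpointPath.GEMMA3_4B_IT)
sampler = gm.text.ChatSampler(model=model, params=params)

# Illustrative path; in the pipeline this would be an extracted table or graph image.
image = np.asarray(Image.open("output/extracted_graphs/sample_p3_img0.png").convert("RGB"))

# The prompt asks for a French description, matching the pipeline's output language.
prompt = (
    "Décris cette image en français : type de graphique, axes, "
    "tendances et valeurs remarquables. <start_of_image>"
)
description = sampler.chat(prompt, images=[image])
print(description)
```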
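Step 4 is a straightforward Vertex AI SDK call. In the sketch below, the `RETRIEVAL_DOCUMENT` task type and the sample chunks are illustrative assumptions; the call itself is what incurs GCP costs.

```python
# Embedding sketch (step 4): 768-dimensional multilingual embeddings via
# Vertex AI. This call is billed to your GCP project.
import os

import vertexai
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

vertexai.init(
    project=os.environ["GOOGLE_CLOUD_PROJECT"],
    location=os.environ.get("VERTEX_AI_LOCATION", "us-central1"),
)
model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")

chunks = [
    "Un extrait de texte de la première page...",
    "Description en français du tableau : ventes mensuelles par région...",
]
inputs = [TextEmbeddingInput(text, task_type="RETRIEVAL_DOCUMENT") for text in chunks]
for embedding in model.get_embeddings(inputs):
    print(len(embedding.values))  # 768
```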
## Requirements & Setup

This pipeline uses a combination of a local model (Gemma) and **Google Cloud Platform** services (Vertex AI).

1. **Google Cloud Project with Billing Enabled (for Text Embeddings):**
   - **CRITICAL:** The text embedding step uses Google Cloud Vertex AI and **will incur costs** on your Google Cloud Platform account. Ensure you have an [active billing account](https://cloud.google.com/billing/docs/how-to/create-billing-account) linked to your project.
   - Enable the **Vertex AI API**.
2. **Authentication for Google Cloud (for Text Embeddings):**
   - The easiest way to authenticate in a Colab environment is `google.colab.auth.authenticate_user()`.
   - For local execution, ensure your Google Cloud SDK is configured and authenticated (`gcloud auth application-default login`).
3. **Hardware Requirements (for Gemma):**
   - Running the Gemma 3 4B-IT model requires a **GPU with sufficient VRAM** (e.g., a Colab T4 or V100, or a local GPU with at least ~8-10 GB of VRAM). Without a GPU, Gemma will likely run on CPU, but significantly slower.

### Local Setup

1. **Clone the repository:**

   ```bash
   git clone https://huggingface.co/Anonymous1223334444/pdf-multimodal-multilingual-embedding-pipeline
   cd pdf-multimodal-multilingual-embedding-pipeline
   ```

2. **Install Python dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

   **System-level dependencies for Camelot/PyMuPDF (Linux/Colab):** You might need the following system packages for `PyMuPDF` and `Camelot` to function correctly.

   ```bash
   # Update package list
   sudo apt-get update
   # Install Ghostscript (required by Camelot)
   sudo apt-get install -y ghostscript
   # Install python3-tk (required by some PyMuPDF functionality)
   sudo apt-get install -y python3-tk
   # Install OpenCV (via apt, for camelot-py[cv])
   sudo apt-get install -y libopencv-dev python3-opencv
   ```

   *Note: On Windows or macOS, the installation steps for `camelot-py` differ. Refer to the [Camelot documentation](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) for details.*

3. **Set up environment variables (for Vertex AI text embeddings):**

   ```bash
   export GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
   export VERTEX_AI_LOCATION="us-central1"  # Or your preferred Vertex AI region (e.g., us-east4)
   ```

   Replace `your-gcp-project-id` and `us-central1` with your actual Google Cloud Project ID and Vertex AI region.

4. **Place your PDF files:** Create a `docs` directory in the root of the repository and place your PDF documents inside it.

   ```
   pdf-multimodal-multilingual-embedding-pipeline/
   └── docs/
       ├── your_document.pdf
       └── another_document.pdf
   ```

5. **Run the pipeline:**

   ```bash
   python run_pipeline.py
   ```

   The generated embedding file (`embeddings_statistiques_multimodal.json`) and extracted assets are saved in the `output/` directory.

### Google Colab Usage

A Colab notebook version of this pipeline is ideal for quick experimentation thanks to pre-configured environments and GPU access.

1. **Open a new Google Colab notebook.**
2. **Change the runtime to GPU:** Go to `Runtime > Change runtime type` and select `T4 GPU` or `V100 GPU`.
3. **Install system and Python dependencies:**

   ```python
   !pip uninstall -y camelot camelot-py  # Ensure a clean install
   !pip install PyMuPDF
   !apt-get update
   !apt-get install -y ghostscript python3-tk libopencv-dev python3-opencv
   !pip install camelot-py[cv] google-cloud-aiplatform tiktoken pandas beautifulsoup4 Pillow gemma jax jaxlib numpy
   ```

4. **Authenticate to Google Cloud (for Vertex AI):**

   ```python
   from google.colab import auth
   auth.authenticate_user()
   ```

5. **Set your Google Cloud Project ID and location:**

   ```python
   import os

   # Replace with your actual Google Cloud Project ID
   os.environ["GOOGLE_CLOUD_PROJECT"] = "YOUR_GCP_PROJECT_ID_HERE"
   # Set your preferred Vertex AI location (e.g., "us-central1", "us-east4")
   os.environ["VERTEX_AI_LOCATION"] = "us-central1"
   # Critical: let JAX use the full GPU memory for Gemma
   os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"
   ```

6. **Upload your PDF files:** Use the Colab file upload feature or mount Google Drive. Ensure your PDFs end up in a directory named `docs` within `/content/`.

   ```python
   # Example: upload PDFs into /content/docs
   import os
   from pathlib import Path

   from google.colab import files

   PDF_DIRECTORY = Path("/content/docs")
   PDF_DIRECTORY.mkdir(parents=True, exist_ok=True)

   uploaded = files.upload()
   for filename in uploaded.keys():
       os.rename(filename, PDF_DIRECTORY / filename)
   ```

7. **Copy the code from `src/pdf_processor.py`, `src/embedding_utils.py`, and `run_pipeline.py` into Colab cells and execute them.** Execute the `embedding_utils.py` content first, then `pdf_processor.py`, then `run_pipeline.py`, or combine them logically into your notebook.

## Output

The pipeline generates:

- `output/embeddings_statistiques_multimodal.json`: A JSON file containing all generated embeddings and their metadata.
- `output/extracted_graphs/`: Extracted images (PNG format).
- `output/extracted_tables/`: HTML representations of extracted tables.

## Example `embeddings_statistiques_multimodal.json` Entry

```json
[
  {
    "pdf_file": "sample.pdf",
    "page_number": 1,
    "chunk_id": "text_0",
    "content_type": "text",
    "text_content": "This is a chunk of text extracted from the first page of the document...",
    "embedding": [0.123, -0.456, ..., 0.789],
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  },
  {
    "pdf_file": "sample.pdf",
    "page_number": 2,
    "chunk_id": "table_0",
    "content_type": "table",
    "text_content": "Description en français du tableau: Ce tableau présente les ventes mensuelles par région. Il inclut les colonnes Mois, Région, et Ventes. La région Nord a la plus forte croissance...",
    "embedding": [-0.987, 0.654, ..., 0.321],
    "table_html_url": "/static/extracted_tables/sample_p2_table0.html",
    "image_url": "/static/extracted_graphs/sample_p2_table0.png",
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  },
  {
    "pdf_file": "sample.pdf",
    "page_number": 3,
    "chunk_id": "image_0",
    "content_type": "image",
    "text_content": "Description en français de l'image: Ce graphique est un histogramme montrant la répartition des âges dans la population. L'axe des X représente les tranches d'âge et l'axe des Y la fréquence. La majorité de la population se situe entre 25 et 40 ans.",
    "embedding": [0.456, -0.789, ..., 0.123],
    "image_url": "/static/extracted_graphs/sample_p3_img0.png",
    "pdf_title": "Sample Document",
    "pdf_subject": "Data Analysis",
    "pdf_keywords": "statistics, report"
  }
]
```
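As a sketch of the "ready for RAG" claim, the snippet below loads this file, embeds a query with the same Vertex AI model, and ranks chunks by cosine similarity. It assumes `vertexai.init(...)` has been called as in the setup above; the query string and the `RETRIEVAL_QUERY` task type are illustrative choices, not part of the pipeline itself.

```python
# Hypothetical downstream use: cosine-similarity retrieval over the output file.
import json

import numpy as np
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

with open("output/embeddings_statistiques_multimodal.json", encoding="utf-8") as f:
    entries = json.load(f)

# L2-normalize stored vectors so a dot product equals cosine similarity
matrix = np.array([entry["embedding"] for entry in entries], dtype=np.float32)
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")
query = "Quelle région a la plus forte croissance des ventes ?"
q = np.array(model.get_embeddings([TextEmbeddingInput(query, task_type="RETRIEVAL_QUERY")])[0].values)
q /= np.linalg.norm(q)

# Print the top-3 most similar chunks
for i in np.argsort(matrix @ q)[::-1][:3]:
    entry = entries[i]
    print(entry["pdf_file"], entry["chunk_id"], entry["content_type"])
```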
## Acknowledgments

This pipeline leverages the power of:

- Google Gemma
- Google Cloud Vertex AI
- PyMuPDF
- Camelot
- Tiktoken
- Pandas
- BeautifulSoup