Multimodal & Multilingual PDF Embedding Pipeline
This repository hosts a Python pipeline that extracts text, tables, and images from PDF documents, generates text descriptions for the visual content (tables and images), and then creates multilingual text embeddings for all extracted information. The embeddings are stored in a JSON file, ready for use in Retrieval-Augmented Generation (RAG) systems or other downstream applications.
Key Features:
- Multimodal: Processes text, tables, and images from PDFs.
- Multilingual: Uses Google's text-multilingual-embedding-002 model for embeddings, supporting a wide range of languages.
- Contextual Descriptions: Uses Google Gemini (Gemini 1.5 Flash) to generate descriptive text for tables and images in French.
- Structured Output: Stores embeddings and metadata (PDF source, page number, content type, links to extracted assets) in a comprehensive JSON format.
How it Works
- PDF Parsing: Uses PyMuPDF to extract text blocks and images, and Camelot to extract tabular data accurately.
- Content Separation: Distinguishes between plain text, tables, and non-table images.
- Multimodal Description (for Tables & Images):
- For tables, the pipeline captures an image of the table and also uses its text representation.
- For standalone images (e.g., graphs, charts), it captures the image.
- These images are then sent to the gemini-1.5-flash-latest model (via google.generativeai) with specific prompts to generate rich, descriptive text in French.
- Multilingual Text Embedding:
- The cleaned text content (original text chunks, or generated descriptions for tables/images) is then passed to the text-multilingual-embedding-002 model (via Vertex AI).
- This model generates a high-dimensional embedding vector (768 dimensions) for each piece of content.
- JSON Output: All generated embeddings, along with rich metadata (original PDF, page, content type, links to extracted assets), are compiled into a single JSON file.
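The embedding and JSON output steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the helper names (embed_texts, build_record, write_records) and the batch size are assumptions, while the model name and the record fields follow the ones given in this README. The Vertex AI SDK is imported lazily, so the record helpers work without it.

```python
import json
from pathlib import Path


def embed_texts(texts, project, location, batch_size=16):
    """Embed texts with text-multilingual-embedding-002 via Vertex AI.

    Hypothetical helper. batch_size=16 is an arbitrary illustrative value,
    not a documented limit.
    """
    # Imported lazily so the other helpers run without the SDK installed.
    import vertexai
    from vertexai.language_models import TextEmbeddingModel

    vertexai.init(project=project, location=location)
    model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")

    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Each returned embedding exposes its vector as `.values`.
        vectors.extend(e.values for e in model.get_embeddings(batch))
    return vectors


def build_record(pdf_file, page_number, chunk_id, content_type,
                 text_content, embedding, **metadata):
    """Assemble one output entry using the field names of the example JSON."""
    record = {
        "pdf_file": pdf_file,
        "page_number": page_number,
        "chunk_id": chunk_id,
        "content_type": content_type,
        "text_content": text_content,
        "embedding": list(embedding),
    }
    # Optional fields such as image_url, table_html_url, or pdf_title.
    record.update(metadata)
    return record


def write_records(records, path):
    """Write all records to a single UTF-8 JSON file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

Passing ensure_ascii=False keeps the French descriptions readable in the output file instead of escaping accented characters.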
Requirements & Setup
This pipeline relies on Google Cloud Platform services and specific Python libraries. You will need:
- A Google Cloud Project:
- Enable the Vertex AI API.
- Enable the Generative Language API (for Gemini 1.5 Flash descriptions).
- Authentication:
- Google Cloud Authentication: The easiest way to run this in a Colab environment is using google.colab.auth.authenticate_user(). For local execution, ensure your Google Cloud SDK is configured and authenticated (gcloud auth application-default login).
- Gemini API Key: An API key for the Google AI Gemini models. You can get one from Google AI Studio. Set this as an environment variable or directly in the code (though environment variables are recommended for security).
Local Setup
Clone the repository:
    git clone https://huggingface.co/Anonymous1223334444/pdf-multimodal-multilingual-embedding-pipeline
    cd pdf-multimodal-multilingual-embedding-pipeline
Install dependencies:
    pip install -r requirements.txt
System-level dependencies for Camelot/PyMuPDF (Linux/Colab): You might need to install these system packages for PyMuPDF and Camelot to function correctly.
    # Update package list
    sudo apt-get update
    # Install Ghostscript (required by Camelot)
    sudo apt-get install -y ghostscript
    # Install python3-tk (required by some PyMuPDF functionalities)
    sudo apt-get install -y python3-tk
    # Install OpenCV (via apt, for camelot-py[cv])
    sudo apt-get install -y libopencv-dev python3-opencv
Note: If you are running on Windows or macOS, the installation steps for camelot-py might differ. Refer to the Camelot documentation for more details.
Set up Environment Variables:
    export GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
    export VERTEX_AI_LOCATION="us-central1"  # Or your preferred Vertex AI region (e.g., us-east4)
    export GEMINI_API_KEY="your-gemini-api-key"
Replace your-gcp-project-id, us-central1, and your-gemini-api-key with your actual values.
Place your PDF files: Create a docs directory in the root of the repository and place your PDF documents inside it.
    pdf-multimodal-multilingual-embedding-pipeline/
    ├── docs/
    │   ├── your_document.pdf
    │   └── another_document.pdf
Run the pipeline:
    python run_pipeline.py
The generated embedding file (embeddings_statistiques_multimodal.json) and extracted assets will be saved in the output/ directory.
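Before running the pipeline locally, it can help to check that the three environment variables are set. The sketch below assumes the variable names from the export commands above. The load_settings helper is hypothetical and may not match how run_pipeline.py actually reads its configuration.

```python
import os

# Variable names taken from the local setup instructions.
REQUIRED_VARS = ("GOOGLE_CLOUD_PROJECT", "VERTEX_AI_LOCATION", "GEMINI_API_KEY")


def load_settings(env=None):
    """Return the required settings as a dict, or fail with a clear message.

    `env` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if env is None else env
    # Treat unset and empty variables the same way.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling load_settings() at the start of a script surfaces a missing key immediately, rather than partway through processing a PDF.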
Google Colab Usage
Running the pipeline in a Colab notebook is convenient for quick experimentation, since much of the environment is pre-configured.
- Open a new Google Colab notebook.
- Install system dependencies:
    # Ensure clean install
    !pip uninstall -y camelot camelot-py
    !pip install PyMuPDF
    !apt-get update
    !apt-get install -y ghostscript python3-tk libopencv-dev python3-opencv
    !pip install camelot-py[cv] google-cloud-aiplatform tiktoken pandas beautifulsoup4 Pillow
- Authenticate:
    from google.colab import auth
    auth.authenticate_user()
- Set your API Key and Project/Location:
    import os
    # Replace with your actual Gemini API key
    os.environ["GENAI_API_KEY"] = "YOUR_GEMINI_API_KEY_HERE"
    # Replace with your actual Google Cloud Project ID
    os.environ["GOOGLE_CLOUD_PROJECT"] = "YOUR_GCP_PROJECT_ID_HERE"
    # Set your preferred Vertex AI location (e.g., "us-central1", "us-east4")
    os.environ["VERTEX_AI_LOCATION"] = "us-central1"
- Upload your PDF files:
You can use the Colab file upload feature or mount Google Drive. Ensure your PDFs are in a directory named docs within /content/.
    # Example for uploading
    import os
    from pathlib import Path
    from google.colab import files
    PDF_DIRECTORY = Path("/content/docs")
    PDF_DIRECTORY.mkdir(parents=True, exist_ok=True)
    uploaded = files.upload()
    # Move each uploaded file into the docs directory
    for filename in uploaded.keys():
        os.rename(filename, PDF_DIRECTORY / filename)
- Copy and paste the code from run_pipeline.py (and src/ files if you don't use modules) into Colab cells and execute.
Output
The pipeline will generate:
- embeddings_statistiques_multimodal.json: A JSON file containing all generated embeddings and their metadata.
- output/extracted_graphs/: Directory containing extracted images (PNG format).
- output/extracted_tables/: Directory containing HTML representations of extracted tables.
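As an illustration of how PNG assets might end up in output/extracted_graphs/, here is a minimal PyMuPDF sketch. It is an assumption about the approach, not the repository's code. The helper names are hypothetical, the file naming pattern is inferred from the example entries in this README (such as sample_p3_img0.png), and it saves every embedded image without the table/non-table separation the pipeline performs.

```python
from pathlib import Path


def asset_name(pdf_path, page_number, kind, index, ext):
    """Build names such as sample_p3_img0.png (page is 1-based, index 0-based)."""
    return f"{Path(pdf_path).stem}_p{page_number}_{kind}{index}.{ext}"


def save_page_images(pdf_path, out_dir="output/extracted_graphs"):
    """Save every embedded image of a PDF as PNG and return the written paths."""
    import fitz  # PyMuPDF, imported lazily so asset_name works without it

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for index, image in enumerate(page.get_images(full=True)):
                pix = fitz.Pixmap(doc, image[0])  # image[0] is the image xref
                if pix.n - pix.alpha > 3:
                    # More than three colour channels (e.g. CMYK): convert to RGB
                    pix = fitz.Pixmap(fitz.csRGB, pix)
                target = out_dir / asset_name(pdf_path, page_number, "img", index, "png")
                pix.save(str(target))
                written.append(target)
    return written
```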
Example embeddings_statistiques_multimodal.json Entry
[
{
"pdf_file": "sample.pdf",
"page_number": 1,
"chunk_id": "text_0",
"content_type": "text",
"text_content": "This is a chunk of text extracted from the first page of the document...",
"embedding": [0.123, -0.456, ..., 0.789],
"pdf_title": "Sample Document",
"pdf_subject": "Data Analysis",
"pdf_keywords": "statistics, report"
},
{
"pdf_file": "sample.pdf",
"page_number": 2,
"chunk_id": "table_0",
"content_type": "table",
"text_content": "Description en français du tableau: Ce tableau présente les ventes mensuelles par région. Il inclut les colonnes Mois, Région, et Ventes. La région Nord a la plus forte croissance...",
"embedding": [-0.987, 0.654, ..., 0.321],
"table_html_url": "/static/extracted_tables/sample_p2_table0.html",
"image_url": "/static/extracted_graphs/sample_p2_table0.png",
"pdf_title": "Sample Document",
"pdf_subject": "Data Analysis",
"pdf_keywords": "statistics, report"
},
{
"pdf_file": "sample.pdf",
"page_number": 3,
"chunk_id": "image_0",
"content_type": "image",
"text_content": "Description en français de l'image: Ce graphique est un histogramme montrant la répartition des âges dans la population. L'axe des X représente les tranches d'âge et l'axe des Y la fréquence. La majorité de la population se situe entre 25 et 40 ans.",
"embedding": [0.456, -0.789, ..., 0.123],
"image_url": "/static/extracted_graphs/sample_p3_img0.png",
"pdf_title": "Sample Document",
"pdf_subject": "Data Analysis",
"pdf_keywords": "statistics, report"
}
]
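For downstream use such as RAG retrieval, entries in this format can be ranked against a query vector by cosine similarity. The following is a minimal standard-library sketch. The function names are illustrative, and it assumes the query vector comes from the same embedding model, so that it has the same dimensionality as the stored vectors.

```python
import json
import math


def load_records(path):
    """Load the list of entries from the pipeline's JSON output."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, records, k=3):
    """Return the k (score, record) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, r["embedding"]), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A linear scan like this is fine for a handful of documents. For larger collections, a vector index or database would be the usual choice.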
Acknowledgments
This pipeline leverages the power of:
- Google Cloud Vertex AI
- Google AI Gemini Models
- PyMuPDF
- Camelot
- Tiktoken
- Pandas
- BeautifulSoup