---
language: en
license: gemma
library_name: pytorch
tags:
- embeddings
- sentence-transformers
- text-embeddings
- semantic-search
- int8-quantized
- knowledge-distillation
- leaf
- embeddinggemma
pipeline_tag: feature-extraction
base_model: google/embeddinggemma-300m
---

# LEAF Embeddings - INT8 Quantized (FAILED v1 - DO NOT USE)

**🚨 CRITICAL: This model FAILED quality evaluation - DO NOT USE for production.**

**⚠️ This is experiment v1 (512 tokens) - kept for research purposes only.**

**Status**: Training completed successfully, but MTEB evaluation shows critical quality loss. This model serves as a baseline for comparison with the improved v2 model (2048 tokens, better architecture) currently in development.

## Model Description

This model is a **distilled and quantized version** of [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m) trained using the **LEAF (Layer-wise Early-exit Alignment Framework)** methodology. It generates 768-dimensional embeddings optimized for fast CPU inference with INT8 quantization.

### What is LEAF?

LEAF is a knowledge distillation framework that:

- **Compresses** larger embedding models into smaller, faster versions
- **Preserves** semantic quality through multi-objective training (distillation + alignment + contrastive losses)
- **Optimizes** for CPU deployment with INT8 post-training quantization

### Architecture

| Property | This Model (LEAF) | Base Model (EmbeddingGemma-300m) |
|----------|-------------------|----------------------------------|
| **Dimensions** | 768D | 768D (also 512D, 256D, 128D via Matryoshka) |
| **Parameters** | ~75M (6 layers, compressed) | 300M (full architecture) |
| **Max Tokens** | 512 | 2048 |
| **Quantization** | INT8 (441MB) | FP32 (~600MB) |
| **Inference Speed** | 695 texts/s (CPU) | ~50-100 texts/s (CPU) |

**Trade-offs**:

- ✅ **6-10x faster** inference on CPU
- ✅ **Smaller model size** (441MB vs ~600MB)
- ✅ **Lower memory** footprint
- ⚠️ **Reduced context** length (512 vs 2048 tokens)
- ❌ **Severe quality loss** from distillation (confirmed by the MTEB results below)

## Performance

### Inference Speed (CPU)

- **Throughput**: 695 texts/second
- **Latency**: ~1.4ms per text
- **Memory**: ~500MB RAM
- **Hardware**: Standard CPU, no GPU required

### ❌ ACTUAL QUALITY (MTEB Evaluation - FAILED)

**Evaluation Date**: 2025-10-12
**Status**: ❌ **CRITICAL FAILURE** - Model does not capture semantic relationships

| Dataset | Metric | This Model (v1) | Base Model | Quality Loss |
|---------|--------|-----------------|------------|--------------|
| **STSBenchmark** | Spearman | **0.223** | 0.81 | **-72%** ❌ |
| **STS22 English** | Spearman | **0.373** | 0.75 | **-50%** ❌ |
| **STS22 Average** | Spearman | **~0.21** | 0.65 | **-68%** ❌ |
| **Cross-lingual** | Spearman | **-0.14 to 0.12** | 0.55 | **Complete loss** ❌ |

**Detailed STS22 Results by Language**:

| Language | Spearman | Status |
|----------|----------|--------|
| 🇨🇳 Chinese | 0.499 | 🟡 Moderate (best) |
| 🇸🇦 Arabic | 0.469 | 🟡 Moderate |
| 🇮🇹 Italian | 0.435 | 🟡 Moderate |
| 🇪🇸 Spanish | 0.403 | 🟠 Poor |
| 🇬🇧 English | 0.373 | 🟠 Poor |
| 🇫🇷 French | 0.300 | 🔴 Very poor |
| 🇷🇺 Russian | 0.268 | 🔴 Very poor |
| 🇹🇷 Turkish | 0.247 | 🔴 Very poor |
| 🇩🇪 German | 0.163 | ❌ Critical |
| 🇵🇱 Polish | 0.132 | ❌ Critical |

**Cross-lingual pairs (translation tasks)**: All **FAILED** (scores between -0.143 and 0.002)

**Conclusion**: This model **cannot be used for semantic search, similarity tasks, or any production use**. The embeddings do not preserve the semantic meaning of the base model.
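The Spearman numbers above come from MTEB-style STS evaluation: embed both sentences of each pair, compute their cosine similarity, and rank-correlate the similarities against human scores. A minimal sketch of that computation is below; the dataset id `mteb/stsbenchmark-sts` and its column names are assumptions, and `model` is the LEAF model loaded as shown in the Usage section further down.

```python
# Rough reproduction of an STS Spearman score (illustrative only; the official
# numbers above were produced with the MTEB harness).
import numpy as np
from datasets import load_dataset
from scipy.stats import spearmanr

# Assumed dataset id and columns: sentence1, sentence2, score.
sts = load_dataset("mteb/stsbenchmark-sts", split="test")

# `model` is the LEAF model loaded as in the Usage section below.
emb1 = np.asarray(model.encode(sts["sentence1"], device="cpu", normalize=True))
emb2 = np.asarray(model.encode(sts["sentence2"], device="cpu", normalize=True))

# With normalized embeddings, cosine similarity is a row-wise dot product.
cosine = (emb1 * emb2).sum(axis=1)

# Rank correlation against the human similarity judgments.
rho, _ = spearmanr(cosine, sts["score"])
print(f"STSBenchmark Spearman: {rho:.3f}")
```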
### Training Quality Analysis

**Training Metrics** (from [WandB logs](https://wandb.ai/seedship/embeddinggemma-leaf/runs/savq3l32)):

| Metric | Final Value | Status |
|--------|-------------|--------|
| Distillation Loss | 0.976 | ✅ Good - Model learned from teacher |
| Alignment Loss | 2.18 | ⚠️ Moderate - Semantic space alignment could improve |
| Training Steps | 12,500 (3 epochs) | ✅ Complete |
| Training Time | 2h10min | ✅ Efficient |
| Eval Loss | NaN | ❌ Bug in evaluation aggregation |

**Observations**:

- ✅ Training **converged smoothly** without crashes
- ✅ Distillation loss **stable and low** (0.976) - good knowledge transfer
- ⚠️ Alignment loss **moderate** (2.18) - room for improvement
- ❌ Evaluation metrics **not computed** (NaN) - required the separate MTEB evaluation reported above
- 📊 **17 checkpoints saved** - can select best performing model

**Quality Verdict**: ❌ **FAILED** - Despite the low distillation loss, the model failed to learn semantic representations.

### 🔍 Failure Analysis

**What went wrong**:

1. **Architecture Too Aggressive** ❌
   - 6 layers are too few for semantic preservation (should be 12+)
   - 4x compression (300M → 75M) lost critical information
   - Hidden size ratio of 0.5x is insufficient

2. **Insufficient Training Data** ❌
   - Only 50k samples for 100+ languages
   - Mostly English data (NLI, STS, MS MARCO)
   - No multilingual balance

3. **Misleading Distillation Loss** ⚠️
   - A low distillation loss (0.976) does not guarantee semantic quality
   - **The high alignment loss (2.18) was the real warning sign**
   - The model learned to mimic the output distribution but not semantic meaning

4. **Evaluation Bug** ❌
   - Eval loss = NaN prevented early detection of the failure
   - Quality issues should have been caught during training

**Lessons learned for v2**:

- ✅ Monitor **alignment loss** as the primary metric (target: <1.0)
- ✅ Increase student size to 120M params (12 layers)
- ✅ Use 200k+ multilingual samples
- ✅ Implement proper eval during training (MTEB subset every 500 steps)
- ✅ Train with 2048-token context
- ✅ Curriculum learning: 512→1024→2048 tokens progressively

## Training Details

### Methodology

1. **Knowledge Distillation** from EmbeddingGemma-300m (300M → 75M params)
2. **LEAF Framework** with multi-objective training (sketched below):
   - Distillation loss (0.5 weight)
   - Alignment loss (1.0 weight)
   - Contrastive loss (0.3 weight)
3. **INT8 Quantization** for CPU optimization
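A minimal sketch of how these three objectives could be combined with the weights above. The exact loss definitions used in this run are not documented here; MSE distillation, cosine-based alignment, and in-batch InfoNCE are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Loss weights taken from the methodology list above.
W_DISTILL, W_ALIGN, W_CONTRAST = 0.5, 1.0, 0.3

def leaf_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor,
              temperature: float = 0.05) -> torch.Tensor:
    """student_emb / teacher_emb: (batch, 768) embeddings of the same texts."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)

    # Distillation: match the teacher embeddings directly (assumed MSE).
    distill = F.mse_loss(s, t)

    # Alignment: penalize the angular gap between student and teacher vectors.
    align = (1.0 - (s * t).sum(dim=-1)).mean()

    # Contrastive: in-batch InfoNCE with teacher embeddings as positives.
    logits = s @ t.T / temperature
    labels = torch.arange(s.size(0), device=s.device)
    contrast = F.cross_entropy(logits, labels)

    return W_DISTILL * distill + W_ALIGN * align + W_CONTRAST * contrast
```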
### Training Configuration

- **Teacher Model**: `google/embeddinggemma-300m`
- **Training Data**: 50,000 samples from:
  - `sentence-transformers/all-nli`
  - `sentence-transformers/stsb`
  - `ms_marco`
- **Validation**: 5,000 samples
- **Training Steps**: 12,500 (3 epochs)
- **Hardware**: NVIDIA RTX 4050 (6GB VRAM)
- **Training Time**: ~2h10min
- **Final Losses**:
  - Distillation: 0.976
  - Alignment: 2.18

### Student Architecture

- **Layers**: 6 (vs the much deeper teacher)
- **Attention Heads**: 6
- **Hidden Size Ratio**: 0.5x
- **Compression Ratio**: 4x

### Training Logs

View full training metrics on [WandB](https://wandb.ai/seedship/embeddinggemma-leaf/runs/savq3l32).
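The final methodology step (INT8 post-training quantization) is not documented in detail in this card. Below is a minimal sketch, assuming PyTorch dynamic quantization of the linear layers and the same checkpoint layout as in the Usage section; the actual procedure used to produce `model_quantized.pt` may differ.

```python
import torch

# Load the FP32 checkpoint (assumed to use the same {'model': ...} layout
# as model_quantized.pt in the Usage section).
fp32_model = torch.load("model_fp32.pt", map_location="cpu",
                        weights_only=False)["model"]
fp32_model.eval()

# Dynamic post-training quantization: nn.Linear weights are stored as INT8
# and dequantized on the fly during CPU inference.
int8_model = torch.quantization.quantize_dynamic(
    fp32_model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)

torch.save({"model": int8_model}, "model_quantized.pt")
```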
## Usage

### Requirements

```bash
pip install "torch>=2.6.0" "transformers>=4.57.0" huggingface-hub
```

### Basic Usage

```python
import torch
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Download model
model_path = hf_hub_download(
    repo_id="tss-deposium/gemma300-leaf-embeddings-test",
    filename="model_quantized.pt"
)

# Load model
checkpoint = torch.load(model_path, map_location='cpu', weights_only=False)
model = checkpoint['model']
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "tss-deposium/gemma300-leaf-embeddings-test"
)
model.set_tokenizer(tokenizer)

# Generate embeddings
texts = ["Hello world", "Machine learning"]
with torch.no_grad():
    embeddings = model.encode(texts, device='cpu', normalize=True)

print(embeddings.shape)  # (2, 768)
```

### API Integration

This model is deployed as part of a FastAPI service:

```python
import requests

response = requests.post(
    "https://your-api-url/api/embed",
    json={"model": "leaf", "input": "Your text here"}
)
embeddings = response.json()["embeddings"]
```

## Model Card

| Property | Value |
|----------|-------|
| **Base Model** | google/embeddinggemma-300m |
| **Framework** | LEAF (Knowledge Distillation) |
| **Model Type** | Sentence Embeddings |
| **Dimensions** | 768 |
| **Max Tokens** | 512 (reduced from 2048 for efficiency) |
| **Quantization** | INT8 |
| **PyTorch Version** | 2.6+ |
| **Language** | English (base model supports 100+ languages) |
| **Training Dataset** | 50k samples (NLI, STS, MS MARCO) |

## Files

- `model_quantized.pt` (441MB) - INT8 quantized model for CPU inference
- `model_fp32.pt` (477MB) - FP32 full-precision version (optional)
- `tokenizer.json` (33MB) - Tokenizer vocabulary
- `config.json` - Model configuration
- `tokenizer_config.json` - Tokenizer settings

## Limitations

### Context Length

- **512 tokens maximum** (vs 2048 in base model)
- Longer texts will be truncated
- Consider chunking for documents longer than 512 tokens (see the sketch below)
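A minimal chunking sketch for long documents, assuming the tokenizer and model from the Basic Usage section above; the 64-token overlap and mean pooling of chunk embeddings are illustrative choices, not part of this model's documented pipeline.

```python
import numpy as np

def embed_long_text(model, tokenizer, text, max_tokens=500, overlap=64):
    """Split `text` into overlapping chunks of ~500 tokens (leaving headroom
    for special tokens) and average the chunk embeddings into one vector."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    stride = max_tokens - overlap
    chunks = [
        tokenizer.decode(ids[start:start + max_tokens])
        for start in range(0, max(len(ids), 1), stride)
    ]
    chunk_embs = np.asarray(model.encode(chunks, device="cpu", normalize=True))
    doc_emb = chunk_embs.mean(axis=0)
    return doc_emb / np.linalg.norm(doc_emb)  # re-normalize the pooled vector
```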
### Quality Trade-offs

- **Distillation**: Compression from 300M to ~75M parameters severely reduced quality (see MTEB results above)
- **Quantization**: INT8 quantization may introduce a small additional accuracy loss
- **Training Data**: 50k mostly English samples did not cover enough domains or languages (see Failure Analysis above)

### Language Support

- Primarily trained on **English** data
- The base model supports 100+ languages, but the distilled model performs poorly across all evaluated languages (see STS22 results above)

### Experimental Status

- **Not production-ready**: Failed quality evaluation
- **MTEB scores**: Evaluation completed; the results above show critical quality loss
- **Limited testing**: No evaluation on downstream tasks beyond STS

## ❌ DO NOT USE - Model Failed Quality Checks

**This model is NOT suitable for ANY production use cases.**

**❌ NOT suitable for**:

- ❌ **Semantic search** - Scores too low (0.22 Spearman)
- ❌ **Document similarity** - Does not capture semantic meaning
- ❌ **Text clustering** - Embeddings are not semantically meaningful
- ❌ **Information retrieval** - Poor correlation with human judgments
- ❌ **Duplicate detection** - Unreliable similarity scores
- ❌ **Any production deployment** - Quality insufficient
- ❌ **Multilingual tasks** - Cross-lingual capabilities destroyed
- ❌ **Mission-critical applications** - Do not use

**✅ Only suitable for**:

- ✅ **Research purposes** - Understanding failure modes in knowledge distillation
- ✅ **Baseline comparison** - For comparing with the improved v2 model
- ✅ **Educational purposes** - Learning what NOT to do in model compression

## Comparison with Base Model

| Metric | LEAF v1 (This Model) | EmbeddingGemma-300m | Quality Gap |
|--------|----------------------|---------------------|-------------|
| **Parameters** | ~75M | 300M | -75% |
| **Size (INT8/FP32)** | 441MB | ~600MB | -26% ✅ |
| **Speed (CPU)** | 695 texts/s | ~50-100 texts/s | +6-10x ✅ |
| **Context Length** | 512 | 2048 | -75% ❌ |
| **STSBenchmark** | 0.223 | 0.81 | **-72%** ❌ |
| **STS22 English** | 0.373 | 0.75 | **-50%** ❌ |
| **MTEB Score (est.)** | ~25 | 61.15 | **-59%** ❌ |
| **Latency** | ~1.4ms | ~10-20ms | -85% ✅ |

**Verdict**: The speed improvements do NOT justify the catastrophic quality loss. Use the base model instead.

## Future Work - Version 2 (In Development)

**Based on lessons learned from this failed v1 experiment, we are developing v2 with:**

### Architecture Improvements

- ✅ **12 layers** (vs 6 in v1) - 2x deeper for semantic preservation
- ✅ **120M parameters** (vs 75M) - Less aggressive compression (2.5x vs 4x)
- ✅ **2048-token context** (vs 512) - Full context length, matching the base model
- ✅ **Hidden size ratio 0.75** (vs 0.5) - Better capacity

### Training Improvements

- ✅ **200k samples** (vs 50k) - 4x more data
- ✅ **Multilingual balance** - 100+ languages with proper distribution
- ✅ **Curriculum learning** - Progressive 512→1024→2048 tokens
- ✅ **10 epochs** (vs 3) - More training time
- ✅ **Alignment loss priority** - Weight 2.5 (vs 1.0) + triplet loss

### Evaluation Improvements

- ✅ **Eval every 500 steps** - Early detection of quality issues
- ✅ **MTEB subset validation** - STSBenchmark during training
- ✅ **Alignment loss < 1.0 target** - Primary quality metric
- ✅ **Early stopping** - On alignment loss, not distillation loss

### Quality Targets (v2)

- 🎯 **STSBenchmark**: 0.70+ Spearman (vs 0.22 in v1)
- 🎯 **STS22 Average**: 0.50+ Spearman (vs 0.21 in v1)
- 🎯 **MTEB Score**: 55+ (vs ~25 estimated in v1)
- 🎯 **Cross-lingual**: 0.30+ (vs -0.14 in v1)

**Expected release**: After full training and validation (~12-15 hours on RTX 4050)

## Citation

```bibtex
@misc{leaf-embeddings-test,
  author = {TSS Deposium},
  title = {LEAF Embeddings INT8 - Distilled from EmbeddingGemma-300m},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/tss-deposium/gemma300-leaf-embeddings-test}},
  note = {Based on google/embeddinggemma-300m}
}

@misc{embeddinggemma,
  author = {Google},
  title = {EmbeddingGemma-300m},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/google/embeddinggemma-300m}}
}
```

## Acknowledgments

- **Base Model**: [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m)
- **Training Framework**: Custom LEAF implementation
- **Datasets**: Sentence Transformers, MS MARCO

## Contact

For questions or issues, please open an issue on the model repository.

---

**Disclaimer**: This is an experimental model kept for research purposes. It failed quality evaluation and must not be used in production.