SODA-VEC: Scientific Literature Embeddings with VICReg Loss

Model Description

SODA-VEC is a sentence transformer model designed for scientific literature, trained on a dataset of 26+ million title-abstract pairs from PubMed Central (PMC). The model uses a custom VICReg loss configuration optimized for learning robust representations of scientific text.

Key Features

  • Base Model: answerdotai/ModernBERT-base (768-dimensional embeddings)
  • Sequence Length: 512 tokens (optimized for memory efficiency)
  • Training Data: 26,209,161 title-abstract pairs from PMC
  • Loss Configuration: Custom VICReg loss with dot product, variance, and covariance regularization
  • Total Training: 550,000 steps (~3 epochs) with stable convergence

Model Performance

Training Results

The model achieved excellent performance with stable training throughout:

  • Final Training Loss: 0.8022
  • Final Evaluation Loss: 0.8184
  • Training Duration: 2.915 days
  • Total Steps: 500,000 (resumed from step 100,000)
  • Evaluation Dataset: 50% of test set (132,369 samples)

Training Curves

The model shows excellent learning dynamics:

  1. Initial Phase (0-100k steps): Rapid learning from ~0.95 to ~0.85-0.90
  2. Mid-Phase (100k-300k steps): Gradual improvement to ~0.82-0.85
  3. Final Phase (300k-500k steps): Stable convergence to ~0.80-0.82

The evaluation loss shows a clear downward trend from ~0.89 to 0.8184, closely tracking the training loss and indicating good generalization.

Training History

Phase 1: Initial Training (0-106k steps)

  • Configuration: dot_std_cov loss (dot_loss + std_loss + cov_loss)
  • Learning Rate: 2e-4
  • Batch Size: 32 per device, 4 GPUs
  • Result: Training became unstable, with the loss exploding above 250 and gradient norms going to NaN

Phase 2: Resumed Training (100k-550k steps)

  • Resume Point: Checkpoint-100000 (stable evaluation loss)
  • Configuration: Same dot_std_cov loss
  • Learning Rate: 8e-5 (reduced for stability)
  • Warmup Steps: 1,000
  • Max Grad Norm: 2.0 (increased for stability)
  • Evaluation: 50% of test dataset for efficiency
  • Result: Stable training with excellent convergence

Technical Specifications

Model Architecture

  • Base: ModernBERT-base (22 layers, 768 hidden size, 12 attention heads)
  • Pooling: Mean pooling
  • Embedding Dimension: 768
  • Max Sequence Length: 512 tokens
  • Vocabulary Size: 50,368 tokens
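
As an illustration of the mean-pooling step, the sketch below shows how sentence embeddings are typically derived from the token outputs of ModernBERT-base (a minimal sketch; function and variable names are illustrative and not taken from the model code):

import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (batch, seq_len, 768) ModernBERT token outputs
    # attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    mask = attention_mask.unsqueeze(-1).float()
    summed = (token_embeddings * mask).sum(dim=1)   # sum over non-padding tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)        # number of real tokens per example
    return summed / counts                          # (batch, 768) sentence embeddings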

Training Configuration

{
    "learning_rate": 8e-5,
    "warmup_steps": 1000,
    "max_grad_norm": 2.0,
    "weight_decay": 0.01,
    "batch_size": 32,
    "gradient_accumulation_steps": 1,
    "fp16": True,
    "lr_scheduler_type": "cosine",
    "max_steps": 500000,
    "eval_steps": 10000,
    "save_steps": 50000
}
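
These values correspond to standard Hugging Face training arguments. Below is a minimal sketch of how they might be passed via sentence-transformers' SentenceTransformerTrainingArguments; the output directory and the evaluation strategy are illustrative assumptions, not the exact training script:

from sentence_transformers import SentenceTransformerTrainingArguments

# Hyperparameters mirror the configuration above; output_dir is a placeholder.
args = SentenceTransformerTrainingArguments(
    output_dir="soda-vec-training",          # illustrative path
    learning_rate=8e-5,
    warmup_steps=1000,
    max_grad_norm=2.0,
    weight_decay=0.01,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    fp16=True,
    lr_scheduler_type="cosine",
    max_steps=500_000,
    eval_strategy="steps",                   # assumed; enables eval_steps below
    eval_steps=10_000,
    save_steps=50_000,
)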

Loss Function

The model uses a custom VICReg loss configuration:

loss_config = "dot_std_cov"
# Components:
# - dot_loss: Encourages high cosine similarity between paired embeddings
# - std_loss: Variance regularization to prevent dimensional collapse
# - cov_loss: Covariance regularization to decorrelate features
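
The exact implementation used in training is not reproduced here; the sketch below illustrates how these three components are typically computed in a VICReg-style objective (the component weights, the target standard deviation of 1.0, and the epsilon are illustrative assumptions):

import torch
import torch.nn.functional as F

def dot_std_cov_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                     std_weight: float = 1.0, cov_weight: float = 1.0) -> torch.Tensor:
    # z_a, z_b: paired title/abstract embeddings of shape (batch, dim)

    # dot_loss: pull paired embeddings together (maximise cosine similarity)
    dot_loss = 1.0 - F.cosine_similarity(z_a, z_b, dim=-1).mean()

    # std_loss: keep each dimension's std above 1 to prevent dimensional collapse
    def std_term(z):
        std = torch.sqrt(z.var(dim=0) + 1e-4)
        return torch.relu(1.0 - std).mean()
    std_loss = std_term(z_a) + std_term(z_b)

    # cov_loss: push off-diagonal covariance toward zero to decorrelate features
    def cov_term(z):
        n, d = z.shape
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return (off_diag ** 2).sum() / d
    cov_loss = cov_term(z_a) + cov_term(z_b)

    return dot_loss + std_weight * std_loss + cov_weight * cov_loss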

Usage

Basic Usage

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('EMBO/soda-vec-dot-std-cov-losses')

# Encode sentences
sentences = [
    "Machine learning approaches for drug discovery",
    "Deep learning methods in computational biology"
]
embeddings = model.encode(sentences)

# Compute similarity between the two embeddings
similarity = model.similarity(embeddings[0], embeddings[1])
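
Note that model.similarity operates on embeddings rather than raw strings, and by default it computes cosine similarity unless the model specifies a different similarity_fn_name.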

Advanced Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('EMBO/soda-vec-dot-std-cov-losses')

sentences = [
    "Machine learning approaches for drug discovery",
    "Deep learning methods in computational biology"
]

# Encode with custom parameters
embeddings = model.encode(
    sentences,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True
)

# Compute the full pairwise similarity matrix between the embeddings
similarities = model.similarity(embeddings, embeddings)

Training Data

Dataset

  • Source: EMBO/soda-vec-data-full_pmc_title_abstract
  • Size: 26,209,161 training pairs, 264,739 test pairs
  • Content: Title-abstract pairs from PubMed Central
  • Language: English
  • Domain: Scientific literature across all biomedical fields

Data Processing

  • Train/Test Split: 99% training, 1% testing
  • Evaluation Fraction: 50% of test set used for evaluation
  • Text Processing: Standard sentence transformer preprocessing
  • Tokenization: ModernBERT tokenizer with 512 token limit
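
Below is a minimal sketch of loading and splitting the data as described above (the split name, the split seed, and the assumption that the dataset exposes a single train split are illustrative and may differ from the actual dataset layout):

from datasets import load_dataset

# Load the PMC title-abstract pairs; the split name is an assumption.
dataset = load_dataset("EMBO/soda-vec-data-full_pmc_title_abstract", split="train")

# Recreate a 99% / 1% train/test split (seed is illustrative).
splits = dataset.train_test_split(test_size=0.01, seed=42)
train_ds, test_ds = splits["train"], splits["test"]

# Use 50% of the test split for evaluation, as during training.
eval_ds = test_ds.train_test_split(test_size=0.5, seed=42)["test"]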

Evaluation

The model was evaluated on a held-out test set of 132,369 samples (50% of total test set):

  • Final Evaluation Loss: 0.8184
  • Train/Eval Gap: Small (0.8022 training loss vs 0.8184 evaluation loss)
  • Generalization: Strong performance on unseen data
  • Stability: No overfitting observed

Model Files

The model includes all necessary components:

  • config.json: Model configuration
  • model.safetensors: Model weights (596MB)
  • tokenizer.json: Tokenizer configuration
  • tokenizer_config.json: Tokenizer settings
  • sentence_bert_config.json: Sentence transformer configuration
  • modules.json: Model architecture
  • special_tokens_map.json: Special tokens mapping
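
If you need the raw files rather than loading the model through sentence-transformers, they can be fetched with huggingface_hub (a minimal sketch; the local directory is illustrative):

from huggingface_hub import snapshot_download

# Download all model files listed above into a local directory.
local_path = snapshot_download(
    repo_id="EMBO/soda-vec-dot-std-cov-losses",
    local_dir="soda-vec-dot-std-cov-losses",  # illustrative path
)
print(local_path)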

Citation

If you use this model in your research, please cite:

@misc{soda-vec-dot-std-cov-losses,
  title={SODA-VEC: Scientific Literature Embeddings with VICReg Loss},
  author={EMBO},
  year={2024},
  url={https://huggingface.co/EMBO/soda-vec-dot-std-cov-losses}
}

License

This model is released under the MIT License. See the LICENSE file for details.

Acknowledgments

  • Base Model: Built on answerdotai/ModernBERT-base
  • Training Framework: Hugging Face Transformers and Sentence Transformers
  • Data Source: PubMed Central (PMC) via Hugging Face Datasets
  • Infrastructure: EMBO computational resources

Contact

For questions or issues, please contact the EMBO team or open an issue on the model repository.


By combining ModernBERT with a carefully tuned VICReg loss, this model provides robust, general-purpose embeddings for scientific literature.
