SentenceTransformer based on BAAI/bge-base-en-v1.5

This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5 on the json dataset (French question–passage pairs about Île-de-France regional aid schemes; see Training Details). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-base-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: sentence-transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
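
The modules above read as: encode with BERT (lowercased input, up to 512 tokens), take the [CLS] token embedding, then L2-normalize. As a sanity check, a minimal re-implementation with the plain transformers API might look like this (loading the checkpoint by the Hub id romain125/model is an assumption; a local path works too):

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint id; substitute a local path if needed.
tokenizer = AutoTokenizer.from_pretrained("romain125/model")
encoder = AutoModel.from_pretrained("romain125/model")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = encoder(**batch)
    cls = output.last_hidden_state[:, 0]   # pooling_mode_cls_token
    return F.normalize(cls, p=2, dim=1)    # the Normalize() module

print(embed(["The weather is lovely today."]).shape)  # torch.Size([1, 768])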

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("romain125/model")
# Run inference
sentences = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

Evaluation

Metrics

Semantic Similarity

Metric | Value
pearson_cosine | nan
spearman_cosine | nan

Both correlations are nan because every pair in the evaluation set carries the same gold label (1: 100.00%); with zero variance in the reference scores, Pearson and Spearman correlations are undefined.
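
For reference, this kind of evaluation is typically produced with the library's EmbeddingSimilarityEvaluator; a sketch (the pair lists are placeholders, and the gold scores would need at least two distinct values for the correlations to be defined):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("romain125/model")

# Placeholder pairs; a real run would use the 687 held-out pairs.
sentences1 = ["question 1", "question 2"]
sentences2 = ["passage 1", "passage 2"]
gold_scores = [1.0, 1.0]  # constant gold scores -> zero variance -> nan correlations

evaluator = EmbeddingSimilarityEvaluator(
    sentences1, sentences2, gold_scores, name="EmbeddingSimEval"
)
print(evaluator(model))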

Binary Classification

Metric | Value
cosine_accuracy | 0.8
cosine_accuracy_threshold | 0.6527
cosine_f1 | 0.8889
cosine_f1_threshold | 0.6527
cosine_precision | 1.0
cosine_recall | 0.8
cosine_ap | 1.0
cosine_mcc | 0.0

The same caveat applies here: because the evaluation pairs are all positive, precision and average precision are trivially 1.0 whatever the ranking, and MCC degenerates to 0; only accuracy and recall (the fraction of positive pairs scored above the threshold) are informative.
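
The decision rule behind these metrics is a plain cosine-similarity threshold: a pair is predicted positive when the similarity of its two embeddings is at least the tuned threshold (0.6527 here). A minimal sketch (the example sentences are illustrative):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("romain125/model")

def predict_pair(sentence1, sentence2, threshold=0.6527):
    # Encode both sentences and compare with cosine similarity.
    embeddings = model.encode([sentence1, sentence2])
    score = model.similarity(embeddings[0:1], embeddings[1:2]).item()
    return score >= threshold, score

is_match, score = predict_pair(
    "Quelle est la démarche à suivre ?",
    "Déposez votre dossier sur mesdemarches.iledefrance.fr",
)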

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 3,696 training samples
  • Columns: sentence1, sentence2, and label
  • Approximate statistics based on the first 1000 samples:

    Column | Type | Details
    sentence1 | string | min: 37 tokens, mean: 40.4 tokens, max: 44 tokens
    sentence2 | string | min: 49 tokens, mean: 62.2 tokens, max: 85 tokens
    label | int | 1: 100.00%
  • Samples:
    • sentence1: Quelle proportion des dépenses pour l'installation de bornes de recharge électrique peut être couverte par la région Île-de-France?
      sentence2: Nature de l'aide: La Région participera à hauteur de 50% de la dépense supportée par le maître d'ouvrage plafonnée en fonction du type de bornes
      label: 1
    • sentence1: Quels types de projets sont éligibles pour obtenir un financement de la Région Île-de-France dans le cadre du développement de l'électromobilité?
      sentence2: Type de projet: Le dispositif a pour objet le financement : Des études d'élaboration d'un document stratégique, De l'installation ou la mise à niveau des IRVE situées sur le domaine public francilien, respectant les critères du label régional et s'inscrivant dans un plan d'actions
      label: 1
    • sentence1: Quelle est la démarche à suivre pour déposer une demande de subvention concernant l'électromobilité en Île-de-France?
      sentence2: Procédures et démarches: Déposez sur mesdemarches.iledefrance.fr votre dossier de demande de subvention présentant le projet de manière précise et comportant toutes les pièces permettant l'instruction du dossier, réputé complet, par les services de la Région
      label: 1
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
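
For each anchor, MultipleNegativesRankingLoss treats the positives of the other pairs in the batch as negatives, so with per_device_train_batch_size: 2 (see Training Hyperparameters below) every anchor is contrasted against exactly one in-batch negative. A minimal sketch of wiring this loss up in Sentence Transformers 3.x (the data_files path is hypothetical, since the underlying json files are not published):

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.util import cos_sim

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Hypothetical path; (sentence1, sentence2) serve as (anchor, positive) pairs.
train_dataset = load_dataset("json", data_files="train.json", split="train")
# The constant label column is not needed by this loss.
train_dataset = train_dataset.remove_columns("label")

loss = MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=cos_sim)
trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()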
    

Evaluation Dataset

json

  • Dataset: json
  • Size: 687 evaluation samples
  • Columns: sentence1, sentence2, and label
  • Approximate statistics based on the first 687 samples:

    Column | Type | Details
    sentence1 | string | min: 24 tokens, mean: 33.6 tokens, max: 42 tokens
    sentence2 | string | min: 37 tokens, mean: 90.4 tokens, max: 257 tokens
    label | int | 1: 100.00%
  • Samples:
    • sentence1: Sous quelles conditions mon centre de formation en apprentissage peut-il être éligible à une subvention pour des investissements?
      sentence2: Le dispositif est accessible à tous les OFA sous réserve de remplir les 5 conditions suivantes : Dispenser une activité apprentissage ayant obtenu une certification, Dispenser des formations en apprentissage sur le territoire francilien depuis au moins 1 an en qualité de CFA, d'OFA ou d'UFA, Présenter un projet d'investissement prévu pour la dispense de formations en apprentissage sur le territoire francilien, Être propriétaire du bien pour lequel une subvention est sollicitée ou titulaire d'un bail récemment renouvelé (ou engagement du propriétaire à renouveler le bail), en propre ou sous la forme de SCI, et assurant la maîtrise d'ouvrage des travaux d'investissement, Présenter un besoin de financement sur le projet d'investissement ne pouvant être pris en charge au titre des fonds propres de la structure et de tiers financeurs
      label: 1
    • sentence1: Est-ce que ma structure qui dispense des formations en apprentissage doit avoir une certaine ancienneté pour bénéficier de l'aide régionale?
      sentence2: Dispenser des formations en apprentissage sur le territoire francilien depuis au moins 1 an en qualité de CFA, d'OFA ou d'UFA
      label: 1
    • sentence1: Comment dois-je procéder pour soumettre ma demande de soutien à l'investissement pour mon organisme de formation?
      sentence2: L'organisme doit déposer sa demande et les pièces justificatives via le portail mesdemarches.iledefrance.fr
      label: 1
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 2
  • num_train_epochs: 2
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
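
These overrides map one-to-one onto SentenceTransformerTrainingArguments; a sketch of reproducing the configuration (output_dir is a placeholder):

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # placeholder
    eval_strategy="epoch",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    tf32=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoids duplicate in-batch negatives
)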

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 2
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: True
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch | Step | Validation Loss | EmbeddingSimEval_spearman_cosine | BinaryClassifEval_cosine_ap
1.0 | 3 | 0.2267 | nan | 1.0
2.0 | 6 | 0.2448 | nan | 1.0

Framework Versions

  • Python: 3.11.9
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.3
  • PyTorch: 2.3.0
  • Accelerate: 1.1.0
  • Datasets: 3.3.2
  • Tokenizers: 0.21.0
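
To reproduce this environment, the versions above can be pinned at install time (a sketch; pick the PyTorch build matching your CUDA setup):

pip install sentence-transformers==3.4.1 transformers==4.48.3 torch==2.3.0 accelerate==1.1.0 datasets==3.3.2 tokenizers==0.21.0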

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}