SentenceTransformer based on thenlper/gte-base

This is a sentence-transformers model finetuned from thenlper/gte-base. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: thenlper/gte-base
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
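
In other words, module (0) produces per-token embeddings, module (1) mean-pools them over the attention mask, and module (2) L2-normalizes the pooled vector. A minimal sketch of the same computation with the raw transformers API (assumptions: the Hub repository loads directly via AutoModel, and the input sentence is illustrative):

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

repo = "bisectgroup/BiCA-base"  # assumption: weights load directly as a BertModel
tokenizer = AutoTokenizer.from_pretrained(repo)
bert = AutoModel.from_pretrained(repo)

batch = tokenizer(["example sentence"], padding=True, truncation=True,
                  max_length=512, return_tensors="pt")
with torch.no_grad():
    token_embeddings = bert(**batch).last_hidden_state  # (0) Transformer

mask = batch["attention_mask"].unsqueeze(-1).float()
embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)  # (1) mean Pooling
embedding = F.normalize(embedding, p=2, dim=1)                      # (2) Normalize

Because of the final Normalize module, dot product and cosine similarity coincide for these embeddings.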

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("bisectgroup/BiCA-base")
# Run inference
sentences = [
    'Phylogenetic analysis of mitochondrial genes in Macquarie perch from three river basins',
    'Genetic variation in mitochondrial genes could underlie metabolic adaptations because mitochondrially encoded proteins are directly involved in a pathway supplying energy to metabolism. Macquarie perch from river basins exposed to different climates differ in size and growth rate, suggesting potential presence of adaptive metabolic differences. We used complete mitochondrial genome sequences to build a phylogeny, estimate lineage divergence times and identify signatures of purifying and positive selection acting on mitochondrial genes for 25 Macquarie perch from three basins: Murray-Darling Basin (MDB), Hawkesbury-Nepean Basin (HNB) and Shoalhaven Basin (SB). Phylogenetic analysis resolved basin-level clades, supporting incipient speciation previously inferred from differentiation in allozymes, microsatellites and mitochondrial control region. The estimated time of lineage divergence suggested an early- to mid-Pleistocene split between SB and the common ancestor of HNB+MDB, followed by mid-to-late Pleistocene splitting between HNB and MDB. These divergence estimates are more recent than previous ones. Our analyses suggested that evolutionary drivers differed between inland MDB and coastal HNB. In the cooler and more climatically variable MDB, mitogenomes evolved under strong purifying selection, whereas in the warmer and more climatically stable HNB, purifying selection was relaxed. Evidence for relaxed selection in the HNB includes elevated transfer RNA and 16S ribosomal RNA polymorphism, presence of potentially mildly deleterious mutations and a codon (ATP6',
    'An improved Bayesian method is presented for estimating phylogenetic trees using DNA sequence data. The birth-death process with species sampling is used to specify the prior distribution of phylogenies and ancestral speciation times, and the posterior probabilities of phylogenies are used to estimate the maximum posterior probability (MAP) tree. Monte Carlo integration is used to integrate over the ancestral speciation times for particular trees. A Markov Chain Monte Carlo method is used to generate the set of trees with the highest posterior probabilities. Methods are described for an empirical Bayesian analysis, in which estimates of the speciation and extinction rates are used in calculating the posterior probabilities, and a hierarchical Bayesian analysis, in which these parameters are removed from the model by an additional integration. The Markov Chain Monte Carlo method avoids the requirement of our earlier method for calculating MAP trees to sum over all possible topologies (which limited the number of taxa in an analysis to about five). The methods are applied to analyze DNA sequences for nine species of primates, and the MAP tree, which is identical to a maximum-likelihood estimate of topology, has a probability of approximately 95%.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.9449, 0.8056],
#         [0.9449, 1.0000, 0.7868],
#         [0.8056, 0.7868, 1.0000]])
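
Since the similarity function is cosine similarity over normalized embeddings, the same model can rank a corpus against a query for semantic search. A hedged sketch (query and corpus strings are illustrative, not from the training data):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bisectgroup/BiCA-base")
query = "Bayesian estimation of phylogenetic trees"   # illustrative query
corpus = [                                            # illustrative documents
    "Phylogenetic analysis of mitochondrial genes in Macquarie perch",
    "An improved Bayesian method for estimating phylogenetic trees",
]
scores = model.similarity(model.encode([query]), model.encode(corpus))  # shape [1, 2]
best = int(scores.argmax())
print(corpus[best], float(scores[0, best]))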

Training Details

Training Dataset

Unnamed Dataset

  • Size: 95,253 training samples
  • Columns: sentence_0, sentence_1, and sentence_2
  • Approximate statistics based on the first 1000 samples:
                  sentence_0      sentence_1      sentence_2
    type          string          string          string
    min tokens    6               3               51
    mean tokens   19.51           223.97          309.24
    max tokens    56              512             512
  • Samples:
    Sample 1
      • sentence_0: Sox5 modulates the activity of Sox10 in the melanocyte lineage
      • sentence_1: The transcription factor Sox5 has previously been shown in chicken to be expressed in early neural crest cells and neural crest-derived peripheral glia. Here, we show in mouse that Sox5 expression also continues after neural crest specification in the melanocyte lineage. Despite its continued expression, Sox5 has little impact on melanocyte development on its own as generation of melanoblasts and melanocytes is unaltered in Sox5-deficient mice. Loss of Sox5, however, partially rescued the strongly reduced melanoblast generation and marker gene expression in Sox10 heterozygous mice arguing that Sox5 functions in the melanocyte lineage by modulating Sox10 activity. This modulatory activity involved Sox5 binding and recruitment of CtBP2 and HDAC1 to the regulatory regions of melanocytic Sox10 target genes and direct inhibition of Sox10-dependent promoter activation. Both binding site competition and recruitment of corepressors thus help Sox5 to modulate the activity of Sox10 in the melano...
      • sentence_2: Transcripts for a new form of Sox5, called L-Sox5, and Sox6 are coexpressed with Sox9 in all chondrogenic sites of mouse embryos. A coiled-coil domain located in the N-terminal part of L-Sox5, and absent in Sox5, showed >90% identity with a similar domain in Sox6 and mediated homodimerization and heterodimerization with Sox6. Dimerization of L-Sox5/Sox6 greatly increased efficiency of binding of the two Sox proteins to DNA containing adjacent HMG sites. L-Sox5, Sox6 and Sox9 cooperatively activated expression of the chondrocyte differentiation marker Col2a1 in 10T1/2 and MC615 cells. A 48 bp chondrocyte-specific enhancer in this gene, which contains several HMG-like sites that are necessary for enhancer activity, bound the three Sox proteins and was cooperatively activated by the three Sox proteins in non-chondrogenic cells. Our data suggest that L-Sox5/Sox6 and Sox9, which belong to two different classes of Sox transcription factors, cooperate with each other in expression of Col2a1 a...
    Sample 2
      • sentence_0: are asgard archaea related to eukaryotes
      • sentence_1: Asgard archaea are considered to be the closest known relatives of eukaryotes. Their genomes contain hundreds of eukaryotic signature proteins (ESPs), which inspired hypotheses on the evolution of the eukaryotic cell
      • sentence_2: Eukaryotes evolved from a symbiosis involving alphaproteobacteria and archaea phylogenetically nested within the Asgard clade. Two recent studies explore the metabolic capabilities of Asgard lineages, supporting refined symbiotic metabolic interactions that might have operated at the dawn of eukaryogenesis.
    Sample 3
      • sentence_0: Fanconi Anemia in Pediatric Medulloblastoma and Fanconi Anemia
      • sentence_1: The outcome of children with medulloblastoma (MB) and Fanconi Anemia (FA), an inherited DNA repair deficiency, has not been described systematically. Treatment is complicated by high vulnerability to treatment-associated side effects, yet structured data are lacking. This study aims to give a comprehensive overview of clinical and molecular characteristics of pediatric FA MB patients.
      • sentence_2: The Sonic Hedgehog (SHH) signaling pathway is indispensable for development, and functions to activate a transcriptional program modulated by the GLI transcription factors. Here, we report that loss of a regulator of the SHH pathway, Suppressor of Fused (Sufu), resulted in early embryonic lethality in the mouse similar to inactivation of another SHH regulator, Patched1 (Ptch1). In contrast to Ptch1+/- mice, Sufu+/- mice were not tumor prone. However, in conjunction with p53 loss, Sufu+/- animals developed tumors including medulloblastoma and rhabdomyosarcoma. Tumors present in Sufu+/-p53-/- animals resulted from Sufu loss of heterozygosity. Sufu+/-p53-/- medulloblastomas also expressed a signature gene expression profile typical of aberrant SHH signaling, including upregulation of N-myc, Sfrp1, Ptch2 and cyclin D1. Finally, the Smoothened inhibitor, hedgehog antagonist, did not block growth of tumors arising from Sufu inactivation. These data demonstrate that Sufu is essential for deve...
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
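
With this column layout, MultipleNegativesRankingLoss treats each sentence_1 as the positive for its sentence_0 anchor and uses sentence_2, together with every other passage in the batch, as negatives, so larger batches yield more negatives per anchor. A minimal sketch of wiring such a triplet dataset to the loss (the row contents are placeholders, not actual training samples):

from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("thenlper/gte-base")
train_dataset = Dataset.from_dict({
    "sentence_0": ["a paper title or query"],    # anchor
    "sentence_1": ["the matching abstract"],     # positive
    "sentence_2": ["a hard-negative abstract"],  # additional in-batch negative
})
loss = MultipleNegativesRankingLoss(model, scale=20.0)  # cos_sim is the default similarity_fct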
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • num_train_epochs: 1
  • max_steps: 20
  • multi_dataset_batch_sampler: round_robin
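
These values map directly onto SentenceTransformerTrainingArguments; a hedged sketch (output_dir is a placeholder):

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import MultiDatasetBatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="outputs",            # placeholder path
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    max_steps=20,                    # overrides num_train_epochs when set
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)

Passing args, the dataset, and the loss sketched above to a SentenceTransformerTrainer and calling trainer.train() reproduces this configuration.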

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 1
  • max_steps: 20
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
  • router_mapping: {}
  • learning_rate_mapping: {}

Framework Versions

  • Python: 3.10.14
  • Sentence Transformers: 5.0.0
  • Transformers: 4.52.4
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.6.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

BiCA

If our work was helpful, consider citing us ☺️

@misc{sinha2025bicaeffectivebiomedicaldense,
      title={BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives}, 
      author={Aarush Sinha and Pavan Kumar S and Roshan Balaji and Nirav Pravinbhai Bhatt},
      year={2025},
      eprint={2511.08029},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2511.08029}, 
}