SentenceTransformer based on intfloat/multilingual-e5-large-instruct

This is a sentence-transformers model fine-tuned from intfloat/multilingual-e5-large-instruct on the embeddings-train-semantic dataset. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: intfloat/multilingual-e5-large-instruct
Maximum Sequence Length: 512 tokens
Output Dimensionality: 1024 dimensions
Similarity Function: Cosine Similarity
Training Dataset:
- embeddings-train-semantic

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
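The Pooling and Normalize stages listed above can be illustrated with a small, dependency-free sketch. It assumes mean pooling over non-padding tokens (pooling_mode_mean_tokens: True) followed by L2 normalization; the helper names, the toy 2-dimensional vectors, and the attention mask are hypothetical and stand in for the 1024-dimensional token outputs of the Transformer stage. This is not the library's implementation.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped).

    Assumes at least one token is unmasked.
    """
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length, as a normalization stage would."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy input: 3 token vectors of dimension 2; the last token is padding.
tokens = [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)            # [2.0, 4.0]
sentence_embedding = l2_normalize(pooled)   # unit-length vector
```

Because the final vectors have unit length, the cosine similarity of two embeddings equals their dot product.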

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Lauther/emb-multilingual-e5-large-instruct-3e")
# Run inference
sentences = [
    'What columns store the uncertainty values?',
    'How are flow computers and measurement systems related?\nFlow computers can have multiple systems assigned to them. However, a measurement system can only be assigned to one flow computer.\n\nDatabase terminology:\nIn the database, this relationship is referred to as:\n- Meter streams\n- Meter runs\n- Sections\n\nStorage of the relationship:\nThe relationship between a flow computer and its assigned measurement system is stored in a special table.\n\nUser context:\nWhen a user refers to a "meter stream," they are indicating that they are searching for a measurement system assigned to a specific flow computer.',
    'What is uncertainty?\nUncertainty is a measure of confidence in the precision and reliability of results obtained from equipment or measurement systems. It quantifies the potential error or margin of error in measurements.\n\nTypes of uncertainty:\nThere are two main types of uncertainty:\n1. Uncertainty of magnitudes (variables):\n    - Refers to the uncertainty of specific variables, such as temperature or pressure.\n    - It is calculated after calibrating a device or obtained from the equipment manufacturer\'s manual.\n    - This uncertainty serves as a starting point for further calculations related to the equipment.\n\n2. Uncertainty of the measurement system:\n    - Refers to the uncertainty calculated for the overall flow measurement.\n    - It depends on the uncertainties of the individual variables (magnitudes) and represents the combined margin of error for the entire system.\n\nKey points:\n- The uncertainties of magnitudes (variables) are the foundation for calculating the uncertainty of the measurement system. Think of them as the "building blocks."\n- Do not confuse the two types of uncertainty:\n    - **Uncertainty of magnitudes/variables**: Specific to individual variables (e.g., temperature, pressure).\n    - **Uncertainty of the measurement system**: Specific to the overall flow measurement.\n\nDatabase storage for uncertainties:\nIn the database, uncertainty calculations are stored in two separate tables:\n1. Uncertainty of magnitudes (variables):\n    - Stores the uncertainty values for specific variables (e.g., temperature, pressure).\n\n2. Uncertainty of the measurement system:\n    - Stores the uncertainty values for the overall flow measurement system.\n\nHow to retrieve uncertainty data:\n- To find the uncertainty of the measurement system, join the measurement systems table with the uncertainty of the measurement system table.\n- To find the uncertainty of a specific variable (magnitude), join the measurement systems table with the uncertainty of magnitudes (variables) table.\n\nImportant note:\nDo not confuse the two types of uncertainty:\n- If the user requests the uncertainty of the measurement system, use the first join (measurement systems table + uncertainty of the measurement system table).\n- If the user requests the uncertainty of a specific variable (magnitude) in a report, use the second join (measurement systems table + uncertainty of magnitudes table).',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
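For semantic search, the similarity scores are typically used to rank candidate passages against a query. The following is a minimal sketch of that ranking step in plain Python. The 3-dimensional vectors are toy stand-ins for the 1024-dimensional embeddings returned by model.encode, and the function names are hypothetical rather than part of the Sentence Transformers API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumed values, for illustration only).
query = [1.0, 0.0, 0.0]
documents = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # in between
]

ranking = rank_documents(query, documents)
best_index, best_score = ranking[0]
```

With real embeddings, the first sentence in the example above would play the role of the query and the two longer passages the role of the documents.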

Training Details

Training Dataset

embeddings-train-semantic

Dataset: embeddings-train-semantic at ce90f53
Size: 5,220 training samples
Columns: sentence1, sentence2, and score
Approximate statistics based on the first 1000 samples:

	sentence1	sentence2	score
type	string	string	float
details	min: 8 tokens mean: 18.3 tokens max: 102 tokens	min: 120 tokens mean: 257.3 tokens max: 512 tokens	min: 0.0 mean: 0.23 max: 1.0

Samples:

sentence1	sentence2	score
`What is the data type of differential pressure in the measurement system?`	What is uncertainty? Uncertainty is a measure of confidence in the precision and reliability of results obtained from equipment or measurement systems. It quantifies the potential error or margin of error in measurements. Types of uncertainty: There are two main types of uncertainty: 1. Uncertainty of magnitudes (variables): - Refers to the uncertainty of specific variables, such as temperature or pressure. - It is calculated after calibrating a device or obtained from the equipment manufacturer's manual. - This uncertainty serves as a starting point for further calculations related to the equipment. 2. Uncertainty of the measurement system: - Refers to the uncertainty calculated for the overall flow measurement. - It depends on the uncertainties of the individual variables (magnitudes) and represents the combined margin of error for the entire system. Key points: - The uncertainties of magnitudes (variables) are the foundation for calculating the uncertainty of ...	`0.15000000000000002`
`What is the structure of the &&&equipment_data&&& table?`	How are flow computers and measurement systems related? Flow computers can have multiple systems assigned to them. However, a measurement system can only be assigned to one flow computer. Database terminology: In the database, this relationship is referred to as: - Meter streams - Meter runs - Sections Storage of the relationship: The relationship between a flow computer and its assigned measurement system is stored in a special table. User context: When a user refers to a "meter stream," they are indicating that they are searching for a measurement system assigned to a specific flow computer.	`0.35000000000000003`
`Find the columns in the flow computer table that identify the flow computer.`	What kind of data store an equipment? Equipments can capture meteorological data, such as pressure, temperature, and volume (magnitudes). This data is essential for users to perform various calculations. Data storage: - The measured values are stored in a special table in the database for magnitudes. This table contains the values of the variables captured by the equipments. - These values are direct measurements from the fluid (e.g., raw pressure, temperature, or volume readings). They are not calculated values, such as uncertainty. - The values stored in the variable values table are different from variable uncertainty values, which are calculated separately and represent the margin of error. Accessing the data: - Users typically access the data by referring to the readings from the measurement system, not directly from the individual equipments. - The readings are stored in a "variable values" table within the database. Linking variable names: If the user needs to kno...	`0.1`

Loss: CosineSimilarityLoss with these parameters:

{
    "loss_fct": "torch.nn.modules.loss.MSELoss"
}
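As a rough illustration of what this loss configuration computes: the cosine similarity of each (sentence1, sentence2) embedding pair is compared with the gold score using mean squared error. The sketch below is plain Python under that assumption; the function names and the 2-dimensional toy embeddings are hypothetical, and it is not the library's code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_mse_loss(embedding_pairs, gold_scores):
    """Mean squared error between predicted cosine similarities and gold scores."""
    predictions = [cosine_similarity(u, v) for u, v in embedding_pairs]
    squared_errors = [(p - s) ** 2 for p, s in zip(predictions, gold_scores)]
    return sum(squared_errors) / len(squared_errors)

# Toy batch: one identical pair (cosine 1.0) and one orthogonal pair (cosine 0.0).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [0.0, 1.0]),
]

perfect = cosine_mse_loss(pairs, [1.0, 0.0])  # predictions match the labels
off = cosine_mse_loss(pairs, [0.5, 0.5])      # each prediction is off by 0.5
```

The loss is zero only when every predicted cosine similarity equals its label, which is why the score column (0.0 to 1.0 in this dataset) acts as a direct target for the similarity.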

Evaluation Dataset

embeddings-train-semantic

Dataset: embeddings-train-semantic at ce90f53
Size: 652 evaluation samples
Columns: sentence1, sentence2, and score
Approximate statistics based on the first 652 samples:

	sentence1	sentence2	score
type	string	string	float
details	min: 8 tokens mean: 17.8 tokens max: 102 tokens	min: 120 tokens mean: 253.84 tokens max: 512 tokens	min: 0.0 mean: 0.24 max: 0.9

Samples:

sentence1	sentence2	score
`How can I filter uncertainty reports by equipment tag?`	How does a flow computer generate and store reports? A flow computer generates daily or hourly reports to provide users with operational data. These reports are stored in the flow computer's memory in an organized format. Report structure: - Each report includes: - Date and time of the data recording. - Data recorded from flow computers. Data storage in tables: The reports are saved in two tables: 1. Main table (Index): - Stores the date, time, and flow computer identifier. 2. Detail table: - Stores the measured values associated with the report. Connection to the Modbus table: The flow computer's reports are linked to a Modbus table. This table contains the names corresponding to each value in the reports, making it easier to interpret the data.	`0.09999999999999999`
`What is the purpose of the flow_data table?`	What is uncertainty? Uncertainty is a measure of confidence in the precision and reliability of results obtained from equipment or measurement systems. It quantifies the potential error or margin of error in measurements. Types of uncertainty: There are two main types of uncertainty: 1. Uncertainty of magnitudes (variables): - Refers to the uncertainty of specific variables, such as temperature or pressure. - It is calculated after calibrating a device or obtained from the equipment manufacturer's manual. - This uncertainty serves as a starting point for further calculations related to the equipment. 2. Uncertainty of the measurement system: - Refers to the uncertainty calculated for the overall flow measurement. - It depends on the uncertainties of the individual variables (magnitudes) and represents the combined margin of error for the entire system. Key points: - The uncertainties of magnitudes (variables) are the foundation for calculating the uncertainty of ...	`0.15000000000000002`
`What is the column name for the report date in the Reports table?`	What is equipment calibration? Calibration is a metrological verification process used to ensure the accuracy of measurement equipment. It is performed periodically, based on intervals set by the company or a regulatory body. Purpose of calibration: The calibration process corrects any deviations in how the equipment measures physical magnitudes (variables). This ensures the equipment provides accurate and reliable data. Calibration cycles: There are two main calibration cycles: 1. As-found: Represents the equipment's measurement accuracy before any adjustments are made. This cycle is almost always implemented. 2. As-left: Represents the equipment's measurement accuracy after adjustments are made. This cycle is used depending on regulatory requirements. Calibration uncertainty: - Uncertainty is included in the results of a calibration. - Calibration uncertainty refers to the margin of error in the device's measurements, which also affects the uncertainty of the measured variable or ...	`0.1`

Loss: CosineSimilarityLoss with these parameters:

{
    "loss_fct": "torch.nn.modules.loss.MSELoss"
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 4
per_device_eval_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 2e-05
warmup_ratio: 0.1
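The batch settings above interact: with gradient accumulation, each optimizer step consumes several micro-batches. The arithmetic below is a sketch that assumes training on a single device (the card does not state the device count) and uses the 5,220 training samples reported earlier. Under that assumption it reproduces the epoch value 0.0307 shown at step 10 in the training logs.

```python
import math

train_samples = 5220
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1  # assumption: not stated in the card

# Number of samples that contribute to one optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Micro-batches the dataloader yields per epoch (dataloader_drop_last is False).
micro_batches_per_epoch = math.ceil(
    train_samples / (per_device_train_batch_size * num_devices)
)

def epoch_at_step(optimizer_step):
    """Fraction of the first epoch consumed after a number of optimizer steps."""
    return optimizer_step * gradient_accumulation_steps / micro_batches_per_epoch

print(effective_batch_size)         # 16
print(micro_batches_per_epoch)      # 1305
print(round(epoch_at_step(10), 4))  # 0.0307
```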

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 4
per_device_eval_batch_size: 4
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 4
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 3
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: proportional
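The combination of learning_rate: 2e-05, lr_scheduler_type: linear, and warmup_ratio: 0.1 describes a rate that rises linearly from zero during warmup and then decays linearly to zero. The sketch below shows that shape in plain Python. The function name is hypothetical, and total_steps = 1000 is a round illustrative value, not the step count of this training run.

```python
import math

def linear_lr(step, total_steps, base_lr=2e-05, warmup_ratio=0.1):
    """Learning rate at a given optimizer step for linear warmup then linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: ramp from base_lr down to 0 at total_steps.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

total_steps = 1000  # hypothetical value for illustration

lr_mid_warmup = linear_lr(50, total_steps)    # halfway through warmup
lr_peak = linear_lr(100, total_steps)         # warmup finished, peak rate
lr_mid_decay = linear_lr(550, total_steps)    # halfway through decay
lr_end = linear_lr(1000, total_steps)         # end of training
```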

Training Logs

Epoch	Step	Training Loss	Validation Loss
0.0307	10	1.5374	-
0.0613	20	1.0251	-
0.0920	30	0.361	-
0.1226	40	0.1819	-
0.1533	50	0.186	-
0.1839	60	0.1697	-
0.2146	70	0.1437	-
0.2452	80	0.172	-
0.2759	90	0.1199	-
0.3065	100	0.1278	-
0.3372	110	0.1037	-
0.3678	120	0.1156	-
0.3985	130	0.0971	-
0.4291	140	0.0911	-
0.4598	150	0.1158	0.0249
0.4904	160	0.0906	-
0.5211	170	0.106	-
0.5517	180	0.0921	-
0.5824	190	0.0748	-
0.6130	200	0.0741	-
0.6437	210	0.0894	-
0.6743	220	0.0815	-
0.7050	230	0.0771	-
0.7356	240	0.1156	-
0.7663	250	0.0857	-
0.7969	260	0.0566	-
0.8276	270	0.0716	-
0.8582	280	0.0662	-
0.8889	290	0.0963	-
0.9195	300	0.0678	0.0212
0.9502	310	0.077	-
0.9808	320	0.0642	-
1.0092	330	0.0725	-
1.0398	340	0.0701	-
1.0705	350	0.0549	-
1.1011	360	0.0699	-
1.1318	370	0.0714	-
1.1625	380	0.0745	-
1.1931	390	0.0754	-
1.2238	400	0.0486	-
1.2544	410	0.047	-
1.2851	420	0.076	-
1.3157	430	0.0689	-
1.3464	440	0.0629	-
1.3770	450	0.0657	0.0178
1.4077	460	0.0622	-
1.4383	470	0.0657	-
1.4690	480	0.0498	-
1.4996	490	0.0653	-
1.5303	500	0.0715	-
1.5609	510	0.0615	-
1.5916	520	0.0441	-
1.6222	530	0.0566	-
1.6529	540	0.0524	-
1.6835	550	0.0423	-
1.7142	560	0.0441	-
1.7448	570	0.0553	-
1.7755	580	0.0572	-
1.8061	590	0.0686	-
1.8368	600	0.06	0.0146
1.8674	610	0.0562	-
1.8981	620	0.0517	-
1.9287	630	0.0498	-
1.9594	640	0.0424	-
1.9900	650	0.0729	-
2.0184	660	0.0347	-
2.0490	670	0.06	-
2.0797	680	0.0441	-
2.1103	690	0.0409	-
2.1410	700	0.0416	-
2.1716	710	0.0345	-
2.2023	720	0.024	-
2.2330	730	0.0458	-
2.2636	740	0.0465	-
2.2943	750	0.0494	0.0132
2.3249	760	0.0388	-
2.3556	770	0.0363	-
2.3862	780	0.0441	-
2.4169	790	0.0378	-
2.4475	800	0.0484	-
2.4782	810	0.051	-
2.5088	820	0.0464	-
2.5395	830	0.036	-
2.5701	840	0.0423	-
2.6008	850	0.0278	-
2.6314	860	0.0474	-
2.6621	870	0.0357	-
2.6927	880	0.0386	-
2.7234	890	0.0334	-
2.7540	900	0.0199	0.0127
2.7847	910	0.0381	-
2.8153	920	0.0415	-
2.8460	930	0.0274	-
2.8766	940	0.0353	-
2.9073	950	0.0423	-
2.9379	960	0.0267	-
2.9686	970	0.042	-

Framework Versions

Python: 3.11.0
Sentence Transformers: 3.4.0
Transformers: 4.48.1
PyTorch: 2.5.1+cu124
Accelerate: 1.3.0
Datasets: 3.2.0
Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}