dleemiller committed on
Commit
e5236b2
·
verified ·
1 Parent(s): 4b10669

Update README.md

Files changed (1)
  1. README.md +111 -306
README.md CHANGED
@@ -1,356 +1,161 @@
1
  ---
2
  tags:
3
- - sentence-transformers
4
  - cross-encoder
5
- - reranker
6
- - generated_from_trainer
7
- - dataset_size:5749
8
- - loss:BinaryCrossEntropyLoss
9
- pipeline_tag: text-ranking
10
- library_name: sentence-transformers
11
- metrics:
12
- - pearson
13
- - spearman
14
  model-index:
15
- - name: CrossEncoder
16
  results:
17
  - task:
18
- type: cross-encoder-correlation
19
- name: Cross Encoder Correlation
20
  dataset:
21
- name: sts validation
22
- type: sts-validation
23
  metrics:
24
- - type: pearson
25
  value: 0.8763053568934394
26
- name: Pearson
27
- - type: spearman
28
  value: 0.8688596158541986
29
- name: Spearman
30
  ---
31
 
32
- # CrossEncoder
33
-
34
- This is a [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) model trained using the [sentence-transformers](https://www.SBERT.net) library. It computes scores for pairs of texts, which can be used for text reranking and semantic search.
35
-
36
- ## Model Details
37
 
38
- ### Model Description
39
- - **Model Type:** Cross Encoder
40
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
41
- - **Maximum Sequence Length:** 512 tokens
42
- - **Number of Output Labels:** 1 label
43
- <!-- - **Training Dataset:** Unknown -->
44
- <!-- - **Language:** Unknown -->
45
- <!-- - **License:** Unknown -->
46
 
47
- ### Model Sources
 
 
48
 
49
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
50
- - **Documentation:** [Cross Encoder Documentation](https://www.sbert.net/docs/cross_encoder/usage/usage.html)
51
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
52
- - **Hugging Face:** [Cross Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=cross-encoder)
53
-
54
- ## Usage
55
 
56
- ### Direct Usage (Sentence Transformers)
 
 
 
 
57
 
58
- First install the Sentence Transformers library:
59
 
60
- ```bash
61
- pip install -U sentence-transformers
62
- ```
63
 
64
- Then you can load this model and run inference.
65
- ```python
66
- from sentence_transformers import CrossEncoder
 
 
 
 
 
67
 
68
- # Download from the 🤗 Hub
69
- model = CrossEncoder("cross_encoder_model_id")
70
- # Get scores for pairs of texts
71
- pairs = [
72
- ['The little boy is singing and playing the guitar.', 'A baby is playing a guitar.'],
73
- ['executive director of the arms control association in washington daryl kimball stated that-- the iaea report is 1 in a series of bad signs. ', 'executive director of the arms control association in washington daryl kimball stated the israeli document could affect the debate over india.'],
74
- ['it did not say if the men had been hanged in prison. ', 'dozens of such criminals have been hanged in public.'],
75
- ['Child sliding in the snow.', 'Man sleeping on the street.'],
76
- ["Your confusion doesn't make me a liar.", "Then your confusion doesn't make me a liar either."],
77
- ]
78
- scores = model.predict(pairs)
79
- print(scores.shape)
80
- # (5,)
81
-
82
- # Or rank different texts based on similarity to a single text
83
- ranks = model.rank(
84
- 'The little boy is singing and playing the guitar.',
85
- [
86
- 'A baby is playing a guitar.',
87
- 'executive director of the arms control association in washington daryl kimball stated the israeli document could affect the debate over india.',
88
- 'dozens of such criminals have been hanged in public.',
89
- 'Man sleeping on the street.',
90
- "Then your confusion doesn't make me a liar either.",
91
- ]
92
- )
93
- # [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
94
- ```
95
 
96
- <!--
97
- ### Direct Usage (Transformers)
98
 
99
- <details><summary>Click to see the direct usage in Transformers</summary>
100
 
101
- </details>
102
- -->
103
 
104
- <!--
105
- ### Downstream Usage (Sentence Transformers)
106
 
107
- You can finetune this model on your own dataset.
 
108
 
109
- <details><summary>Click to expand</summary>
 
 
 
 
 
110
 
111
- </details>
112
- -->
113
 
114
- <!--
115
- ### Out-of-Scope Use
116
 
117
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
118
- -->
119
 
120
- ## Evaluation
121
 
122
- ### Metrics
 
 
 
123
 
124
- #### Cross Encoder Correlation
 
125
 
126
- * Dataset: `sts-validation`
127
- * Evaluated with [<code>CECorrelationEvaluator</code>](https://sbert.net/docs/package_reference/cross_encoder/evaluation.html#sentence_transformers.cross_encoder.evaluation.CECorrelationEvaluator)
 
 
128
 
129
- | Metric | Value |
130
- |:-------------|:-----------|
131
- | pearson | 0.8763 |
132
- | **spearman** | **0.8689** |
133
 
134
- <!--
135
- ## Bias, Risks and Limitations
136
 
137
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
138
- -->
 
 
139
 
140
- <!--
141
- ### Recommendations
142
 
143
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
144
- -->
145
 
146
- ## Training Details
147
 
148
- ### Training Dataset
149
-
150
- #### Unnamed Dataset
151
-
152
- * Size: 5,749 training samples
153
- * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
154
- * Approximate statistics based on the first 1000 samples:
155
- | | sentence_0 | sentence_1 | label |
156
- |:--------|:------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------|:---------------------------------------------------------------|
157
- | type | string | string | float |
158
- | details | <ul><li>min: 17 characters</li><li>mean: 56.58 characters</li><li>max: 234 characters</li></ul> | <ul><li>min: 16 characters</li><li>mean: 57.3 characters</li><li>max: 235 characters</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.53</li><li>max: 1.0</li></ul> |
159
- * Samples:
160
- | sentence_0 | sentence_1 | label |
161
- |:----------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------|
162
- | <code>The little boy is singing and playing the guitar.</code> | <code>A baby is playing a guitar.</code> | <code>0.56</code> |
163
- | <code>executive director of the arms control association in washington daryl kimball stated that-- the iaea report is 1 in a series of bad signs. </code> | <code>executive director of the arms control association in washington daryl kimball stated the israeli document could affect the debate over india.</code> | <code>0.72</code> |
164
- | <code>it did not say if the men had been hanged in prison. </code> | <code>dozens of such criminals have been hanged in public.</code> | <code>0.36</code> |
165
- * Loss: [<code>BinaryCrossEntropyLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#binarycrossentropyloss) with these parameters:
166
- ```json
167
- {
168
- "activation_fn": "torch.nn.modules.linear.Identity",
169
- "pos_weight": null
170
- }
171
- ```
172
-
173
- ### Training Hyperparameters
174
- #### Non-Default Hyperparameters
175
-
176
- - `eval_strategy`: steps
177
- - `per_device_train_batch_size`: 96
178
- - `per_device_eval_batch_size`: 96
179
- - `fp16`: True
180
-
181
- #### All Hyperparameters
182
- <details><summary>Click to expand</summary>
183
-
184
- - `overwrite_output_dir`: False
185
- - `do_predict`: False
186
- - `eval_strategy`: steps
187
- - `prediction_loss_only`: True
188
- - `per_device_train_batch_size`: 96
189
- - `per_device_eval_batch_size`: 96
190
- - `per_gpu_train_batch_size`: None
191
- - `per_gpu_eval_batch_size`: None
192
- - `gradient_accumulation_steps`: 1
193
- - `eval_accumulation_steps`: None
194
- - `torch_empty_cache_steps`: None
195
- - `learning_rate`: 5e-05
196
- - `weight_decay`: 0.0
197
- - `adam_beta1`: 0.9
198
- - `adam_beta2`: 0.999
199
- - `adam_epsilon`: 1e-08
200
- - `max_grad_norm`: 1
201
- - `num_train_epochs`: 3
202
- - `max_steps`: -1
203
- - `lr_scheduler_type`: linear
204
- - `lr_scheduler_kwargs`: {}
205
- - `warmup_ratio`: 0.0
206
- - `warmup_steps`: 0
207
- - `log_level`: passive
208
- - `log_level_replica`: warning
209
- - `log_on_each_node`: True
210
- - `logging_nan_inf_filter`: True
211
- - `save_safetensors`: True
212
- - `save_on_each_node`: False
213
- - `save_only_model`: False
214
- - `restore_callback_states_from_checkpoint`: False
215
- - `no_cuda`: False
216
- - `use_cpu`: False
217
- - `use_mps_device`: False
218
- - `seed`: 42
219
- - `data_seed`: None
220
- - `jit_mode_eval`: False
221
- - `use_ipex`: False
222
- - `bf16`: False
223
- - `fp16`: True
224
- - `fp16_opt_level`: O1
225
- - `half_precision_backend`: auto
226
- - `bf16_full_eval`: False
227
- - `fp16_full_eval`: False
228
- - `tf32`: None
229
- - `local_rank`: 0
230
- - `ddp_backend`: None
231
- - `tpu_num_cores`: None
232
- - `tpu_metrics_debug`: False
233
- - `debug`: []
234
- - `dataloader_drop_last`: False
235
- - `dataloader_num_workers`: 0
236
- - `dataloader_prefetch_factor`: None
237
- - `past_index`: -1
238
- - `disable_tqdm`: False
239
- - `remove_unused_columns`: True
240
- - `label_names`: None
241
- - `load_best_model_at_end`: False
242
- - `ignore_data_skip`: False
243
- - `fsdp`: []
244
- - `fsdp_min_num_params`: 0
245
- - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
246
- - `tp_size`: 0
247
- - `fsdp_transformer_layer_cls_to_wrap`: None
248
- - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
249
- - `deepspeed`: None
250
- - `label_smoothing_factor`: 0.0
251
- - `optim`: adamw_torch
252
- - `optim_args`: None
253
- - `adafactor`: False
254
- - `group_by_length`: False
255
- - `length_column_name`: length
256
- - `ddp_find_unused_parameters`: None
257
- - `ddp_bucket_cap_mb`: None
258
- - `ddp_broadcast_buffers`: False
259
- - `dataloader_pin_memory`: True
260
- - `dataloader_persistent_workers`: False
261
- - `skip_memory_metrics`: True
262
- - `use_legacy_prediction_loop`: False
263
- - `push_to_hub`: False
264
- - `resume_from_checkpoint`: None
265
- - `hub_model_id`: None
266
- - `hub_strategy`: every_save
267
- - `hub_private_repo`: None
268
- - `hub_always_push`: False
269
- - `gradient_checkpointing`: False
270
- - `gradient_checkpointing_kwargs`: None
271
- - `include_inputs_for_metrics`: False
272
- - `include_for_metrics`: []
273
- - `eval_do_concat_batches`: True
274
- - `fp16_backend`: auto
275
- - `push_to_hub_model_id`: None
276
- - `push_to_hub_organization`: None
277
- - `mp_parameters`:
278
- - `auto_find_batch_size`: False
279
- - `full_determinism`: False
280
- - `torchdynamo`: None
281
- - `ray_scope`: last
282
- - `ddp_timeout`: 1800
283
- - `torch_compile`: False
284
- - `torch_compile_backend`: None
285
- - `torch_compile_mode`: None
286
- - `include_tokens_per_second`: False
287
- - `include_num_input_tokens_seen`: False
288
- - `neftune_noise_alpha`: None
289
- - `optim_target_modules`: None
290
- - `batch_eval_metrics`: False
291
- - `eval_on_start`: False
292
- - `use_liger_kernel`: False
293
- - `eval_use_gather_object`: False
294
- - `average_tokens_across_devices`: False
295
- - `prompts`: None
296
- - `batch_sampler`: batch_sampler
297
- - `multi_dataset_batch_sampler`: proportional
298
- - `router_mapping`: {}
299
- - `learning_rate_mapping`: {}
300
-
301
- </details>
302
-
303
- ### Training Logs
304
- | Epoch | Step | sts-validation_spearman |
305
- |:------:|:----:|:-----------------------:|
306
- | 0.3333 | 20 | 0.8638 |
307
- | 0.6667 | 40 | 0.8646 |
308
- | 1.0 | 60 | 0.8663 |
309
- | 1.3333 | 80 | 0.8688 |
310
- | 1.6667 | 100 | 0.8687 |
311
- | 2.0 | 120 | 0.8689 |
312
-
313
-
314
- ### Framework Versions
315
- - Python: 3.12.2
316
- - Sentence Transformers: 5.0.0
317
- - Transformers: 4.51.3
318
- - PyTorch: 2.7.1+cu126
319
- - Accelerate: 1.9.0
320
- - Datasets: 4.0.0
321
- - Tokenizers: 0.21.2
322
 
323
  ## Citation
324
 
325
- ### BibTeX
326
 
327
- #### Sentence Transformers
328
  ```bibtex
329
- @inproceedings{reimers-2019-sentence-bert,
330
- title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
331
- author = "Reimers, Nils and Gurevych, Iryna",
332
- booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
333
- month = "11",
334
- year = "2019",
335
- publisher = "Association for Computational Linguistics",
336
- url = "https://arxiv.org/abs/1908.10084",
337
  }
338
  ```
339
 
340
- <!--
341
- ## Glossary
342
-
343
- *Clearly define terms in order to be accessible across audiences.*
344
- -->
345
-
346
- <!--
347
- ## Model Card Authors
348
-
349
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
350
- -->
351
 
352
- <!--
353
- ## Model Card Contact
354
 
355
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
356
- -->
 
1
  ---
2
+ license: mit
3
+ datasets:
4
+ - dleemiller/wiki-sim
5
+ - sentence-transformers/stsb
6
+ language:
7
+ - en
8
+ metrics:
9
+ - spearmanr
10
+ - pearsonr
11
+ base_model:
12
+ - jhu-clsp/ettin-encoder-32m
13
+ pipeline_tag: text-classification
14
+ library_name: sentence-transformers
15
  tags:
 
16
  - cross-encoder
17
+ - modernbert
18
+ - sts
19
+ - stsb
20
+ - stsbenchmark-sts
 
 
 
 
 
21
  model-index:
22
+ - name: CrossEncoder based on jhu-clsp/ettin-encoder-32m
23
  results:
24
  - task:
25
+ type: semantic-similarity
26
+ name: Semantic Similarity
27
  dataset:
28
+ name: sts test
29
+ type: sts-test
30
  metrics:
31
+ - type: pearson_cosine
32
  value: 0.8763053568934394
33
+ name: Pearson Cosine
34
+ - type: spearman_cosine
35
  value: 0.8688596158541986
36
+ name: Spearman Cosine
37
+ - task:
38
+ type: semantic-similarity
39
+ name: Semantic Similarity
40
+ dataset:
41
+ name: sts dev
42
+ type: sts-dev
43
+ metrics:
44
+ - type: pearson_cosine
45
+ value: 0.8786893775398513
46
+ name: Pearson Cosine
47
+ - type: spearman_cosine
48
+ value: 0.8754715235067954
49
+ name: Spearman Cosine
50
  ---
51
 
52
+ # EttinX Cross-Encoder: Semantic Similarity (STS)
 
 
 
 
53
 
54
+ Cross encoders are high-performing encoder models that compare two texts and output a similarity score between 0 and 1.
55
+ I've found the `cross-encoder/stsb-roberta-large` model to be very useful in creating evaluators for LLM outputs.
56
+ They're simple to use, fast and very accurate.
 
 
 
 
 
57
 
58
+ The Ettin series provides new encoders built on the ModernBERT architecture in a range of sizes, starting at 17M parameters.
59
+ The reduced parameters and computationally efficient interleaved local/global attention layers make this a very fast model,
60
+ which can easily process a few hundred sentence pairs per second on CPU, and a few thousand per second on my A6000.
61
 
62
+ ---
 
 
 
 
 
63
 
64
+ ## Features
65
+ - **High performing:** Achieves **Pearson: 0.8763** and **Spearman: 0.8689** on the STS-Benchmark test set.
66
+ - **Efficient architecture:** Based on the Ettin-encoder design (32M parameters), offering very fast inference speeds.
67
+ - **Extended context length:** Processes sequences up to 8192 tokens, great for LLM output evals.
68
+ - **Diversified training:** Pretrained on `dleemiller/wiki-sim` and fine-tuned on `sentence-transformers/stsb`.
69
 
70
+ ---
71
 
72
+ ## Performance
 
 
73
 
74
+ | Model | STS-B Test Pearson | STS-B Test Spearman | Context Length | Parameters | Speed |
75
+ |--------------------------------|--------------------|---------------------|----------------|------------|---------|
76
+ | `ModernCE-large-sts` | **0.9256** | **0.9215** | **8192** | 395M | **Medium** |
77
+ | `ModernCE-base-sts` | **0.9162** | **0.9122** | **8192** | 149M | **Fast** |
78
+ | `stsb-roberta-large` | 0.9147 | - | 512 | 355M | Slow |
79
+ | `stsb-distilroberta-base` | 0.8792 | - | 512 | 82M | Fast |
80
+ | `EttinX-sts-xs` | 0.8763 | 0.8689 | **8192** | 32M | **Very Fast** |
81
+ | `EttinX-sts-xxs` | 0.8414 | 0.8311 | **8192** | 17M | **Very Fast** |
82
 
83
 
84
+ ---
 
85
 
86
+ ## Usage
87
 
88
+ To use EttinX for semantic similarity tasks, you can load the model with the Hugging Face `sentence-transformers` library:
 
89
 
90
+ ```python
91
+ from sentence_transformers import CrossEncoder
92
 
93
+ # Load EttinX model
94
+ model = CrossEncoder("dleemiller/EttinX-sts-xs")
95
 
96
+ # Predict similarity scores for sentence pairs
97
+ sentence_pairs = [
98
+ ("It's a wonderful day outside.", "It's so sunny today!"),
99
+ ("It's a wonderful day outside.", "He drove to work earlier."),
100
+ ]
101
+ scores = model.predict(sentence_pairs)
102
 
103
+ print(scores)  # Example output: [0.9184 0.0123]
104
+ ```
105
 
106
+ ### Output
107
+ The model returns similarity scores in the range `[0, 1]`, where higher scores indicate stronger semantic similarity.
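
As one way to turn these scores into an evaluator for LLM outputs, here is a minimal sketch of a pass/fail check against a reference answer. The helper name `passes_similarity_check`, the 0.8 threshold, and the stand-in scorer `fake_predict` are assumptions made for illustration; they are not part of the model or of the `sentence-transformers` API. In real use, `predict` would be the `predict` method of the loaded `CrossEncoder`.

```python
from typing import Callable, Sequence


def passes_similarity_check(
    predict: Callable[[Sequence[tuple[str, str]]], Sequence[float]],
    candidates: Sequence[str],
    reference: str,
    threshold: float = 0.8,  # hypothetical cutoff; tune it on your own data
) -> list[bool]:
    """Return one pass/fail flag per candidate, scoring each against the reference."""
    pairs = [(candidate, reference) for candidate in candidates]
    scores = predict(pairs)
    return [float(score) >= threshold for score in scores]


def fake_predict(pairs):
    """Stand-in scorer so the sketch runs without downloading a model."""
    return [1.0 if a == b else 0.1 for a, b in pairs]


flags = passes_similarity_check(
    fake_predict,
    candidates=["Paris is the capital.", "I like trains."],
    reference="Paris is the capital.",
)
print(flags)  # [True, False]
```

Passing the scorer in as a callable keeps the check independent of any particular model, so the same helper could be reused with a different cross encoder.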
108
 
109
+ ---
 
110
 
111
+ ## Training Details
112
 
113
+ ### Pretraining
114
+ The model was pretrained on the `pair-score-sampled` subset of the [`dleemiller/wiki-sim`](https://huggingface.co/datasets/dleemiller/wiki-sim) dataset. This dataset provides diverse sentence pairs with semantic similarity scores, helping the model build a robust understanding of relationships between sentences.
115
+ - **Classifier Dropout:** a somewhat large classifier dropout of 0.3, to reduce overreliance on teacher scores.
116
+ - **Objective:** Match the STS-B-style similarity scores assigned by the teacher model `cross-encoder/stsb-roberta-large`.
117
 
118
+ ### Fine-Tuning
119
+ Fine-tuning was performed on the [`sentence-transformers/stsb`](https://huggingface.co/datasets/sentence-transformers/stsb) dataset.
120
 
121
+ ### Test Results
122
+ After fine-tuning, the model achieved the following results on the STS-B test set:
123
+ - **Pearson Correlation:** 0.8763
124
+ - **Spearman Correlation:** 0.8689
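
For readers unfamiliar with the two metrics, the following is a small pure-Python sketch of how Pearson and Spearman correlations between gold labels and predicted scores can be computed. It is not the evaluator that produced the numbers above; the function names and the sample values are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance normalised by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all values tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up gold labels and predictions, purely for illustration.
gold = [0.0, 0.2, 0.5, 0.8, 1.0]
predicted = [0.1, 0.05, 0.6, 0.7, 0.95]
print(f"pearson={pearson(gold, predicted):.4f} spearman={spearman(gold, predicted):.4f}")
```

Spearman only looks at the ordering of the scores, so it is unaffected by any monotonic rescaling of the model outputs, while Pearson also rewards a linear match to the gold values.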
125
 
126
+ ---
 
 
 
127
 
128
+ ## Model Card
 
129
 
130
+ - **Architecture:** Ettin-encoder-32m
131
+ - **Tokenizer:** Custom tokenizer trained with modern techniques for long-context handling.
132
+ - **Pretraining Data:** `dleemiller/wiki-sim (pair-score-sampled)`
133
+ - **Fine-Tuning Data:** `sentence-transformers/stsb`
134
 
135
+ ---
 
136
 
137
+ ## Thank You
 
138
 
139
+ Thanks to the Johns Hopkins team for providing the Ettin encoder models, built on the ModernBERT architecture, and to the Sentence Transformers team for their leadership in transformer encoder models.
140
 
141
+ ---
142
 
143
  ## Citation
144
 
145
+ If you use this model in your research, please cite:
146
 
 
147
  ```bibtex
148
+ @misc{ettinxstsb2025,
149
+ author = {Miller, D. Lee},
150
+ title = {EttinX STS: An STS cross encoder model},
151
+ year = {2025},
152
+ publisher = {Hugging Face Hub},
153
+ url = {https://huggingface.co/dleemiller/EttinX-sts-xs},
 
 
154
  }
155
  ```
156
 
157
+ ---
 
 
 
 
 
 
 
 
 
 
158
 
159
+ ## License
 
160
 
161
+ This model is licensed under the [MIT License](LICENSE).