USER2-base

USER2 is a new generation of the Universal Sentence Encoder for Russian, designed for sentence representation with long-context support of up to 8,192 tokens.

The models are built on top of the RuModernBERT encoders and are fine-tuned for retrieval and semantic tasks.
They also support Matryoshka Representation Learning (MRL) — a technique that enables reducing embedding size with minimal loss in representation quality.

This is a base model with 149 million parameters.

| Model | Size | Context Length | Hidden Dim | MRL Dims |
|---|---|---|---|---|
| deepvk/USER2-small | 34M | 8192 | 384 | [32, 64, 128, 256, 384] |
| deepvk/USER2-base | 149M | 8192 | 768 | [32, 64, 128, 256, 384, 512, 768] |

Performance

To evaluate the model, we measure quality on the MTEB-rus benchmark. Additionally, to measure long-context retrieval, we run the Russian subset of the MultiLongDocRetrieval (MLDR) task.

MTEB-rus

| Model | Size | Hidden Dim | Context Length | MRL support | Mean(task) | Mean(taskType) | Classification | Clustering | MultiLabelClassification | PairClassification | Reranking | Retrieval | STS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| USER-base | 124M | 768 | 512 | | 58.11 | 56.67 | 59.89 | 53.26 | 37.72 | 59.76 | 55.58 | 56.14 | 74.35 |
| USER-bge-m3 | 359M | 1024 | 8192 | | 62.80 | 62.28 | 61.92 | 53.66 | 36.18 | 65.07 | 68.72 | 73.63 | 76.76 |
| multilingual-e5-base | 278M | 768 | 512 | | 58.34 | 57.24 | 58.25 | 50.27 | 33.65 | 54.98 | 66.24 | 67.14 | 70.16 |
| multilingual-e5-large-instruct | 560M | 1024 | 512 | | 65.00 | 63.36 | 66.28 | 63.13 | 41.15 | 63.89 | 64.35 | 68.23 | 76.48 |
| jina-embeddings-v3 | 572M | 1024 | 8192 | | 63.45 | 60.93 | 65.24 | 60.90 | 39.24 | 59.22 | 53.86 | 71.99 | 76.04 |
| ru-en-RoSBERTa | 404M | 1024 | 512 | | 61.71 | 60.40 | 62.56 | 56.06 | 38.88 | 60.79 | 63.89 | 66.52 | 74.13 |
| USER2-small | 34M | 384 | 8192 | | 58.32 | 56.68 | 59.76 | 57.06 | 33.56 | 54.02 | 58.26 | 61.87 | 72.25 |
| USER2-base | 149M | 768 | 8192 | | 61.12 | 59.59 | 61.67 | 59.22 | 36.61 | 56.39 | 62.06 | 66.90 | 74.28 |

MLDR-rus

| Model | Size | nDCG@10 ↑ |
|---|---|---|
| USER-bge-m3 | 359M | 58.53 |
| KaLM-v1.5 | 494M | 53.75 |
| jina-embeddings-v3 | 572M | 49.67 |
| E5-mistral-7b | 7.11B | 52.40 |
| USER2-small | 34M | 51.69 |
| USER2-base | 149M | 54.17 |

We compare only models with a context length of 8,192 tokens.

Matryoshka

To evaluate MRL capabilities, we also use MTEB-rus, applying dimensionality cropping to the embeddings to match the selected size.

[Figure: MRL evaluation on MTEB-rus]

Usage

Prefixes

This model is trained similarly to Nomic Embed and expects task-specific prefixes to be added to the input. The choice of prefix depends on the specific task. We follow a few general guidelines when selecting a prefix:

  • "classification: " is the default and most universal prefix, often performing well across a variety of tasks.
  • "clustering: " is recommended for clustering applications: group texts into clusters, discover shared topics, or remove semantic duplicates.
  • "search_query: " and "search_document: " are intended for retrieval and reranking tasks. Also, in some classification tasks, especially with shorter texts, "search_query" shows superior performance to other prefixes. On the other hand, "search_document" can be beneficial for long-context sentence similarity tasks.

However, we encourage users to experiment with different prefixes, as certain domains may benefit from specific ones.
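When the prefixes are added by hand, for example with the plain Transformers API, a small helper can prevent typos. The sketch below is only an illustration: the `add_prefix` function and the `PREFIXES` mapping are hypothetical names, not part of the model or of any library, and the example texts are made up.

```python
# The four prefixes named in the guidelines above (note the trailing space).
PREFIXES = {
    "classification": "classification: ",
    "clustering": "clustering: ",
    "search_query": "search_query: ",
    "search_document": "search_document: ",
}


def add_prefix(texts, task="classification"):
    """Prepend the task prefix to every text (hypothetical helper)."""
    if task not in PREFIXES:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(PREFIXES)}")
    prefix = PREFIXES[task]
    return [prefix + text for text in texts]


# Illustrative inputs for a retrieval setup.
queries = add_prefix(["How do I reset my password?"], task="search_query")
documents = add_prefix(["Open Settings and choose Reset password."], task="search_document")
print(queries[0])
print(documents[0])
```

Defaulting to "classification" mirrors the guideline that it is the most universal prefix; for retrieval, queries and documents should get their respective prefixes.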

Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-base")

query_embeddings = model.encode(["Когда был спущен на воду первый миноносец «Спокойный»?"], prompt_name="search_query")
document_embeddings = model.encode(["Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."], prompt_name="search_document")

similarities = model.similarity(query_embeddings, document_embeddings)

To truncate the embedding dimension, simply pass the new value to the model initialization:

model = SentenceTransformer("deepvk/USER2-base", truncate_dim=128)

This model was trained with dimensions [32, 64, 128, 256, 384, 512, 768], so it’s recommended to use one of these for best performance.
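If the target size comes from a memory or index budget rather than from the list above, one option is to round it down to the nearest trained dimension. This is a minimal sketch of that idea; `largest_trained_dim` is a hypothetical helper, and rounding down (rather than up) is an assumption made here to stay within the budget.

```python
# MRL dimensions listed for deepvk/USER2-base.
MRL_DIMS = [32, 64, 128, 256, 384, 512, 768]


def largest_trained_dim(requested):
    """Return the largest trained MRL dimension that does not exceed `requested`."""
    candidates = [dim for dim in MRL_DIMS if dim <= requested]
    if not candidates:
        raise ValueError(f"requested size {requested} is below the smallest trained dim {MRL_DIMS[0]}")
    return max(candidates)


truncate_dim = largest_trained_dim(200)
print(truncate_dim)
```

The resulting value can then be passed as truncate_dim when creating the model.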

Transformers

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


queries = ["search_query: Когда был спущен на воду первый миноносец «Спокойный»?"]
documents = ["search_document: Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."]

tokenizer = AutoTokenizer.from_pretrained("deepvk/USER2-base")
model = AutoModel.from_pretrained("deepvk/USER2-base")

encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    queries_outputs = model(**encoded_queries)
    documents_outputs = model(**encoded_documents)

query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)

similarities = query_embeddings @ doc_embeddings.T

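To truncate the embedding dimension, keep only the first truncate_dim values of each pooled embedding, then normalize:

truncate_dim = 128  # one of the trained MRL dimensions

query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = query_embeddings[:, :truncate_dim]  # crop before normalizing
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)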
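The order of the two steps matters: cropping a unit-length vector leaves a vector whose norm is below 1, so dot products would no longer be cosine similarities. The toy example below illustrates this with plain Python lists instead of tensors; the three-dimensional vector and the `truncate_and_normalize` helper are made up for illustration.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` values, then rescale to unit length."""
    cropped = list(vector[:dim])
    norm = l2_norm(cropped)
    if norm == 0.0:
        return cropped  # avoid division by zero for an all-zero vector
    return [x / norm for x in cropped]


# A toy unit-length "embedding" (3-4-12-13 is a Pythagorean quadruple).
full = [3 / 13, 4 / 13, 12 / 13]

cropped_norm = l2_norm(full[:2])         # norm after cropping only
short = truncate_and_normalize(full, 2)  # cropped and rescaled

print(round(cropped_norm, 4), [round(x, 4) for x in short])
```

After cropping alone the norm drops to 5/13; renormalizing restores unit length so that the matrix product used for similarities stays a cosine similarity.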

Training details

This is the base version with 149 million parameters, based on RuModernBERT-base.
It was fine-tuned in three stages: RetroMAE, Weakly Supervised Fine-Tuning, and Supervised Fine-Tuning.

Following the bge-m3 training strategy, we use RetroMAE as a retrieval-oriented continuous pretraining step.
Leveraging data from the final stage of RuModernBERT training, RetroMAE enhances retrieval quality—particularly for long-context inputs.

To follow best practices for building a state-of-the-art encoder, we rely on large-scale training with weakly related text pairs.
However, such datasets are not publicly available for Russian, unlike for English or Chinese.
To overcome this, we apply two complementary strategies:

  • Cross-lingual transfer: We train on both English and Russian data, leveraging English resources (nomic-unsupervised) alongside our in-house English-Russian parallel corpora.
  • Unsupervised pair mining: From the deepvk/cultura_ru_edu corpus, we extract 50M pairs using a simple heuristic—selecting non-overlapping text blocks that are not substrings of one another.

This approach has shown promising results, allowing us to train high-performing models with minimal target-language pairs—especially when compared to pipelines used for other languages.
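The pair-mining heuristic can be pictured roughly as follows. This is a simplified sketch and not the actual pipeline: splitting on blank lines, the `min_chars` threshold, and pairing every combination of blocks are all assumptions made for illustration.

```python
from itertools import combinations


def mine_pairs(document, min_chars=20):
    """Return pairs of text blocks from one document (illustrative heuristic).

    Blocks come from splitting on blank lines, so they never overlap.
    A pair is dropped when one block is a substring of the other.
    """
    blocks = [block.strip() for block in document.split("\n\n")]
    blocks = [block for block in blocks if len(block) >= min_chars]
    pairs = []
    for first, second in combinations(blocks, 2):
        if first in second or second in first:
            continue  # near-duplicate content, not a useful training pair
        pairs.append((first, second))
    return pairs


document = (
    "The destroyer was laid down in 1950.\n\n"
    "It entered service two years later.\n\n"
    "The destroyer was laid down in 1950. It was launched in 1951."
)
pairs = mine_pairs(document)
print(len(pairs))
```

In this toy document the first block is contained in the third, so that pair is discarded and two pairs remain.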

The table below shows the datasets used and the number of times each was upsampled.

| Dataset | Size | Upsample |
|---|---|---|
| nomic-en | 235M | 1 |
| nomic-ru | 39M | 3 |
| in-house En-Ru parallel | 250M | 1 |
| cultura-sampled | 50M | 1 |
| Total | 652M | |
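The total counts each dataset as many times as it is upsampled. A quick check of the arithmetic, with sizes in millions of pairs taken from the table (the variable names are illustrative):

```python
# (size in millions of pairs, upsample factor) for each weakly supervised dataset
datasets = {
    "nomic-en": (235, 1),
    "nomic-ru": (39, 3),
    "in-house En-Ru parallel": (250, 1),
    "cultura-sampled": (50, 1),
}

# Effective size = raw size multiplied by the number of times it is repeated.
effective = {name: size * upsample for name, (size, upsample) in datasets.items()}
total = sum(effective.values())

print(effective["nomic-ru"], total)
```

With the threefold upsampling, nomic-ru contributes 117M of the 652M pairs.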

For the third stage, we switch to cleaner, task-specific datasets.
In some cases, additional filtering was applied using a cross-encoder.
For all retrieval datasets, we mine hard negatives.

| Dataset | Examples | Notes |
|---|---|---|
| Nomic-en-supervised | 1.7 M | Unmodified |
| AllNLI | 200 K | Translated SNLI/MNLI/ANLI to Russian |
| fishkinet-posts | 93 K | Title–content pairs |
| gazeta | 55 K | Title–text pairs |
| habr_qna | 100 K | Title–description pairs |
| lenta | 100 K | Title–news pairs |
| miracl_ru | 10 K | One positive per anchor |
| mldr_ru | 1.8 K | Unmodified |
| mr-tydi_ru | 5.3 K | Unmodified |
| mmarco_ru | 500 K | Unmodified |
| ru-HNP | 100 K | One pos + one neg per anchor |
| ru-queries | 199 K | In-house (generated as in arXiv:2401.00368) |
| ru-WaNLI | 35 K | Entailment -> pos, contradiction -> neg |
| sampled_wiki | 1 M | Sampled text blocks from Wikipedia |
| summ_dialog_news | 37 K | Summary–info pairs |
| wikiomnia_qna | 100 K | QA pairs (T5-generated) |
| yandex_q | 83 K | Q+desc-answer pairs |
| Total | 4.3 M | |
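The note for ru-WaNLI (entailment becomes a positive, contradiction becomes a negative) can be illustrated with a small conversion routine. The sketch below is an assumption about how such a mapping might look, not the code used for training: the field names `premise`, `hypothesis`, and `label`, the grouping by premise, and the dropping of neutral rows are all hypothetical.

```python
def nli_to_triplets(rows):
    """Turn NLI rows into (anchor, positive, negative) triplets.

    Entailed hypotheses become positives and contradicted ones become
    negatives for the same premise; other labels are ignored.
    """
    by_premise = {}
    for row in rows:
        slot = by_premise.setdefault(row["premise"], {"pos": [], "neg": []})
        if row["label"] == "entailment":
            slot["pos"].append(row["hypothesis"])
        elif row["label"] == "contradiction":
            slot["neg"].append(row["hypothesis"])

    triplets = []
    for premise, slot in by_premise.items():
        for positive in slot["pos"]:
            for negative in slot["neg"]:
                triplets.append((premise, positive, negative))
    return triplets


rows = [
    {"premise": "A man is playing a guitar.", "hypothesis": "A person makes music.", "label": "entailment"},
    {"premise": "A man is playing a guitar.", "hypothesis": "Nobody is playing.", "label": "contradiction"},
    {"premise": "A man is playing a guitar.", "hypothesis": "He is on a stage.", "label": "neutral"},
]
triplets = nli_to_triplets(rows)
print(triplets)
```

Premises that lack either a positive or a negative produce no triplet in this sketch.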

Ablation

Alongside the final model, we also release all intermediate training steps.
Both the retromae and weakly_sft models are available under the specified revisions in this repository.
We hope these additional models prove useful for your experiments.
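An intermediate checkpoint can presumably be loaded by passing the revision name when creating the model. The sketch below assumes that the revision names match the stage names mentioned above (retromae and weakly_sft) and relies on the `revision` argument of `SentenceTransformer`; the `load_stage` helper is a hypothetical convenience wrapper.

```python
# Stage names as mentioned in the text; assumed to be the revision names too.
STAGES = ("retromae", "weakly_sft")


def load_stage(stage=None):
    """Load an intermediate checkpoint, or the final model when stage is None."""
    # Imported lazily so the helper can be defined without the library installed.
    from sentence_transformers import SentenceTransformer

    if stage is not None and stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {STAGES}")
    return SentenceTransformer("deepvk/USER2-base", revision=stage)


# Example (downloads the weights on first use):
# retromae_model = load_stage("retromae")
```

Note that the intermediate checkpoints may not behave like the final model with respect to prefixes, since task-specific fine-tuning happens in later stages.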

Below is a comparison of all training stages on a subset of MTEB-rus.

[Figure: training_stages, comparison of all training stages on a subset of MTEB-rus]

Citations

@misc{deepvk2025user,
    title={USER2},
    author={Malashenko, Boris and Spirin, Egor and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/USER2-base},
    publisher={Hugging Face},
    year={2025},
}