llama-embed-nemotron-8b
Model Overview
Description:
llama-embed-nemotron-8b is a versatile text embedding model trained by NVIDIA and optimized for retrieval, reranking, semantic similarity, and classification use cases. This model has robust capabilities for multilingual and cross-lingual text retrieval. It is designed to serve as a foundational component in text-based Retrieval-Augmented Generation (RAG) systems.
This model achieves state-of-the-art performance on the multilingual MTEB leaderboard (as of October 21, 2025).
This model is for non-commercial/research use only.
License/Terms of Use
Governing Terms for llama-embed-nemotron-8b model: NVIDIA License 
Additional Information: Llama-3.1 Community License Agreement for meta-llama/Llama-3.1-8B. Acceptable Use Policy. Built with Llama. 
Team
- Yauhen Babakhin
- Radek Osmulski
- Ronay Ak
- Gabriel Moreira
- Mengyao Xu
- Benedikt Schifferer
- Bo Liu
- Even Oldridge
Correspondence to Yauhen Babakhin ([email protected]) and Bo Liu ([email protected]).
Citation
The technical report for the llama-embed-nemotron-8b model will be published soon.
@misc{lee2024nv,
  title={NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models},
  author={Lee, Chankyu and Roy, Rajarshi and Xu, Mengyao and Raiman, Jonathan and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  journal={arXiv preprint arXiv:2405.17428},
  year={2024}
}
@misc{moreira2025nvretrieverimprovingtextembedding,
      title={NV-Retriever: Improving text embedding models with effective hard-negative mining}, 
      author={Gabriel de Souza P. Moreira and Radek Osmulski and Mengyao Xu and Ronay Ak and Benedikt Schifferer and Even Oldridge},
      year={2025},
      eprint={2407.15831},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2407.15831}, 
}
NVIDIA’s Retrieval Models
| Model Name | Use Case | Comment | 
|---|---|---|
| nvidia/omni-embed-nemotron-3b | Research-Only | Omni-Modal Embedding Model for Retrieving Text, Images, Audio, or Video | 
| nvidia/llama-NemoRetriever-ColEmbed-1B-v1 | Research-Only | Smaller Version of nvidia/llama-NemoRetriever-ColEmbed-3B-v1 | 
| nvidia/llama-NemoRetriever-ColEmbed-3B-v1 | Research-Only | #1 ViDoRe V1, V2 and MTEB VisualDocumentRetrieval as of June 27, 2025 | 
| llama-3_2-nemoretriever-1b-vlm-embed-v1 | Commercial Application | MultiModal Embedding Model for Production Use Case of Visual Document Retrieval | 
| llama-3_2-nv-embedqa-1b-v2 | Commercial Application | Text Embedding Model for Production Use Case of Text Document Retrieval | 
| llama-3_2-nemoretriever-500m-rerank-v2 | Commercial Application | Text Reranker Model for Production Use Case of Text Document Retrieval | 
| llama-3_2-nv-rerankqa-1b-v2 | Commercial Application | Text Reranker Model for Production Use Case of Text Document Retrieval | 
| nvidia/NV-Embed-v2 | Research-Only | #1 MTEB as of Aug 30, 2024 | 
| nvidia/MM-Embed | Research-Only | Improved nvidia/NV-Embed-v1 and multimodal embeddings | 
| nvidia/NV-Retriever-v1 | Research-Only | #1 MTEB BEIR as of July 12, 2024 | 
Deployment Geography:
Global
Use Case:
The llama-embed-nemotron-8b model is intended for researchers developing applications that need to understand or retrieve information from text. It is well-suited for multilingual RAG systems in which queries and documents are textual and may be in different languages.
Release Date:
Hugging Face on 10/21/2025 via https://huggingface.co/nvidia/llama-embed-nemotron-8b 
Model Architecture:
- Architecture Type: Transformer Decoder 
- Network Architecture: Llama-3.1-8B with bi-directional attention 
- This model was developed based on the meta-llama/Llama-3.1-8B model.
- Number of model parameters: 7,504,924,672 
This llama-embed-nemotron-8b embedding model is a fine-tuned version of the Llama-3.1-8B transformer decoder architecture with a bidirectional attention mechanism. The model consists of 32 hidden layers with an embedding size of 4096 and was trained on public and synthetically generated datasets. Embedding models for text retrieval are typically trained using a bi-encoder architecture, in which each text in a pair (for example, a query and a chunked passage) is encoded independently by the embedding model. Contrastive learning is then used to maximize the similarity between the query and the passage that contains the answer, while minimizing the similarity between the query and sampled negative passages that are not useful for answering the question.
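The contrastive objective described above can be illustrated with a small sketch. It assumes an InfoNCE-style loss over cosine similarities; the model card does not name the exact loss, and the `temperature` value and the toy 3-dimensional vectors are illustrative assumptions, not training settings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one query: -log softmax of the positive's similarity.

    `temperature` is an assumed, illustrative value.
    """
    # Scaled similarity of the query to the positive (index 0) and each negative.
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[0]

# Toy 3-dimensional "embeddings" (the real model produces 4096 dimensions).
query = [1.0, 0.2, 0.0]
positive = [0.9, 0.3, 0.1]
negatives = [[0.0, 1.0, 0.0], [-0.5, 0.1, 0.9]]

aligned = info_nce_loss(query, positive, negatives)
# Treat an unrelated passage as the "positive" to see the loss grow.
swapped = info_nce_loss(query, negatives[0], [positive, negatives[1]])
print(aligned < swapped)
```

The loss is small when the passage labeled as positive is already the closest one to the query, and large when a negative is closer, which is the signal that pulls matching pairs together and pushes negatives apart.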
 
	
		
	
	
Input:
| Property | Query | Document | 
|---|---|---|
| Input Type | Text | Text | 
| Input Format | List of strings | List of strings | 
| Input Parameter | One-Dimensional (1D) | 1D | 
| Other Properties | Maximum input sequence length is 32768 tokens. | Maximum input sequence length is 32768 tokens. |

Output:
Output Type(s): Floats 
Output Format: List of floats 
Output Parameters: One-Dimensional (1D) 
Other Properties Related to Output: The model outputs an embedding vector of dimension 4096 for each text input.
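As a rough sizing aid, the output dimension translates directly into storage per vector. The sketch below is plain arithmetic under stated assumptions: the corpus size of one million vectors is hypothetical, and the estimate ignores any overhead a vector index would add.

```python
EMBEDDING_DIM = 4096  # output dimension stated in the model card

def index_size_bytes(num_vectors: int, bytes_per_value: int) -> int:
    """Raw size of the stored vectors, ignoring any index overhead."""
    return num_vectors * EMBEDDING_DIM * bytes_per_value

one_million = 1_000_000  # hypothetical corpus size
fp32_gib = index_size_bytes(one_million, 4) / 2**30
fp16_gib = index_size_bytes(one_million, 2) / 2**30
print(f"float32: {fp32_gib:.2f} GiB, float16: {fp16_gib:.2f} GiB")
# float32: 15.26 GiB, float16: 7.63 GiB
```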
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. 
 
Usage
The llama-embed-nemotron-8b model is instruction-aware, meaning that it supports custom instructions to improve performance for specific use cases or scenarios. In particular, for the retrieval use case, the model expects:
- Queries accompanied by a task instruction, using the following template: f"Instruct: {task_instruction}\nQuery: {query}"
- Documents (passages) without any special handling
The model requires transformers version 4.51.0 and flash-attention (for GPU processing):
pip install transformers==4.51.0
pip install flash-attn==2.6.3
You can use the model either with Sentence Transformers, as shown here:
pip install sentence-transformers
from sentence_transformers import SentenceTransformer
attn_implementation = "eager"  # Or "flash_attention_2"
model = SentenceTransformer(
    "nvidia/llama-embed-nemotron-8b",
    trust_remote_code=True,
    model_kwargs={"attn_implementation": attn_implementation, "torch_dtype": "float16"},
    tokenizer_kwargs={"padding_side": "left"},
)
queries = [
    "How do neural networks learn patterns from examples?"
]
documents = [
    "Deep learning models adjust their weights through backpropagation, using gradient descent to minimize error on training data and improve predictions over time.",
    "Market prices are determined by the relationship between how much people want to buy a product and how much is available for sale, with scarcity driving prices up and abundance driving them down.",
]
# NOTE: encode_query uses the "query" prompt automatically
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
scores = (query_embeddings @ document_embeddings.T)
print(scores.tolist())
# [[0.37646484375, 0.057891845703125]]
Or with Hugging Face Transformers directly, as shown here:
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
def average_pool(last_hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average pooling with attention mask."""
    
    last_hidden_states_masked = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    embedding = last_hidden_states_masked.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    embedding = F.normalize(embedding, dim=-1)
    
    return embedding
# Build the instruction-prefixed query string expected by the model
def get_instruction(task_instruction: str, query: str) -> str:
    return f"Instruct: {task_instruction}\nQuery: {query}"
model_name_or_path = "nvidia/llama-embed-nemotron-8b"
attn_implementation = "flash_attention_2" if torch.cuda.is_available() else "eager"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
    padding_side="left",
)
# Load model
model = AutoModel.from_pretrained(
    model_name_or_path, 
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation=attn_implementation,
).eval()
model = model.to("cuda:0" if torch.cuda.is_available() else "cpu")
# The model is instruction-aware: each query must be prefixed with a short task instruction
task = "Given a question, retrieve passages that answer the question"
queries = [
    get_instruction(task, "How do neural networks learn patterns from examples?"),
]
# No instruction is required for the document corpus
documents = [
    "Deep learning models adjust their weights through backpropagation, using gradient descent to minimize error on training data and improve predictions over time.",
    "Market prices are determined by the relationship between how much people want to buy a product and how much is available for sale, with scarcity driving prices up and abundance driving them down.",
]
input_texts = queries + documents
# Tokenize the input texts
batch_dict = tokenizer(
    text=input_texts,
    max_length=4096,
    padding=True,
    truncation=True,
    return_tensors="pt",
).to(model.device)
attention_mask = batch_dict["attention_mask"]
# Forward pass
model_outputs = model(**batch_dict)
# Average pooling
embeddings = average_pool(model_outputs.last_hidden_state, attention_mask)
scores = (embeddings[:1] @ embeddings[1:].T)
print(scores.tolist())
# [[0.37646484375, 0.0579833984375]]
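Because the embeddings in the example above are L2-normalized, each dot product is a cosine similarity, so a higher score means a closer match. A minimal sketch of turning one row of scores into a ranked list follows; `rank_documents` and the shortened document labels are hypothetical names introduced for illustration, and the score values are the ones printed above.

```python
def rank_documents(scores, documents, top_k=None):
    """Return (document, score) pairs for one query, best match first."""
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]

# Scores printed by the example above: one query, two documents.
scores = [[0.37646484375, 0.0579833984375]]
documents = ["backpropagation passage", "market prices passage"]  # shortened labels

for doc, score in rank_documents(scores[0], documents):
    print(f"{score:.4f}  {doc}")
```

With several queries, the same function can be applied to each row of the score matrix.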
Software Integration:
Runtime Engine(s):
- TensorRT, Triton 
Supported Hardware Microarchitecture Compatibility: 
- NVIDIA Ampere 
- NVIDIA Hopper 
- NVIDIA Lovelace 
- NVIDIA Pascal 
- NVIDIA Turing 
- NVIDIA Volta 
Preferred/Supported Operating System(s):
- Linux 
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment. This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.
Model Version(s):
llama-embed-nemotron-8b-v1
Training and Testing Datasets
Training Dataset:
Data Modality 
- Text 
Text Training Data Size 
- 1 Billion to 10 Trillion Tokens
Data Collection Method by dataset 
- Hybrid: Human, Automated, Synthetic
Labeling Method by dataset
- Hybrid: Human, Automated, Synthetic
Properties: 16.4M query-passage pairs from public and synthetically generated datasets. 
Testing Dataset:
We test the model on 131 tasks from MMTEB: Massive Multilingual Text Embedding Benchmark (MTEB(Multilingual, v2) split).
Benchmark specs: 
- Number of languages: 1038
- Number of task types: 9
- Number of domains: 20 
MMTEB Leaderboard Benchmark Ranking 
Below we present results for the MTEB(Multilingual, v2) split of the MMTEB benchmark (as of October 21, 2025). Ranking on the MMTEB leaderboards is based on the Borda rank. Each task is treated as a preference voter that casts votes for the models according to their relative performance on that task, with the best model on a task receiving the most votes from it. The model with the highest total number of votes across all tasks obtains the highest rank. The Borda rank therefore tends to prefer models that perform well broadly across tasks, which is why it can differ from the ordering by mean task score.
| Borda Rank | Model | Borda Votes | Mean (Task) | 
|---|---|---|---|
| 1. | llama-embed-nemotron-8b | 39,573 | 69.46 | 
| 2. | gemini-embedding-001 | 39,368 | 68.37 | 
| 3. | Qwen3-Embedding-8B | 39,364 | 70.58 | 
| 4. | Qwen3-Embedding-4B | 39,099 | 69.45 | 
| 5. | Qwen3-Embedding-0.6B | 37,419 | 64.34 | 
| 6. | gte-Qwen2-7B-instruct | 37,167 | 62.51 | 
| 7. | Linq-Embed-Mistral | 37,149 | 61.47 | 
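
The Borda ranking described above can be sketched in a few lines. This is a simplified illustration, not the leaderboard's implementation: tie handling and any vote normalization are assumptions, and the model names and scores are made up.

```python
def borda_votes(task_scores):
    """Sum Borda votes over tasks.

    task_scores maps task name -> {model name: score}. On each task a model
    earns one vote for every model it strictly outperforms. Tied models earn
    the same number of votes, which is a simplification.
    """
    totals = {}
    for scores in task_scores.values():
        for model, score in scores.items():
            beaten = sum(1 for other in scores.values() if other < score)
            totals[model] = totals.get(model, 0) + beaten
    return totals

# Hypothetical scores for three models on three tasks (not real results).
task_scores = {
    "task_a": {"model_x": 70.0, "model_y": 99.0, "model_z": 60.0},
    "task_b": {"model_x": 72.0, "model_y": 60.0, "model_z": 65.0},
    "task_c": {"model_x": 68.0, "model_y": 62.0, "model_z": 60.0},
}

votes = borda_votes(task_scores)
ranking = sorted(votes, key=votes.get, reverse=True)
means = {m: sum(t[m] for t in task_scores.values()) / len(task_scores) for m in votes}

print(votes)    # {'model_x': 5, 'model_y': 3, 'model_z': 1}
print(ranking)  # ['model_x', 'model_y', 'model_z']
```

In this toy data, `model_y` has the highest mean score because of one very strong task, yet `model_x` wins the Borda count by being ahead on two of the three tasks. The same effect explains how a model can rank first in the table above without having the highest Mean (Task) value.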
Data Collection Method by dataset:
- Hybrid: Automated, Human, Synthetic
Labeling Method by dataset:
- Hybrid: Automated, Human, Synthetic 
Properties: More details about the MMTEB benchmark can be found on its leaderboard or in the published paper.
Inference:
Acceleration Engine: GPU 
Test Hardware: A100 80GB, H100 80GB 
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. 
 
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
