TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation
Abstract
TensorBLEU is a GPU-accelerated BLEU metric implementation for efficient in-training evaluation of natural language processing models, offering significant speedups over CPU-based methods.
Modern natural language processing models have achieved unprecedented scale, yet the tools for their evaluation often remain a computational bottleneck, limiting the pace of research. This is particularly acute for in-training evaluation metrics, such as per-sentence reward signals in Reinforcement Learning, which must operate efficiently on batches of token IDs directly on the GPU. In this paper, we introduce TensorBLEU, a novel implementation of the BLEU metric designed from the ground up for this specific use case. Our approach is fully vectorized for GPU-accelerated, per-sentence computation within PyTorch and introduces a memory-efficient counting mechanism. By creating a compact, batch-specific dictionary of n-grams using torch.unique, our method avoids the prohibitive memory costs of traditional hashing-based vectorization, making it practical for large-vocabulary models. We benchmark TensorBLEU against NLTK, the standard library for token-ID-based BLEU calculation on the CPU. Experiments show that TensorBLEU provides speedups of over 13x on consumer-grade GPUs (NVIDIA T4) and over 40x on data-center-class hardware (NVIDIA A100). This performance transforms a significant bottleneck into a negligible part of the training loop. By clearly defining its role as a "Token-ID BLEU" for development purposes and open-sourcing our implementation, we provide a powerful tool for accelerating research in areas like RL-based model fine-tuning.
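The core idea described in the abstract, building a compact batch-specific n-gram dictionary with torch.unique and counting against it rather than hashing into a vocabulary-sized space, can be illustrated with a short PyTorch sketch. This is not the paper's implementation: the function names (extract_ngrams, batched_ngram_precision), the single-reference setup, and the assumption that all sequences share one length and contain no padding are simplifications introduced here.

```python
import torch


def extract_ngrams(ids: torch.Tensor, n: int) -> torch.Tensor:
    # ids: (batch, seq_len) int64 token IDs -> (batch, seq_len - n + 1, n) sliding n-grams
    return ids.unfold(dimension=1, size=n, step=1)


def batched_ngram_precision(cand_ids: torch.Tensor, ref_ids: torch.Tensor, n: int) -> torch.Tensor:
    """Clipped (modified) n-gram precision for each sentence in the batch."""
    cand = extract_ngrams(cand_ids, n)                      # (B, Tc, n)
    ref = extract_ngrams(ref_ids, n)                        # (B, Tr, n)
    B, Tc, _ = cand.shape
    Tr = ref.shape[1]

    # Compact dictionary: only n-grams that actually occur in this batch get an ID,
    # avoiding a hash space of size vocab**n.
    all_ngrams = torch.cat([cand.reshape(-1, n), ref.reshape(-1, n)], dim=0)
    _, inverse = torch.unique(all_ngrams, dim=0, return_inverse=True)
    num_ids = int(inverse.max().item()) + 1

    cand_idx = inverse[: B * Tc].view(B, Tc)                # dictionary ID of each candidate n-gram
    ref_idx = inverse[B * Tc:].view(B, Tr)                  # dictionary ID of each reference n-gram

    # Per-sentence occurrence counts via scatter_add into (B, num_ids) tables.
    cand_counts = torch.zeros(B, num_ids, device=cand_ids.device)
    ref_counts = torch.zeros(B, num_ids, device=cand_ids.device)
    cand_counts.scatter_add_(1, cand_idx, torch.ones(B, Tc, device=cand_ids.device))
    ref_counts.scatter_add_(1, ref_idx, torch.ones(B, Tr, device=cand_ids.device))

    # Clip candidate counts by reference counts, then normalise per sentence.
    clipped = torch.minimum(cand_counts, ref_counts).sum(dim=1)
    return clipped / Tc
```

With cand_ids and ref_ids of shape (batch, seq_len) already on the GPU, batched_ngram_precision(cand_ids, ref_ids, n=2) returns a (batch,)-shaped tensor of per-sentence bigram precisions; a full BLEU score would additionally combine the precisions for n = 1..4 in a geometric mean and apply a brevity penalty.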
Community
An optimization of the BLEU score on the GPU (with n-grams over token IDs) for in-training evaluation. In our case, Memory Reinforcement Learning for the Reactive Transformer (https://huggingface.co/papers/2510.03561), we use BLEU combined with cosine similarity (calculated on the CPU) to compute rewards, and copying data between devices creates a noticeable bottleneck in the RL loop. TensorBLEU moves all of the vectorized calculations to the GPU, which resolves these efficiency problems; a rough sketch of such a reward computation is shown below.
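For concreteness, a reward of this kind might be assembled as follows. This is illustrative only: the callable tensor_bleu, the embedding tensors, and the equal 0.5/0.5 weighting are assumptions, not the configuration actually used in MRL.

```python
import torch
import torch.nn.functional as F


def sentence_rewards(tensor_bleu,            # callable: (cand_ids, ref_ids) -> (batch,) BLEU on GPU
                     cand_ids: torch.Tensor, ref_ids: torch.Tensor,
                     cand_embeds: torch.Tensor, ref_embeds: torch.Tensor,
                     bleu_weight: float = 0.5) -> torch.Tensor:
    # Both reward terms stay on the GPU, so the RL loop needs no host-device copies.
    bleu = tensor_bleu(cand_ids, ref_ids)                        # (batch,)
    cos = F.cosine_similarity(cand_embeds, ref_embeds, dim=-1)   # (batch,)
    return bleu_weight * bleu + (1.0 - bleu_weight) * cos
```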
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LoRAFusion: Efficient LoRA Fine-Tuning for LLMs (2025)
- Profiling LoRA/QLoRA Fine-Tuning Efficiency on Consumer GPUs: An RTX 4060 Case Study (2025)
- FlexCTC: GPU-powered CTC Beam Decoding With Advanced Contextual Abilities (2025)
- Improving Low-Resource Translation with Dictionary-Guided Fine-Tuning and RL: A Spanish-to-Wayuunaiki Study (2025)
- APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration (2025)
- MoE-Inference-Bench: Performance Evaluation of Mixture of Expert Large Language and Vision Models (2025)
- Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions (2025)
Adam Filipek