Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought • Paper • 2510.04230 • Published 24 days ago • 26
AccessEval: Benchmarking Disability Bias in Large Language Models • Paper • 2509.22703 • Published Sep 22 • 20
Aligning LLMs for Multilingual Consistency in Enterprise Applications • Paper • 2509.23659 • Published Sep 28 • 20
RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks • Paper • 2509.23673 • Published Sep 28 • 20
PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications • Paper • 2509.23879 • Published Sep 28 • 20
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation • Paper • 2506.00482 • Published May 31 • 8
FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding • Paper • 2505.17330 • Published May 22 • 22
Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems • Paper • 2505.18366 • Published May 23 • 25
SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use • Paper • 2505.17332 • Published May 22 • 31
MVTamperBench: Evaluating Robustness of Vision-Language Models • Paper • 2412.19794 • Published Dec 27, 2024 • 4
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia • Paper • 2503.07920 • Published Mar 10 • 101