arxiv:2510.04849

When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA

Published on Oct 6 · Submitted by Vasily Konovalov on Oct 17
#1 Paper of the day
Abstract

PsiloQA, a multilingual dataset with span-level hallucinations, enhances hallucination detection in large language models across 14 languages using an automated pipeline and encoder-based models.

AI-generated summary

Hallucination detection remains a fundamental challenge for the safe and reliable deployment of large language models (LLMs), especially in applications requiring factual accuracy. Existing hallucination benchmarks often operate at the sequence level and are limited to English, lacking the fine-grained, multilingual supervision needed for a comprehensive evaluation. In this work, we introduce PsiloQA, a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. PsiloQA is constructed through an automated three-stage pipeline: generating question-answer pairs from Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse LLMs in a no-context setting, and automatically annotating hallucinated spans using GPT-4o by comparing against golden answers and retrieved context. We evaluate a wide range of hallucination detection methods -- including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models -- and show that encoder-based models achieve the strongest performance across languages. Furthermore, PsiloQA demonstrates effective cross-lingual generalization and supports robust knowledge transfer to other benchmarks, all while being significantly more cost-efficient than human-annotated datasets. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.
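To make the pipeline concrete, here is a minimal sketch of the three stages in Python. The prompts, helper functions, and use of the OpenAI chat API are illustrative assumptions, not the paper's exact implementation:

```python
# Minimal sketch of a PsiloQA-style three-stage pipeline.
# All prompts, model choices, and helper names below are illustrative
# assumptions, not the paper's exact implementation.
from openai import OpenAI

client = OpenAI()

def chat(model: str, prompt: str) -> str:
    """Send a single-turn prompt to a chat model and return the reply."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# Stage 1: generate a question-answer pair from a Wikipedia passage.
def generate_qa(passage: str) -> tuple[str, str]:
    out = chat("gpt-4o",
               "From the passage below, write one factual question and its "
               "answer.\nFormat:\nQ: ...\nA: ...\n\n" + passage)
    q, a = out.split("\nA:", 1)
    return q.removeprefix("Q:").strip(), a.strip()

# Stage 2: elicit a possibly hallucinated answer from another LLM,
# deliberately withholding the passage (the no-context setting).
def elicit_answer(llm: str, question: str) -> str:
    return chat(llm, f"Answer concisely: {question}")

# Stage 3: have GPT-4o mark unsupported spans in the hypothesis answer
# by comparing it against the golden answer and retrieved context.
def annotate_spans(question: str, golden: str, context: str,
                   hypothesis: str) -> str:
    return chat("gpt-4o",
                "Wrap every span of the hypothesis that is unsupported by "
                "the golden answer and context in <hal>...</hal> tags.\n"
                f"Question: {question}\nGolden answer: {golden}\n"
                f"Context: {context}\nHypothesis: {hypothesis}")
```

The detection side can be framed as token classification over the answer text. Below is a minimal sketch with a multilingual encoder; the checkpoint and the binary label scheme are assumptions for illustration, and the classification head is only meaningful after fine-tuning on span-labeled data such as PsiloQA:

```python
# Minimal sketch of span-level hallucination detection as token
# classification with a multilingual encoder (assumed setup).
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)  # 0 = supported, 1 = hallucinated
# Note: the head is randomly initialized until fine-tuned on span labels.

def predict_spans(text: str) -> list[tuple[int, int]]:
    """Return character-level (start, end) spans predicted as hallucinated."""
    enc = tokenizer(text, return_offsets_mapping=True,
                    return_tensors="pt", truncation=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        labels = model(**enc).logits.argmax(-1)[0].tolist()
    # Merge consecutive hallucinated tokens into character spans.
    spans: list[tuple[int, int]] = []
    start = end = None
    for (s, e), lab in zip(offsets, labels):
        if lab == 1 and s != e:  # skip special tokens with (0, 0) offsets
            start = s if start is None else start
            end = e
        elif start is not None:
            spans.append((start, end))
            start = end = None
    if start is not None:
        spans.append((start, end))
    return spans
```

Per the abstract, encoders fine-tuned this way achieved the strongest performance across languages, ahead of uncertainty-quantification and LLM-based tagging baselines.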

Community


It's strange that this paper received over 100 upvotes, since it seems to address a relatively niche area compared to the other papers.


Hallucination detection is pretty much at the core of any application of LLMs, imo, so this topic is quite important. If we can't be sure about the reliability of the generation, we can't really use it. But I agree that the other papers of the day were really strong and possibly more technically elaborate.


Models citing this paper: 0

Datasets citing this paper: 1

Spaces citing this paper: 0

Collections including this paper: 4