arXiv:2510.26345

MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data

Published on Oct 30 · Submitted by Mykhailo Poliakov on Nov 3

AI-generated summary

Introducing synthetic fallacy data through MisSynth enhances the zero-shot classification performance of large language models in detecting scientific misinformation.

Abstract

Health-related misinformation is widespread and potentially harmful, and it is difficult to identify, especially when claims distort or misinterpret scientific findings. We investigate the impact of synthetic data generation and lightweight fine-tuning techniques on the ability of large language models (LLMs) to recognize fallacious arguments, using the MISSCI dataset and framework. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples, which are then used to fine-tune an LLM. Our results show substantial accuracy gains for fine-tuned models compared to vanilla baselines. For instance, the fine-tuned LLaMA 3.1 8B model achieved an absolute F1-score improvement of over 35% on the MISSCI test split over its vanilla baseline. We demonstrate that introducing synthetic fallacy data to augment limited annotated resources can significantly enhance zero-shot LLM classification performance on real-world scientific misinformation tasks, even with limited computational resources. The code and synthetic dataset are available at https://github.com/mxpoliakov/MisSynth.
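
As a minimal, hypothetical sketch of the retrieve-then-generate step such a pipeline could use (not the authors' implementation), the snippet below retrieves the passage most relevant to a claim with TF-IDF and assembles a prompt asking an LLM to write a fallacious argument of a given class; the passages, claim, prompt wording, and fallacy label are illustrative placeholders.

```python
# Hypothetical RAG-style synthetic fallacy generation sketch: retrieve scientific
# context for an inaccurate claim, then prompt an LLM to produce a labeled
# fallacious argument that can serve as a synthetic fine-tuning sample.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative passages; the real pipeline would draw on the publications
# referenced by MISSCI claims.
passages = [
    "The study observed a correlation between vitamin D levels and outcomes.",
    "The trial was conducted in vitro and did not involve human subjects.",
]

def retrieve_context(claim: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the claim (TF-IDF + cosine)."""
    matrix = TfidfVectorizer().fit_transform(docs + [claim])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return [docs[i] for i in scores.argsort()[::-1][:k]]

def build_generation_prompt(claim: str, context: list[str], fallacy_class: str) -> str:
    """Assemble the prompt sent to the generator LLM (placeholder wording)."""
    joined = "\n".join(context)
    return (
        f"Scientific context:\n{joined}\n\n"
        f"Inaccurate claim: {claim}\n\n"
        f"Write a short argument supporting the claim that commits the "
        f"'{fallacy_class}' fallacy, and state the fallacious premise."
    )

claim = "Vitamin D cures respiratory infections."
prompt = build_generation_prompt(claim, retrieve_context(claim, passages), "Causal Oversimplification")
print(prompt)  # The LLM's completion would become one synthetic training sample.
```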

Community

Mykhailo Poliakov (paper author and submitter)

MisSynth combines Retrieval-Augmented Generation (RAG) to create context-aware synthetic data and Parameter-Efficient Fine-Tuning (PEFT) to fine-tune models. Our main claim is that this pipeline can significantly improve the performance of Large Language Models (LLMs) on the complex task of classifying scientific logical fallacies from the MISSCI dataset.
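
As a rough illustration of the PEFT side of the pipeline, here is a minimal LoRA setup sketch using the Hugging Face transformers and peft libraries; the base checkpoint, adapter rank, and target modules below are assumptions for illustration, not the configuration reported in the paper.

```python
# Minimal LoRA (PEFT) sketch: wrap a base model with low-rank adapters so only
# a small fraction of parameters is trained on the synthetic fallacy samples.
# Hyperparameters and the base checkpoint are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Llama-3.1-8B"  # assumed base checkpoint (gated on the Hub)
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (illustrative)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# The synthetic fallacy samples would then be tokenized and passed to a standard
# supervised fine-tuning loop (e.g. transformers.Trainer or TRL's SFTTrainer).
```

After training, only the small adapter needs to be saved; it can be loaded on top of the base model at inference time for the zero-shot fallacy-classification prompts.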

We support these claims primarily through benchmark comparisons on the MISSCI test split.

Substantial Performance Gains

The core evidence is the improvement in F1 score for each fine-tuned model over its vanilla baseline; the short check after this list reproduces the reported absolute gains.

  • LLaMA 3.1 (8B): F1-score improved from 0.334 (vanilla) to 0.711 (fine-tuned), an absolute gain of 37.7%.
  • LLaMA 2 (13B): F1-score improved from 0.218 (vanilla) to 0.681 (fine-tuned), an absolute gain of 46.3%.
  • Gemma 3 (4B): F1-score improved from 0.377 (vanilla) to 0.691 (fine-tuned), an absolute gain of 31.4%.
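
The absolute gains above are simply the fine-tuned F1 minus the vanilla F1, expressed in percentage points; a quick check using only the scores reported in the list:

```python
# Reproduce the absolute F1 gains (in percentage points) from the reported scores.
scores = {
    "LLaMA 3.1 (8B)": (0.334, 0.711),
    "LLaMA 2 (13B)": (0.218, 0.681),
    "Gemma 3 (4B)": (0.377, 0.691),
}
for name, (vanilla, finetuned) in scores.items():
    print(f"{name}: +{(finetuned - vanilla) * 100:.1f} points")
# -> +37.7, +46.3, +31.4
```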

Outperforming Larger Models

The fine-tuned models surpassed the performance of much larger, state-of-the-art models reported in the original MISSCI paper.

  • The fine-tuned Mistral Small 3.2 (24B) achieved the highest F1-score of 0.718.
  • This model, along with the fine-tuned LLaMA 3.1 (0.711) and Phi-4 (0.705), outperformed vanilla GPT-4, which scored an F1 of 0.649.
  • The fine-tuned LLaMA 2 (13B) (F1 0.681) significantly outperformed the vanilla LLaMA 2 (70B) (F1 0.464).

