Logistic Regression (TF-IDF) for Fake News Detection (Portuguese)
Model Overview
This repository contains Logistic Regression classifiers trained on TF-IDF features for fake news detection in Portuguese. The models are trained and evaluated on corpora derived from the Brazilian Portuguese datasets Fake.br and FakeTrue.Br.
- Architecture: Logistic Regression (scikit-learn)
- Features: TF-IDF over unigrams/bigrams/trigrams
- Task: Binary text classification (Fake vs. True)
- Language: Portuguese (pt)
- Framework: scikit-learn
- Training source code: https://github.com/viniciuszani/portuguese-fake-new-classifiers
Available Variants
- logreg-tfidf-combined: Trained on the combined Fake.br and FakeTrue.Br datasets.
- logreg-tfidf-fake-br: Trained on the Fake.br dataset.
- logreg-tfidf-faketrue-br: Trained on the FakeTrue.Br dataset.
Each variant has its own confusion matrix, classification report, and predictions stored as artifacts.
Training Details
{
  "ngram_upper": 3,
  "min_df": 4,
  "max_df": 0.89364813327021,
  "C": 276.8531184839476,
  "solver": "saga"
}
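As a rough illustration, the hyperparameters above could map onto a scikit-learn pipeline as sketched below. This is not taken from the training code: the step names `tfidf` and `clf`, the helper `build_pipeline`, and the reading of `ngram_upper` as the upper bound of an n-gram range starting at 1 are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hyperparameters copied from the Training Details listing.
PARAMS = {
    "ngram_upper": 3,
    "min_df": 4,
    "max_df": 0.89364813327021,
    "C": 276.8531184839476,
    "solver": "saga",
}


def build_pipeline(params: dict) -> Pipeline:
    """Build a TF-IDF + Logistic Regression pipeline (illustrative layout only)."""
    vectorizer = TfidfVectorizer(
        # Assumption: n-grams run from unigrams up to `ngram_upper`.
        ngram_range=(1, params["ngram_upper"]),
        # Integer min_df: ignore terms that appear in fewer than 4 documents.
        min_df=params["min_df"],
        # Float max_df: ignore terms that appear in more than ~89% of documents.
        max_df=params["max_df"],
    )
    # Larger C means weaker regularization.
    classifier = LogisticRegression(C=params["C"], solver=params["solver"])
    # The step names "tfidf" and "clf" are hypothetical.
    return Pipeline([("tfidf", vectorizer), ("clf", classifier)])


pipeline = build_pipeline(PARAMS)
print(pipeline)
```

Bundling the vectorizer and the classifier in one `Pipeline` lets the fitted object accept raw strings, which matches how the published model is called with plain text.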
Evaluation Results
Evaluation metrics are stored in the repo as:
- confusion_matrix.png
- final_classification_report.parquet
- final_predictions.parquet
These files provide per-class performance and prediction logs for reproducibility.
How to Use
This model is stored as final_model.joblib.
import joblib
from huggingface_hub import hf_hub_download

repo_id = "vzani/portuguese-fake-news-classifier-logreg-tfidf-combined"  # or fake-br / faketrue-br
filename = "final_model.joblib"

# Download the serialized model from the Hub and load it
model_path = hf_hub_download(repo_id=repo_id, filename=filename)
clf = joblib.load(model_path)


def predict(text: str) -> tuple[bool, float]:
    # Probability of class 1 (true news)
    prob = clf.predict_proba([text])[0][1]
    pred = prob >= 0.5
    # For a fake prediction, report the probability of the fake class instead
    prob = prob if pred else 1 - prob
    return bool(pred), float(prob)


if __name__ == "__main__":
    text = "BOMBA! A Dilma vai taxar ainda mais os pobres!"
    print(predict(text))
The expected output is a tuple whose first entry is the classification (True for true news, False for fake news) and whose second entry is the probability assigned to the predicted class (between 0 and 1).
(False, 0.9997890870088925)
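The same decision rule extends to several texts at once. The sketch below assumes, as the snippet above implies, that column 1 of `predict_proba` holds the probability of true news. The helper `predict_batch` and the `StubClassifier` stand-in are hypothetical and exist only so the example runs without downloading the model; in practice the loaded `clf` would be passed instead.

```python
def predict_batch(clf, texts, threshold=0.5):
    """Return a (is_true_news, confidence) pair for each input text."""
    results = []
    for row in clf.predict_proba(texts):
        p_true = float(row[1])  # assumed: column 1 is the "true news" class
        is_true = p_true >= threshold
        # Report the probability of whichever class was predicted.
        confidence = p_true if is_true else 1.0 - p_true
        results.append((is_true, confidence))
    return results


class StubClassifier:
    """Hypothetical stand-in that mimics the predict_proba interface."""

    def predict_proba(self, texts):
        # Toy rule for demonstration only: texts containing "BOMBA" look fake.
        return [[0.75, 0.25] if "BOMBA" in t else [0.25, 0.75] for t in texts]


batch = predict_batch(StubClassifier(), ["BOMBA! Exemplo.", "Exemplo neutro."])
print(batch)
```

Passing a list to `predict_proba` in a single call avoids vectorizing each text separately, which matters when scoring many articles.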
Source code
The source code that produced this model is available at https://github.com/viniciuszani/portuguese-fake-new-classifiers.
The source covers all the steps: data collection, evaluation, hyperparameter tuning, final model tuning, and publishing to Hugging Face. If you use it, please credit the author and/or cite the work.
License
- Base model BERTimbau: Apache 2.0
- Fine-tuned models and corpora: Released under the same license for academic and research use.
Citation
@misc{zani2025portuguesefakenews,
  author = {ZANI, Vinícius Augusto Tagliatti},
  title = {Avaliação comparativa de técnicas de processamento de linguagem natural para a detecção de notícias falsas em Português},
  year = {2025},
  pages = {61},
  address = {São Carlos},
  school = {Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo},
  type = {Trabalho de Conclusão de Curso (MBA em Inteligência Artificial e Big Data)},
  note = {Orientador: Prof. Dr. Ivandre Paraboni}
}
Evaluation results
- accuracy on combined test set (self-reported): 0.937
- precision_macro on combined test set (self-reported): 0.938
- recall_macro on combined test set (self-reported): 0.937
- f1_macro on combined test set (self-reported): 0.937
- precision_weighted on combined test set (self-reported): 0.938
- recall_weighted on combined test set (self-reported): 0.937
- f1_weighted on combined test set (self-reported): 0.937
- n_test_samples on combined test set (self-reported): 2157
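The macro and weighted figures above differ only in how per-class scores are combined. The sketch below shows the two averaging schemes on made-up per-class values; the scores and supports are illustrative and are not taken from this model's classification report.

```python
def macro_average(scores: dict) -> float:
    """Unweighted mean of the per-class scores."""
    return sum(scores.values()) / len(scores)


def weighted_average(scores: dict, support: dict) -> float:
    """Mean of the per-class scores, weighted by each class's sample count."""
    total = sum(support.values())
    return sum(scores[label] * support[label] for label in scores) / total


# Illustrative values only (not this model's results).
f1_per_class = {"fake": 0.5, "true": 1.0}
support = {"fake": 1, "true": 3}

macro = macro_average(f1_per_class)
weighted = weighted_average(f1_per_class, support)
print(macro, weighted)
```

When classes have similar support, the two averages come out close to each other, which would be consistent with the near-identical macro and weighted values listed above, although the per-class counts are not stated here.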