---
language:
- pt
license: apache-2.0
library_name: scikit-learn
pipeline_tag: text-classification
tags:
- mlp
- tfidf
- scikit-learn
- portuguese
- pt
- fake-news
- binary-classification
metrics:
- accuracy
- precision
- recall
- f1-score
datasets: vzani/corpus-fake-br
model-index:
- name: portuguese-fake-news-classifier-mlp-tfidf-fake-br
  results:
  - task:
      type: text-classification
    dataset:
      name: Fake.br
      type: vzani/corpus-fake-br
      split: test
    metrics:
    - name: accuracy
      type: accuracy
      value: 0.922917
    - name: precision_macro
      type: precision
      value: 0.923349
      args:
        average: macro
    - name: recall_macro
      type: recall
      value: 0.922917
      args:
        average: macro
    - name: f1_macro
      type: f1
      value: 0.922897
      args:
        average: macro
    - name: precision_weighted
      type: precision
      value: 0.923349
      args:
        average: weighted
    - name: recall_weighted
      type: recall
      value: 0.922917
      args:
        average: weighted
    - name: f1_weighted
      type: f1
      value: 0.922897
      args:
        average: weighted
    - name: n_test_samples
      type: num
      value: 1440
---

# MLP (TF-IDF) for Fake News Detection (Portuguese)

## Model Overview

This repository contains **MLP classifiers trained on TF-IDF features** for **fake news detection in Portuguese**.
The models are trained and evaluated on corpora derived from the Brazilian Portuguese datasets **[Fake.br](https://github.com/roneysco/Fake.br-Corpus)** and **[FakeTrue.Br](https://github.com/jpchav98/FakeTrue.Br/)**.

- **Architecture**: Multi-Layer Perceptron (scikit-learn)
- **Features**: TF-IDF over unigrams, bigrams, and trigrams
- **Task**: Binary text classification (Fake vs. True)
- **Language**: Portuguese (`pt`)
- **Framework**: scikit-learn
- **Training source code**: https://github.com/viniciuszani/portuguese-fake-new-classifiers

---

## Available Variants

- [**mlp-tfidf-combined**](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-combined)
  Trained on the [combined dataset](https://huggingface.co/datasets/vzani/corpus-combined) built from Fake.br and FakeTrue.Br.

- [**mlp-tfidf-fake-br**](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-fake-br)
  Trained on the [Fake.br dataset](https://huggingface.co/datasets/vzani/corpus-fake-br).

- [**mlp-tfidf-faketrue-br**](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-faketrue-br)
  Trained on the [FakeTrue.Br dataset](https://huggingface.co/datasets/vzani/corpus-faketrue-br).

Each variant has its own confusion matrix, classification report, and predictions stored as artifacts.
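
Because the three repositories listed above differ only in their suffix, switching between variants amounts to changing one string. The following is a minimal sketch, assuming that naming pattern holds; `VARIANTS` and `repo_id_for` are hypothetical helper names and are not part of the published code.

```python
# Variant suffixes taken from the repository names listed above
VARIANTS = ("combined", "fake-br", "faketrue-br")


def repo_id_for(variant: str) -> str:
    """Build the Hugging Face repo id for one of the MLP TF-IDF variants."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return f"vzani/portuguese-fake-news-classifier-mlp-tfidf-{variant}"


print(repo_id_for("fake-br"))
# vzani/portuguese-fake-news-classifier-mlp-tfidf-fake-br
```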

---

## Training Details

```python
{
    "n_layers": 2,
    "first_layer_size": 128,
    "second_layer_size": 64,
    "ngram_upper": 3,
    "min_df": 5,
    "max_df": 0.991954939032491,
    "activation": "relu",
    "solver": "lbfgs",
    "alpha": 0.00014375816817663168,
    "learning_rate_init": 0.005261446157045498,
}
```
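
Some of these keys (`n_layers`, `first_layer_size`, `second_layer_size`, `ngram_upper`) are not scikit-learn argument names, so one plausible mapping onto `TfidfVectorizer` and `MLPClassifier` is sketched here. It assumes that `ngram_upper` is the upper bound of `ngram_range`, that the layer sizes form `hidden_layer_sizes`, and that the vectorizer and classifier are chained in a pipeline. The helper names are hypothetical, and the author's training code may differ. Note that scikit-learn only uses `learning_rate_init` with the `sgd` and `adam` solvers, so it should have no effect under `lbfgs`.

```python
PARAMS = {
    "n_layers": 2,
    "first_layer_size": 128,
    "second_layer_size": 64,
    "ngram_upper": 3,
    "min_df": 5,
    "max_df": 0.991954939032491,
    "activation": "relu",
    "solver": "lbfgs",
    "alpha": 0.00014375816817663168,
    "learning_rate_init": 0.005261446157045498,
}


def tfidf_kwargs(params: dict) -> dict:
    """Arguments for TfidfVectorizer (assumed mapping)."""
    return {
        "ngram_range": (1, params["ngram_upper"]),  # unigrams up to trigrams
        "min_df": params["min_df"],
        "max_df": params["max_df"],
    }


def mlp_kwargs(params: dict) -> dict:
    """Arguments for MLPClassifier (assumed mapping)."""
    sizes = (params["first_layer_size"], params["second_layer_size"])
    return {
        "hidden_layer_sizes": sizes[: params["n_layers"]],
        "activation": params["activation"],
        "solver": params["solver"],
        "alpha": params["alpha"],
        "learning_rate_init": params["learning_rate_init"],
    }


def build_pipeline(params: dict):
    """Chain the vectorizer and the classifier; requires scikit-learn."""
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    return make_pipeline(
        TfidfVectorizer(**tfidf_kwargs(params)),
        MLPClassifier(**mlp_kwargs(params)),
    )
```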

---

## Evaluation Results

Evaluation metrics are stored in the repo as:
- `confusion_matrix.png`
- `final_classification_report.parquet`
- `final_predictions.parquet`

These files provide per-class performance and prediction logs for reproducibility.
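
The headline metrics in the model card metadata can be recomputed from a 2x2 confusion matrix. The sketch below uses hypothetical counts, chosen only because they sum to the 1,440 test samples and are consistent with the reported values. They are not read from `confusion_matrix.png`, and the class order is arbitrary. It also illustrates why macro and weighted averages can coincide: when both classes have equal support, as in these hypothetical counts, the two weightings are identical.

```python
# Hypothetical counts: rows are the actual class, columns the predicted class.
CONFUSION = [[676, 44],
             [67, 653]]


def per_class_scores(cm):
    """Return (precision, recall, f1, support) for each class of a square matrix."""
    scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        predicted = sum(row[k] for row in cm)  # column total
        support = sum(cm[k])                   # row total
        precision = tp / predicted
        recall = tp / support
        f1 = 2 * precision * recall / (precision + recall)
        scores.append((precision, recall, f1, support))
    return scores


def average(scores, index, weighted=False):
    """Macro average by default; support-weighted when `weighted` is True."""
    weights = [s[3] if weighted else 1 for s in scores]
    return sum(w * s[index] for w, s in zip(weights, scores)) / sum(weights)


scores = per_class_scores(CONFUSION)
total = sum(sum(row) for row in CONFUSION)
accuracy = sum(CONFUSION[k][k] for k in range(len(CONFUSION))) / total

print(round(accuracy, 6))            # 0.922917
print(round(average(scores, 0), 6))  # 0.923349 (macro precision)
print(round(average(scores, 2), 6))  # 0.922897 (macro F1)
```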

---

## How to Use

This model is stored as `final_model.joblib`.

```python
import joblib
from huggingface_hub import hf_hub_download

repo_id = "vzani/portuguese-fake-news-classifier-mlp-tfidf-fake-br"  # or combined / faketrue-br
filename = "final_model.joblib"

# Download the serialized model from the Hub and load it
model_path = hf_hub_download(repo_id=repo_id, filename=filename)
clf = joblib.load(model_path)


def predict(text: str) -> tuple[bool, float]:
    # Probability of class 1 (true news)
    prob = clf.predict_proba([text])[0][1]
    pred = prob >= 0.5

    # For a fake prediction, report the probability of the fake class instead
    prob = prob if pred else 1 - prob
    return bool(pred), float(prob)


if __name__ == "__main__":
    text = "BOMBA! A Dilma vai taxar ainda mais os pobres!"
    print(predict(text))
```

The expected output is a tuple: the first entry is the classification (`True` for true news, `False` for fake news) and the second is the probability assigned to the predicted class (ranging from 0 to 1.0).

```
(False, 1.0)
```
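
To make the tuple convention concrete, the thresholding step inside `predict` can be isolated as a pure function that needs no model download. This is a minimal sketch: `interpret` and its `threshold` parameter are hypothetical names that do not exist in the repository, and the input is assumed to be the model's probability for the true-news class.

```python
def interpret(prob_true: float, threshold: float = 0.5) -> tuple[bool, float]:
    """Map P(true news) to (is_true, probability of the predicted class)."""
    is_true = prob_true >= threshold
    # Report the probability of whichever class was predicted
    confidence = prob_true if is_true else 1.0 - prob_true
    return is_true, confidence


print(interpret(0.0))   # (False, 1.0)
print(interpret(0.25))  # (False, 0.75)
print(interpret(0.5))   # (True, 0.5)
```

A result of `(False, 1.0)` therefore means the text was classified as fake with the highest possible probability, not that the probability of fake is zero.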

## Source code

The source code that produced this model is available in the repository below:
- https://github.com/viniciuszani/portuguese-fake-new-classifiers

It covers every step: data collection, evaluation, hyperparameter tuning, final model training, and publishing to Hugging Face.
If you use it, please remember to credit the author and/or cite the work.

## License

- License: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- Trained models and corpora: released under the same license for academic and research use.

## Citation

```bibtex
@misc{zani2025portuguesefakenews,
  author  = {ZANI, Vinícius Augusto Tagliatti},
  title   = {Avaliação comparativa de técnicas de processamento de linguagem natural para a detecção de notícias falsas em Português},
  year    = {2025},
  pages   = {61},
  address = {São Carlos},
  school  = {Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo},
  type    = {Trabalho de Conclusão de Curso (MBA em Inteligência Artificial e Big Data)},
  note    = {Orientador: Prof. Dr. Ivandre Paraboni}
}
```