	---
	language:
	- pt
	license: apache-2.0
	library_name: scikit-learn
	pipeline_tag: text-classification
	tags:
	- mlp
	- tfidf
	- scikit-learn
	- portuguese
	- pt
	- fake-news
	- binary-classification
	metrics:
	- accuracy
	- precision
	- recall
	- f1-score
	datasets: vzani/corpus-fake-br
	model-index:
	- name: portuguese-fake-news-classifier-mlp-tfidf-fake-br
	  results:
	  - task:
	      type: text-classification
	    dataset:
	      name: Fake.br
	      type: vzani/corpus-fake-br
	      split: test
	    metrics:
	    - name: accuracy
	      type: accuracy
	      value: 0.922917
	    - name: precision_macro
	      type: precision
	      value: 0.923349
	      args:
	        average: macro
	    - name: recall_macro
	      type: recall
	      value: 0.922917
	      args:
	        average: macro
	    - name: f1_macro
	      type: f1
	      value: 0.922897
	      args:
	        average: macro
	    - name: precision_weighted
	      type: precision
	      value: 0.923349
	      args:
	        average: weighted
	    - name: recall_weighted
	      type: recall
	      value: 0.922917
	      args:
	        average: weighted
	    - name: f1_weighted
	      type: f1
	      value: 0.922897
	      args:
	        average: weighted
	    - name: n_test_samples
	      type: num
	      value: 1440
	---
	# MLP (TF-IDF) for Fake News Detection (Portuguese)

	## Model Overview

	This repository contains MLP classifiers trained on TF-IDF features for fake news detection in Portuguese.
	The models are trained and evaluated on corpora derived from the Brazilian Portuguese datasets [Fake.br](https://github.com/roneysco/Fake.br-Corpus) and [FakeTrue.Br](https://github.com/jpchav98/FakeTrue.Br/).

	- Architecture: Multi-Layer Perceptron (scikit-learn)
	- Features: TF-IDF over unigrams/bigrams/trigrams
	- Task: Binary text classification (Fake vs. True)
	- Language: Portuguese (`pt`)
	- Framework: scikit-learn
	- Training source code: https://github.com/viniciuszani/portuguese-fake-new-classifiers

	---

	## Available Variants

	- [mlp-tfidf-combined](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-combined)
	  Trained on the [combined dataset](https://huggingface.co/datasets/vzani/corpus-combined) built from Fake.br and FakeTrue.Br.

	- [mlp-tfidf-fake-br](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-fake-br)
	  Trained on the [Fake.br dataset](https://huggingface.co/datasets/vzani/corpus-fake-br).

	- [mlp-tfidf-faketrue-br](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-faketrue-br)
	  Trained on the [FakeTrue.Br dataset](https://huggingface.co/datasets/vzani/corpus-faketrue-br).

	Each variant has its own confusion matrix, classification report, and predictions stored as artifacts.
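The three variant repositories listed above share a common naming pattern, so a repository ID can be built from the variant suffix. The helper below is a small sketch; `VARIANTS` and `repo_id_for` are hypothetical names introduced here for illustration and are not part of the published code.

```python
# Variant suffixes taken from the repository links listed above.
VARIANTS = ("combined", "fake-br", "faketrue-br")


def repo_id_for(variant: str) -> str:
    """Return the Hugging Face repo ID for one of the listed variants."""
    if variant not in VARIANTS:
        raise ValueError(f"Unknown variant {variant!r}; expected one of {VARIANTS}")
    return f"vzani/portuguese-fake-news-classifier-mlp-tfidf-{variant}"


if __name__ == "__main__":
    for name in VARIANTS:
        print(repo_id_for(name))
```

The returned string can be passed as `repo_id` to `hf_hub_download`.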

	---

	## Training Details

	```python
	{
	    "n_layers": 2,
	    "first_layer_size": 128,
	    "second_layer_size": 64,
	    "ngram_upper": 3,
	    "min_df": 5,
	    "max_df": 0.991954939032491,
	    "activation": "relu",
	    "solver": "lbfgs",
	    "alpha": 0.00014375816817663168,
	    "learning_rate_init": 0.005261446157045498,
	}
	```
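As a rough illustration of how the hyperparameters above could map onto scikit-learn objects, the sketch below builds a `TfidfVectorizer` + `MLPClassifier` pipeline. The mapping is an assumption: `build_pipeline`, the step names `"tfidf"` and `"mlp"`, and the reading of `ngram_upper` as the upper bound of an n-gram range starting at unigrams are introduced here for illustration. The training repository remains the authoritative reference.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Hyperparameters copied from the dictionary above.
params = {
    "n_layers": 2,
    "first_layer_size": 128,
    "second_layer_size": 64,
    "ngram_upper": 3,
    "min_df": 5,
    "max_df": 0.991954939032491,
    "activation": "relu",
    "solver": "lbfgs",
    "alpha": 0.00014375816817663168,
    "learning_rate_init": 0.005261446157045498,
}


def build_pipeline(p: dict) -> Pipeline:
    """Hypothetical helper: turn the hyperparameter dict into an unfitted pipeline."""
    # Assumption: n-grams range from unigrams up to `ngram_upper`.
    vectorizer = TfidfVectorizer(
        ngram_range=(1, p["ngram_upper"]),
        min_df=p["min_df"],
        max_df=p["max_df"],
    )
    # Assumption: the layer sizes are the hidden layers, in order.
    hidden = (p["first_layer_size"], p["second_layer_size"])[: p["n_layers"]]
    mlp = MLPClassifier(
        hidden_layer_sizes=hidden,
        activation=p["activation"],
        solver=p["solver"],
        alpha=p["alpha"],
        learning_rate_init=p["learning_rate_init"],
    )
    return Pipeline([("tfidf", vectorizer), ("mlp", mlp)])


pipeline = build_pipeline(params)
```

In scikit-learn, `learning_rate_init` is only used by the `sgd` and `adam` solvers, so with `solver="lbfgs"` that value should have no effect on training.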

	---

	## Evaluation Results

	Evaluation artifacts are stored in the repository as:
	- `confusion_matrix.png`
	- `final_classification_report.parquet`
	- `final_predictions.parquet`

	These files provide per-class performance and prediction logs for reproducibility.
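To relate per-class figures to the macro and weighted averages reported in the model card metadata, the sketch below derives both from a confusion matrix using only the standard library. The counts are hypothetical and chosen only for illustration; they are not this model's results. `summarize` is a name introduced here.

```python
def summarize(confusion: list[list[int]]) -> dict[str, float]:
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    support = [sum(row) for row in confusion]  # samples per true class
    total = sum(support)

    precision, recall, f1 = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / support[k] if support[k] else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if (p + r) else 0.0)

    def macro(values: list[float]) -> float:
        # Unweighted mean over classes.
        return sum(values) / n

    def weighted(values: list[float]) -> float:
        # Mean over classes, weighted by class support.
        return sum(v * s for v, s in zip(values, support)) / total

    return {
        "accuracy": sum(confusion[k][k] for k in range(n)) / total,
        "precision_macro": macro(precision),
        "precision_weighted": weighted(precision),
        "recall_macro": macro(recall),
        "recall_weighted": weighted(recall),
        "f1_macro": macro(f1),
        "f1_weighted": weighted(f1),
    }


# Hypothetical, class-balanced counts (100 samples per class).
metrics = summarize([[90, 10], [20, 80]])
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

With equal support per class, the macro and weighted averages coincide and macro recall equals accuracy. The identical macro and weighted values in the metadata are what one would expect from a class-balanced test split.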

	---

	## How to Use

	This model is stored as `final_model.joblib`.

	```python
	import joblib
	from huggingface_hub import hf_hub_download

	repo_id = "vzani/portuguese-fake-news-classifier-mlp-tfidf-fake-br" # or combined / faketrue-br
	filename = "final_model.joblib"

	model_path = hf_hub_download(repo_id=repo_id, filename=filename)
	clf = joblib.load(model_path)


	def predict(text: str) -> tuple[bool, float]:
	    # Probability of class index 1, which this snippet treats as "true news".
	    prob = clf.predict_proba([text])[0][1]
	    pred = prob >= 0.5

	    # Report the probability of the predicted class:
	    # when the prediction is fake, use 1 - P(true).
	    prob = prob if pred else 1 - prob
	    return bool(pred), float(prob)


	if __name__ == "__main__":
	    text = "BOMBA! A Dilma vai taxar ainda mais os pobres!"
	    print(predict(text))

	```

	The expected output is a tuple whose first entry is the classification (`True` for true news, `False` for fake news) and whose second entry is the probability assigned to the predicted class (between 0 and 1).
	```
	(False, 1.0)
	```
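The conversion from a raw probability to the `(label, confidence)` tuple can be isolated and checked without downloading the model. This is a minimal sketch, assuming the input is the probability of the "true news" class as in the snippet above. `label_and_confidence` and its `threshold` parameter are hypothetical names introduced here.

```python
def label_and_confidence(p_true: float, threshold: float = 0.5) -> tuple[bool, float]:
    """Map P(true news) to (is_true, probability of the predicted class)."""
    is_true = p_true >= threshold
    # For a "fake" prediction, the predicted class has probability 1 - P(true).
    confidence = p_true if is_true else 1.0 - p_true
    return is_true, confidence


if __name__ == "__main__":
    for p in (0.0, 0.25, 0.8):
        print(p, label_and_confidence(p))
```

A text scored with `P(true) = 0.0` maps to `(False, 1.0)`, the same shape as the output shown above.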

	## Source code

	You can find the source code that produced this model in the repository below:
	- https://github.com/viniciuszani/portuguese-fake-new-classifiers

	The source covers every step, from data collection and evaluation through hyperparameter tuning and final model training to publishing on Hugging Face.
	If you use it, please remember to credit the author and/or cite the work.

	## License

	- Trained models and corpora: released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license for academic and research use.

	## Citation

	```bibtex
	@misc{zani2025portuguesefakenews,
	  author = {ZANI, Vinícius Augusto Tagliatti},
	  title = {Avaliação comparativa de técnicas de processamento de linguagem natural para a detecção de notícias falsas em Português},
	  year = {2025},
	  pages = {61},
	  address = {São Carlos},
	  school = {Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo},
	  type = {Trabalho de Conclusão de Curso (MBA em Inteligência Artificial e Big Data)},
	  note = {Orientador: Prof. Dr. Ivandre Paraboni}
	}
	```