Update README.md
README.md
CHANGED
|
@@ -70,33 +70,29 @@ model-index:
|
|
| 70 |
## Model Overview
|
| 71 |
|
| 72 |
This repository contains **MLP classifiers trained on TF-IDF features** for **fake news detection in Portuguese**.
|
| 73 |
-
|
| 74 |
|
| 75 |
- **Architecture**: Multi-Layer Perceptron (scikit-learn)
|
| 76 |
-
- **Features**: TF-IDF over unigrams/bigrams
|
| 77 |
- **Task**: Binary text classification (Fake vs. True)
|
| 78 |
- **Language**: Portuguese (`pt`)
|
| 79 |
- **Framework**: scikit-learn
|
| 80 |
|
| 81 |
---
|
| 82 |
|
| 83 |
## Available Variants
|
| 84 |
|
| 85 |
-
- **mlp-tfidf-combined**
|
| 86 |
-
|
| 87 |
|
| 88 |
-
- **mlp-tfidf-fake-br**
|
| 89 |
-
|
| 90 |
|
| 91 |
-
- **mlp-tfidf-faketrue-br**
|
| 92 |
-
|
| 93 |
-
Includes aligned splits and the original CSV when available.
|
| 94 |
|
| 95 |
-
Each variant
|
| 96 |
-
- `final_model.joblib`
|
| 97 |
-
- `confusion_matrix.png`
|
| 98 |
-
- `final_classification_report.parquet`
|
| 99 |
-
- `final_predictions.parquet`
|
| 100 |
|
| 101 |
---
|
| 102 |
|
|
@@ -130,29 +126,19 @@ These files provide per-class performance and prediction logs for reproducibilit
|
|
| 130 |
|
| 131 |
---
|
| 132 |
|
| 133 |
-
## Corpus
|
| 134 |
-
|
| 135 |
-
The corpora used for training and evaluation are provided in the `corpus/` folder.
|
| 136 |
-
|
| 137 |
-
- **Combined (root folder)**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
|
| 138 |
-
- **Fake.br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
|
| 139 |
-
- **FakeTrue.Br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet` and `FakeTrueBr_corpus.csv`.
|
| 140 |
-
|
| 141 |
-
---
|
| 142 |
-
|
| 143 |
## How to Use
|
| 144 |
|
| 145 |
-
This model is
|
| 146 |
|
| 147 |
```python
|
| 148 |
import joblib
|
| 149 |
from huggingface_hub import hf_hub_download
|
| 150 |
|
| 151 |
-
repo_id = "vzani/portuguese-fake-news-classifier-mlp-tfidf-
|
| 152 |
filename = "final_model.joblib"
|
| 153 |
|
| 154 |
model_path = hf_hub_download(repo_id=repo_id, filename=filename)
|
| 155 |
-
clf = joblib.load(model_path)
|
| 156 |
|
| 157 |
|
| 158 |
def predict(text: str) -> tuple[bool, float]:
|
|
@@ -175,10 +161,30 @@ The expected output is a Tuple where the first entry represents the classificati
|
|
| 175 |
(False, 1.0)
|
| 176 |
```
|
| 177 |
|
| 178 |
## License
|
| 179 |
|
| 180 |
-
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
|
| 181 |
|
| 182 |
## Citation
|
| 183 |
|
| 184 |
-
|
| 70 |
## Model Overview
|
| 71 |
|
| 72 |
This repository contains **MLP classifiers trained on TF-IDF features** for **fake news detection in Portuguese**.
|
| 73 |
+
The model is trained and evaluated on corpora derived from the Brazilian Portuguese dataset **[FakeTrue.Br](https://github.com/jpchav98/FakeTrue.Br/)**.
|
| 74 |
|
| 75 |
- **Architecture**: Multi-Layer Perceptron (scikit-learn)
|
| 76 |
+
- **Features**: TF-IDF over unigrams/bigrams/trigrams
|
| 77 |
- **Task**: Binary text classification (Fake vs. True)
|
| 78 |
- **Language**: Portuguese (`pt`)
|
| 79 |
- **Framework**: scikit-learn
|
| 80 |
+
- **Training source code**: https://github.com/viniciuszani/portuguese-fake-new-classifiers
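
As a rough illustration of what "unigrams/bigrams/trigrams" means for the feature space, the sketch below enumerates the word n-grams of a short sentence. It is not the repository's vectorizer: the whitespace tokenization, the lowercasing, and the `word_ngrams` helper are assumptions made for this example, and the actual settings are defined in the training source code.

```python
def word_ngrams(text: str, n_min: int = 1, n_max: int = 3) -> list[str]:
    """Return all word n-grams of length n_min..n_max, shortest first."""
    # Assumption: lowercase + whitespace split stands in for the real tokenizer.
    tokens = text.lower().split()
    grams: list[str] = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


# Four tokens yield 4 unigrams, 3 bigrams, and 2 trigrams.
print(word_ngrams("o governo anunciou medidas"))
```

A TF-IDF vectorizer then weights each of these n-grams by how often it occurs in a document relative to how common it is across the corpus.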
|
| 81 |
|
| 82 |
---
|
| 83 |
|
| 84 |
## Available Variants
|
| 85 |
|
| 86 |
+
- [**mlp-tfidf-combined**](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-combined)
|
| 87 |
+
Trained on the [combined dataset](https://huggingface.co/datasets/vzani/corpus-combined) built from Fake.br and FakeTrue.Br.
|
| 88 |
|
| 89 |
+
- [**mlp-tfidf-fake-br**](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-fake-br)
|
| 90 |
+
Trained on the [Fake.br dataset](https://huggingface.co/datasets/vzani/corpus-fake-br).
|
| 91 |
|
| 92 |
+
- [**mlp-tfidf-faketrue-br**](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-faketrue-br)
|
| 93 |
+
Trained on the [FakeTrue.Br dataset](https://huggingface.co/datasets/vzani/corpus-faketrue-br).
|
| 94 |
|
| 95 |
+
Each variant has its own confusion matrix, classification report, and predictions stored as artifacts.
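
The three variant repositories listed above share a common naming prefix, so a small helper can select one by its short name. The `repo_id_for` function and the `VARIANTS` tuple are illustrative names for this sketch, not part of the published code.

```python
# Short names taken from the variant list above.
VARIANTS = ("combined", "fake-br", "faketrue-br")
REPO_PREFIX = "vzani/portuguese-fake-news-classifier-mlp-tfidf-"


def repo_id_for(variant: str) -> str:
    """Build the Hugging Face repo id for one of the listed variants."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return REPO_PREFIX + variant


print(repo_id_for("fake-br"))
```

The resulting string can be passed as `repo_id` when downloading `final_model.joblib`.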
|
| 96 |
|
| 97 |
---
|
| 98 |
|
|
|
|
| 126 |
|
| 127 |
---
|
| 128 |
|
| 129 |
## How to Use
|
| 130 |
|
| 131 |
+
This model is stored as `final_model.joblib`.
|
| 132 |
|
| 133 |
```python
|
| 134 |
import joblib
|
| 135 |
from huggingface_hub import hf_hub_download
|
| 136 |
|
| 137 |
+
repo_id = "vzani/portuguese-fake-news-classifier-mlp-tfidf-faketrue-br" # or fake-br / combined
|
| 138 |
filename = "final_model.joblib"
|
| 139 |
|
| 140 |
model_path = hf_hub_download(repo_id=repo_id, filename=filename)
|
| 141 |
+
clf = joblib.load(model_path)
|
| 142 |
|
| 143 |
|
| 144 |
def predict(text: str) -> tuple[bool, float]:
|
|
|
|
| 161 |
(False, 1.0)
|
| 162 |
```
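
The returned tuple can be turned into a readable message with a small formatter. This is only a sketch: the labels are passed in explicitly rather than assumed, treating the second entry as a score between 0 and 1 is an assumption, and `describe` is a hypothetical helper name.

```python
def describe(result: tuple[bool, float], label_if_true: str, label_if_false: str) -> str:
    """Format a (classification, score) tuple as a short message."""
    flag, score = result
    label = label_if_true if flag else label_if_false
    return f"{label} (score: {score:.2f})"


# Placeholder labels; substitute the meaning documented for predict().
print(describe((False, 1.0), label_if_true="class A", label_if_false="class B"))
```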
|
| 163 |
|
| 164 |
+
## Source code
|
| 165 |
+
|
| 166 |
+
You can find the source code that produced this model in the repository below:
|
| 167 |
+
- https://github.com/viniciuszani/portuguese-fake-new-classifiers
|
| 168 |
+
|
| 169 |
+
The repository covers every step: data collection, evaluation, hyperparameter tuning, final model tuning, and publishing to Hugging Face.
|
| 170 |
+
If you use it, please remember to credit the author and/or cite the work.
|
| 171 |
+
|
| 172 |
## License
|
| 173 |
|
| 174 |
+
- License: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
|
| 175 |
+
- Trained models and corpora: released under this license for academic and research use.
|
| 176 |
|
| 177 |
## Citation
|
| 178 |
|
| 179 |
+
```bibtex
|
| 180 |
+
@misc{zani2025portuguesefakenews,
|
| 181 |
+
author = {ZANI, Vinícius Augusto Tagliatti},
|
| 182 |
+
title = {Avaliação comparativa de técnicas de processamento de linguagem natural para a detecção de notícias falsas em Português},
|
| 183 |
+
year = {2025},
|
| 184 |
+
pages = {61},
|
| 185 |
+
address = {São Carlos},
|
| 186 |
+
school = {Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo},
|
| 187 |
+
type = {Trabalho de Conclusão de Curso (MBA em Inteligência Artificial e Big Data)},
|
| 188 |
+
note = {Orientador: Prof. Dr. Ivandre Paraboni}
|
| 189 |
+
}
|
| 190 |
+
```
|