Commit 905e988 by vzani (verified) · 1 parent: f88d95f

Update README.md

Files changed (1): README.md +35 -29
README.md CHANGED
@@ -70,33 +70,29 @@ model-index:
70
  ## Model Overview
71
 
72
  This repository contains **MLP classifiers trained on TF-IDF features** for **fake news detection in Portuguese**.
73
- Models are trained and evaluated on corpora derived from Brazilian Portuguese datasets **[Fake.br](https://github.com/roneysco/Fake.br-Corpus)** and **[FakeTrue.Br](https://github.com/jpchav98/FakeTrue.Br/)**.
74
 
75
  - **Architecture**: Multi-Layer Perceptron (scikit-learn)
76
- - **Features**: TF-IDF over unigrams/bigrams
77
  - **Task**: Binary text classification (Fake vs. True)
78
  - **Language**: Portuguese (`pt`)
79
  - **Framework**: scikit-learn
 
80
 
81
  ---
82
 
83
  ## Available Variants
84
 
85
- - **mlp-tfidf-combined**
86
- Trained on the aligned combined corpus.
87
 
88
- - **mlp-tfidf-fake-br**
89
- Trained on **Fake.br**.
90
 
91
- - **mlp-tfidf-faketrue-br**
92
- Trained on **FakeTrue.Br**.
93
- Includes aligned splits and the original CSV when available.
94
 
95
- Each variant ships with:
96
- - `final_model.joblib`
97
- - `confusion_matrix.png`
98
- - `final_classification_report.parquet`
99
- - `final_predictions.parquet`
100
 
101
  ---
102
 
@@ -130,29 +126,19 @@ These files provide per-class performance and prediction logs for reproducibilit
130
 
131
  ---
132
 
133
- ## Corpus
134
-
135
- The corpora used for training and evaluation are provided in the `corpus/` folder.
136
-
137
- - **Combined (root folder)**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
138
- - **Fake.br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
139
- - **FakeTrue.Br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet` and `FakeTrueBr_corpus.csv`.
140
-
141
- ---
142
-
143
  ## How to Use
144
 
145
- This model is a **Keras** model stored as `final_bilstm_model.keras`.
146
 
147
  ```python
148
  import joblib
149
  from huggingface_hub import hf_hub_download
150
 
151
- repo_id = "vzani/portuguese-fake-news-classifier-mlp-tfidf-combined" # or fake-br / faketrue-br
152
  filename = "final_model.joblib"
153
 
154
  model_path = hf_hub_download(repo_id=repo_id, filename=filename)
155
- clf = joblib.load(model_path) # Pipeline or bare estimator
156
 
157
 
158
  def predict(text: str) -> tuple[bool, float]:
@@ -175,10 +161,30 @@ The expected output is a Tuple where the first entry represents the classificati
175
  (False, 1.0)
176
  ```
177
 
178
  ## License
179
 
180
- [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
181
 
182
  ## Citation
183
 
184
- Coming soon.
70
  ## Model Overview
71
 
72
  This repository contains **MLP classifiers trained on TF-IDF features** for **fake news detection in Portuguese**.
73
+ The model is trained and evaluated on a corpus derived from the Brazilian Portuguese dataset **[FakeTrue.Br](https://github.com/jpchav98/FakeTrue.Br/)**.
74
 
75
  - **Architecture**: Multi-Layer Perceptron (scikit-learn)
76
+ - **Features**: TF-IDF over unigrams/bigrams/trigrams
77
  - **Task**: Binary text classification (Fake vs. True)
78
  - **Language**: Portuguese (`pt`)
79
  - **Framework**: scikit-learn
80
+ - **Training source code**: https://github.com/viniciuszani/portuguese-fake-new-classifiers
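
To make the feature description concrete, the sketch below lists the unigram, bigram, and trigram terms that a TF-IDF vectorizer would weigh for one sentence. It is a plain-Python illustration rather than the training code: the `ngrams` helper, the whitespace tokenizer, and the sample sentence are all assumptions made for this example. In scikit-learn, the corresponding setting is `TfidfVectorizer(ngram_range=(1, 3))`.

```python
def ngrams(text: str, n_min: int = 1, n_max: int = 3) -> list[str]:
    """Return all word n-grams of length n_min..n_max (hypothetical helper)."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    terms = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens over the sentence.
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return terms


sample = "governo anuncia nova medida"  # made-up example sentence
terms = ngrams(sample)
print(terms)
```

A four-word sentence yields four unigrams, three bigrams, and two trigrams, so adding trigrams enlarges the vocabulary the classifier sees.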
81
 
82
  ---
83
 
84
  ## Available Variants
85
 
86
+ - [**mlp-tfidf-combined**](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-combined)
87
+ Trained on the [combined dataset](https://huggingface.co/datasets/vzani/corpus-combined) built from Fake.br and FakeTrue.Br.
88
 
89
+ - [**mlp-tfidf-fake-br**](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-fake-br)
90
+ Trained on the [Fake.br dataset](https://huggingface.co/datasets/vzani/corpus-fake-br).
91
 
92
+ - [**mlp-tfidf-faketrue-br**](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-faketrue-br)
93
+ Trained on the [FakeTrue.Br dataset](https://huggingface.co/datasets/vzani/corpus-faketrue-br).
 
94
 
95
+ Each variant has its own confusion matrix, classification report, and predictions stored as artifacts.
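
The three variant repositories share a common name prefix, so a repository id can be derived from the variant name. The snippet below is a small convenience sketch; `REPO_PREFIX`, `VARIANTS`, and `repo_id_for` are hypothetical names introduced here, while the ids themselves come from the links above.

```python
REPO_PREFIX = "vzani/portuguese-fake-news-classifier-mlp-tfidf-"
VARIANTS = ("combined", "fake-br", "faketrue-br")


def repo_id_for(variant: str) -> str:
    """Build the Hub repository id for a variant (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return REPO_PREFIX + variant


print(repo_id_for("faketrue-br"))
```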
96
 
97
  ---
98
 
 
126
 
127
  ---
128
 
129
  ## How to Use
130
 
131
+ This model is stored as `final_model.joblib`.
132
 
133
  ```python
134
  import joblib
135
  from huggingface_hub import hf_hub_download
136
 
137
+ repo_id = "vzani/portuguese-fake-news-classifier-mlp-tfidf-faketrue-br" # or fake-br / combined
138
  filename = "final_model.joblib"
139
 
140
  model_path = hf_hub_download(repo_id=repo_id, filename=filename)
141
+ clf = joblib.load(model_path)
142
 
143
 
144
  def predict(text: str) -> tuple[bool, float]:
 
161
  (False, 1.0)
162
  ```
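
The body of `predict` is not visible in this diff, so the following is only a sketch of how a `(label, confidence)` tuple could be produced, not the README's implementation. It assumes the loaded object accepts raw strings and exposes `predict_proba` and `classes_`, as a scikit-learn pipeline ending in a classifier does. The `classify` function, the `positive_label` parameter, and the `StubClassifier` stand-in are hypothetical, and which class the boolean represents depends on how the labels were encoded during training.

```python
from typing import Any


def classify(clf: Any, text: str, positive_label: Any = 1) -> tuple[bool, float]:
    """Return (is_positive, confidence) for one text.

    Assumes `clf` takes raw strings and exposes `predict_proba` and
    `classes_` (an assumption about the stored model, not a confirmed fact).
    """
    probs = list(clf.predict_proba([text])[0])  # class probabilities for the single input
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely class
    label = list(clf.classes_)[best]
    return bool(label == positive_label), float(probs[best])


class StubClassifier:
    """Stand-in with the same interface, so the sketch runs without a download."""

    classes_ = [0, 1]

    def predict_proba(self, texts):
        # Fixed, made-up probabilities: always favors class 0.
        return [[0.9, 0.1] for _ in texts]


print(classify(StubClassifier(), "texto de exemplo"))
```

With the real model, the `clf` object returned by `joblib.load` would be passed in place of the stub.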
163
 
164
+ ## Source code
165
+
166
+ You can find the source code that produced this model in the repository below:
167
+ - https://github.com/viniciuszani/portuguese-fake-new-classifiers
168
+
169
+ The source covers every step: data collection, evaluation, hyperparameter tuning, final model training, and publishing to Hugging Face.
170
+ If you use it, please remember to credit the author and/or cite the work.
171
+
172
  ## License
173
 
174
+ - Base model BERTimbau: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
175
+ - Fine-tuned models and corpora: Released under the same license for academic and research use.
176
 
177
  ## Citation
178
 
179
+ ```bibtex
180
+ @misc{zani2025portuguesefakenews,
181
+ author = {ZANI, Vinícius Augusto Tagliatti},
182
+ title = {Avaliação comparativa de técnicas de processamento de linguagem natural para a detecção de notícias falsas em Português},
183
+ year = {2025},
184
+ pages = {61},
185
+ address = {São Carlos},
186
+ school = {Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo},
187
+ type = {Trabalho de Conclusão de Curso (MBA em Inteligência Artificial e Big Data)},
188
+ note = {Orientador: Prof. Dr. Ivandre Paraboni}
189
+ }
190
+ ```