---
language:
- pt
license: apache-2.0
library_name: scikit-learn
pipeline_tag: text-classification
tags:
- logreg
- tfidf
- scikit-learn
- portuguese
- pt
- fake-news
- binary-classification
metrics:
- accuracy
- precision
- recall
- f1-score
---

# Logistic Regression (TF-IDF) for Fake News Detection (Portuguese)

## Model Overview

This repository contains **Logistic Regression classifiers trained on TF-IDF features** for **fake news detection in Portuguese**. Models are trained and evaluated on corpora derived from Brazilian Portuguese datasets **[Fake.br](https://github.com/roneysco/Fake.br-Corpus)** and **[FakeTrue.Br](https://github.com/jpchav98/FakeTrue.Br/)**.

- **Architecture**: Logistic Regression (scikit-learn)
- **Features**: TF-IDF over unigrams/bigrams (optionally trigrams)
- **Task**: Binary text classification (Fake vs. True)
- **Language**: Portuguese (`pt`)
- **Framework**: scikit-learn

---

## Available Variants

- **logreg-tfidf-combined**: Trained on the aligned combined corpus.
- **logreg-tfidf-fake-br**: Trained on **Fake.br**.
- **logreg-tfidf-faketrue-br**: Trained on **FakeTrue.Br**. Includes aligned splits and the original CSV when available.

Each variant ships with:

- `final_model.joblib`
- `confusion_matrix.png`
- `final_classification_report.parquet`
- `final_predictions.parquet`

---

## Training Details

```python
{
    "ngram_upper": 3,
    "min_df": 4,
    "max_df": 0.89364813327021,
    "C": 276.8531184839476,
    "solver": "saga",
}
```

---

## Evaluation Results

Evaluation metrics are stored in the repo as:

- `confusion_matrix.png`
- `final_classification_report.parquet`
- `final_predictions.parquet`

These files provide per-class performance and prediction logs for reproducibility.

---

## Corpus

The corpora used for training and evaluation are provided in the `corpus/` folder.

- **Combined (root folder)**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
- **Fake.br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
- **FakeTrue.Br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`, and `FakeTrueBr_corpus.csv`.

---

## How to Use

Each variant stores its model as `final_model.joblib`.

```python
import joblib
from huggingface_hub import hf_hub_download

# Note: this ID contains "mlp"; confirm on the Hub the ID of the
# logistic regression variant you want before downloading.
repo_id = "vzani/portuguese-fake-news-classifier-mlp-tfidf-combined"  # or fake-br / faketrue-br
filename = "final_model.joblib"

# Download the serialized model (cached locally) and load it.
model_path = hf_hub_download(repo_id=repo_id, filename=filename)
clf = joblib.load(model_path)


def predict(text: str) -> tuple[bool, float]:
    # Probability of class 1, which this card treats as true news.
    prob = clf.predict_proba([text])[0][1]
    pred = prob >= 0.5
    # Report the probability of the predicted class:
    # when the prediction is Fake, use 1 - P(True).
    prob = prob if pred else 1 - prob
    return bool(pred), float(prob)


if __name__ == "__main__":
    # English: "BOMBSHELL! Dilma is going to tax the poor even more!"
    text = "BOMBA! A Dilma vai taxar ainda mais os pobres!"
    print(predict(text))
```

The expected output is a tuple. The first entry is the classification (`True` for true news, `False` for fake news). The second entry is the probability assigned to the predicted class, ranging from 0.5 to 1.0, since the predicted class is always the more probable one.

```
(False, 1.0)
```

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Citation

Coming soon.
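
## Sketch: Mapping the Training Details onto scikit-learn

The hyperparameters listed under Training Details are not accompanied by the training code, so the following is only a minimal sketch of how they might map onto a scikit-learn pipeline. It assumes that `ngram_upper` is the upper bound of `TfidfVectorizer`'s `ngram_range` (starting at unigrams), that the vectorizer and classifier are combined in a `Pipeline`, and that labels are encoded as 0 = fake and 1 = true. The `max_iter` and `random_state` values, the `build_pipeline` helper, and the toy corpus are hypothetical and are not taken from the repository.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hyperparameters copied from the Training Details section.
PARAMS = {
    "ngram_upper": 3,
    "min_df": 4,
    "max_df": 0.89364813327021,
    "C": 276.8531184839476,
    "solver": "saga",
}


def build_pipeline(params: dict) -> Pipeline:
    """Assumed mapping of the listed hyperparameters onto scikit-learn objects."""
    vectorizer = TfidfVectorizer(
        ngram_range=(1, params["ngram_upper"]),  # assumption: n-grams start at 1
        min_df=params["min_df"],
        max_df=params["max_df"],
    )
    classifier = LogisticRegression(
        C=params["C"],
        solver=params["solver"],
        max_iter=1000,   # assumption: not stated in this card
        random_state=0,  # assumption: fixes saga's shuffling for repeatability
    )
    return Pipeline([("tfidf", vectorizer), ("clf", classifier)])


# Hypothetical toy corpus; each term appears in 5 of 10 documents,
# which satisfies min_df=4 and max_df of roughly 0.89.
toy_texts = ["bomba urgente escandalo"] * 5 + ["governo anuncia medida"] * 5
toy_labels = [0] * 5 + [1] * 5  # assumption: 0 = fake, 1 = true

toy_model = build_pipeline(PARAMS).fit(toy_texts, toy_labels)
print(toy_model.predict(["bomba urgente escandalo", "governo anuncia medida"]))
```

Wrapping both steps in one `Pipeline` is what lets a loaded model accept raw strings in `predict_proba`, as the How to Use snippet does. Whether the shipped `final_model.joblib` is structured this way is not confirmed here.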
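
## Sketch: Interpreting the Prediction Tuple

The `predict` function in How to Use turns the probability of the "true news" class into a `(label, confidence)` pair. The sketch below isolates that conversion so it can be checked without downloading a model. The helper names `label_with_confidence` and `describe`, and the configurable `threshold` parameter, are hypothetical and exist only for illustration.

```python
def label_with_confidence(prob_true: float, threshold: float = 0.5) -> tuple[bool, float]:
    """Convert P(true news) into (is_true, probability of the predicted class)."""
    if not 0.0 <= prob_true <= 1.0:
        raise ValueError("prob_true must be between 0 and 1")
    is_true = prob_true >= threshold
    # For a Fake prediction, report P(fake) = 1 - P(true).
    confidence = prob_true if is_true else 1.0 - prob_true
    return is_true, confidence


def describe(prob_true: float) -> str:
    """Render the pair as a short human-readable string."""
    is_true, confidence = label_with_confidence(prob_true)
    label = "True" if is_true else "Fake"
    return f"{label} (confidence {confidence:.2f})"


if __name__ == "__main__":
    for p in (0.75, 0.5, 0.25):
        print(p, label_with_confidence(p), describe(p))
```

With the default threshold of 0.5, the reported confidence never drops below 0.5, because it always refers to whichever class was predicted.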