---
language:
- pt
license: apache-2.0
library_name: scikit-learn
pipeline_tag: text-classification
tags:
- logreg
- tfidf
- scikit-learn
- portuguese
- pt
- fake-news
- binary-classification
metrics:
- accuracy
- precision
- recall
- f1-score
---
# Logistic Regression (TF-IDF) for Fake News Detection (Portuguese)

## Model Overview

This repository contains **Logistic Regression classifiers trained on TF-IDF features** for **fake news detection in Portuguese**.
The models are trained and evaluated on corpora derived from the Brazilian Portuguese datasets **[Fake.br](https://github.com/roneysco/Fake.br-Corpus)** and **[FakeTrue.Br](https://github.com/jpchav98/FakeTrue.Br/)**.

- **Architecture**: Logistic Regression (scikit-learn)
- **Features**: TF-IDF over unigrams/bigrams (optionally trigrams)
- **Task**: Binary text classification (Fake vs. True)
- **Language**: Portuguese (`pt`)
- **Framework**: scikit-learn

---

## Available Variants

- **logreg-tfidf-combined**
  Trained on the aligned combined corpus.

- **logreg-tfidf-fake-br**
  Trained on **Fake.br**.

- **logreg-tfidf-faketrue-br**
  Trained on **FakeTrue.Br**.
  Includes aligned splits and the original CSV when available.

Each variant ships with:
- `final_model.joblib`
- `confusion_matrix.png`
- `final_classification_report.parquet`
- `final_predictions.parquet`

---

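## Training Details

Hyperparameters for the TF-IDF vectorizer (`ngram_upper`, `min_df`, `max_df`) and the Logistic Regression classifier (`C`, `solver`):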

```python
{
    "ngram_upper": 3,
    "min_df": 4,
    "max_df": 0.89364813327021,
    "C": 276.8531184839476,
    "solver": "saga",
}
```
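
As a rough sketch, these values could map onto a scikit-learn pipeline as shown below. The `build_pipeline` helper, the step names, and the reading of `ngram_upper` as `ngram_range=(1, ngram_upper)` are assumptions made for illustration. All other parameters are left at scikit-learn defaults, which may differ from the actual training setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hyperparameters copied from the block above.
PARAMS = {
    "ngram_upper": 3,
    "min_df": 4,
    "max_df": 0.89364813327021,
    "C": 276.8531184839476,
    "solver": "saga",
}


def build_pipeline(params: dict) -> Pipeline:
    """Hypothetical helper: assemble a TF-IDF + Logistic Regression pipeline."""
    vectorizer = TfidfVectorizer(
        # Assumption: n-grams range from unigrams up to `ngram_upper`.
        ngram_range=(1, params["ngram_upper"]),
        min_df=params["min_df"],
        max_df=params["max_df"],
    )
    classifier = LogisticRegression(C=params["C"], solver=params["solver"])
    return Pipeline([("tfidf", vectorizer), ("clf", classifier)])


if __name__ == "__main__":
    print(build_pipeline(PARAMS))
```

Wrapping both steps in a single `Pipeline` lets the fitted object accept raw strings, which matches how `predict_proba([text])` is called in the usage example.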

---

## Evaluation Results

Evaluation artifacts are stored in the repository as:
- `confusion_matrix.png`
- `final_classification_report.parquet`
- `final_predictions.parquet`

These files provide per-class performance and prediction logs for reproducibility.
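
The Parquet artifacts can be inspected with pandas. The snippet below is a minimal sketch: `load_parquet_artifact` is a hypothetical helper, and it assumes the file sits at the root of the repository and that a Parquet engine such as `pyarrow` is installed. The column layout of the files is not documented here, so the function just returns the raw DataFrame.

```python
import pandas as pd
from huggingface_hub import hf_hub_download


def load_parquet_artifact(repo_id: str, filename: str) -> pd.DataFrame:
    """Download a Parquet artifact from the Hub and load it as a DataFrame."""
    # Assumption: `filename` is relative to the repository root;
    # adjust it if the artifact is stored in a subfolder.
    path = hf_hub_download(repo_id=repo_id, filename=filename)
    return pd.read_parquet(path)
```

For example, calling `load_parquet_artifact(repo_id, "final_classification_report.parquet")` with the repository id of the chosen variant should return the per-class report.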

---

## Corpus

The corpora used for training and evaluation are provided in the `corpus/` folder.

- **Combined (root folder)**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
- **Fake.br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
- **FakeTrue.Br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet` and `FakeTrueBr_corpus.csv`.

---

## How to Use

Each variant is stored as `final_model.joblib` and can be loaded with `joblib`.

```python
import joblib
from huggingface_hub import hf_hub_download

repo_id = "vzani/portuguese-fake-news-classifier-logreg-tfidf-combined"  # or fake-br / faketrue-br
filename = "final_model.joblib"

model_path = hf_hub_download(repo_id=repo_id, filename=filename)
clf = joblib.load(model_path)


def predict(text: str) -> tuple[bool, float]:
    prob = clf.predict_proba([text])[0][1]
    pred = prob >= 0.5

    # Report the probability of the predicted class (1 - prob when the prediction is fake)
    prob = prob if pred else 1 - prob
    return bool(pred), float(prob)


if __name__ == "__main__":
    text = "BOMBA! A Dilma vai taxar ainda mais os pobres!"
    print(predict(text))

```

The output is a tuple: the first entry is the predicted class (`True` for true news, `False` for fake news), and the second is the probability assigned to the predicted class (from 0 to 1.0).
```
(False, 1.0)
```
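
To turn that tuple into a readable label, a small helper along these lines could be used. `format_prediction` is a hypothetical name and is not part of the repository.

```python
def format_prediction(result: tuple[bool, float]) -> str:
    """Render the (is_true, probability) tuple returned by predict() as text."""
    is_true, prob = result
    label = "true news" if is_true else "fake news"
    return f"{label} (probability {prob:.2f})"


if __name__ == "__main__":
    print(format_prediction((False, 1.0)))  # fake news (probability 1.00)
```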

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Citation

Coming soon.