NorBERT — Ch'ti Sentiment (camembert-base)

Description

NorBERT (from 'Nord' + 'BERT') is a fine-tuned version of camembert-base for sentiment analysis in Ch'ti, a regional language of Northern France.
The task is sequence classification with three labels:

- negatif
- neutre
- positif

🔧 Experimental protocol

Dataset: synthetic dataset created specifically for this project (Ch'ti sentences annotated with sentiment).
- Size: 167 examples per class (501 total).
- Split: balanced train/validation split (two CSV files).
- Columns: classe (label), phrase_chtimi (text).
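
As a minimal sketch of how a file with these two columns could be read and checked for class balance, the snippet below uses only the standard library. The inline three-row sample and the labels attached to each sentence are hypothetical stand-ins; only the column names `classe` and `phrase_chtimi` come from this card.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row sample standing in for one of the CSV files.
# Only the column names (classe, phrase_chtimi) come from the model card.
SAMPLE_CSV = """classe,phrase_chtimi
positif,"In est ravi : ce carbonade flamande !"
neutre,"Ch'est pas mal, la ducasse de Lille."
negatif,"Ch'est l'baraki, les embouteillages sur la Grand-Place !"
"""


def load_rows(handle):
    """Read (label, text) pairs from a CSV handle with the card's column names."""
    reader = csv.DictReader(handle)
    return [(row["classe"], row["phrase_chtimi"]) for row in reader]


def class_counts(rows):
    """Count examples per label so the balance claim can be verified."""
    return Counter(label for label, _ in rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
counts = class_counts(rows)
print(counts)
```

With the real files, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle, and a balanced dataset would show the same count for each of the three labels.
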
Preprocessing:
- Minimal label normalization (positif, neutre, negatif).
- Tokenization with the camembert-base tokenizer (max_length=256, truncation enabled).
- Optional class-weighted loss to handle imbalance (not needed here, since the dataset is balanced).
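
The card does not spell out what label normalization involves. One plausible reading, sketched here as an assumption, is to trim, lowercase, and strip accents so that variants such as "Négatif" map to the canonical label names. The helper names and the alphabetical id order are hypothetical; the published model's `config.id2label` remains the reference for the real mapping.

```python
import unicodedata

# Canonical labels from the card; the id order below is an assumption.
LABELS = ["negatif", "neutre", "positif"]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}


def normalize_label(raw: str) -> str:
    """Trim, lowercase, and strip accents, then validate against LABELS."""
    text = unicodedata.normalize("NFKD", raw.strip().lower())
    # Drop combining marks left over from decomposition (e.g. the accent of "é").
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    if text not in LABEL2ID:
        raise ValueError(f"unknown label: {raw!r}")
    return text


def encode_label(raw: str) -> int:
    """Map a raw label string to its integer class id."""
    return LABEL2ID[normalize_label(raw)]


print(normalize_label(" Négatif "), encode_label("POSITIF"))
```

Rejecting unknown labels with an exception, rather than silently mapping them to a default class, makes annotation typos visible before training.
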
Training:
- Backbone: camembert-base
- Full fine-tuning (no LoRA) on Google Colab Pro (GPU).
- Optimizer: AdamW, learning_rate = 2e-5
- Batch size: 16 (train) / 32 (eval)
- Epochs: 5, with early stopping (patience=3)
- Loss: class-weighted CrossEntropyLoss (kept for robustness to class imbalance, although this dataset is balanced).
- Metrics: accuracy, macro F1, macro precision, and macro recall.
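
To illustrate why the class weighting has no effect on a balanced dataset, here is a small sketch of inverse-frequency weights. The formula (total / (n_classes * count)) is the common "balanced" heuristic, not necessarily the exact scheme used in this project, and the function name is hypothetical. The 167-per-class counts are the card's overall figures, used for illustration.

```python
from collections import Counter


def inverse_frequency_weights(labels, classes):
    """Weight each class by total / (n_classes * count).

    Assumes every class in `classes` occurs at least once in `labels`.
    """
    counts = Counter(labels)
    total = len(labels)
    return [total / (len(classes) * counts[c]) for c in classes]


classes = ["negatif", "neutre", "positif"]

# Balanced case: 167 examples per class, as reported in the card.
balanced = [c for c in classes for _ in range(167)]
weights = inverse_frequency_weights(balanced, classes)
print(weights)  # every weight equals 1.0

# Hypothetical imbalanced case: rarer classes get larger weights.
skewed = ["negatif"] * 6 + ["neutre"] * 3 + ["positif"] * 1
skewed_weights = inverse_frequency_weights(skewed, classes)
print(skewed_weights)
```

In PyTorch, such a list could be passed as `torch.nn.CrossEntropyLoss(weight=torch.tensor(weights))`. With all weights equal to 1.0, this reduces to the unweighted loss.
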
Evaluation (validation set):
- Accuracy: 1.0
- F1-macro: 1.0
- Confusion matrix: perfect (no misclassified validation examples)
  ⚠️ These scores are likely overestimated because the train and validation sets are very similar; the model still needs to be tested on unseen sentences.

Explanation: the datasets are synthetic, generated by combining words and phrases drawn from predefined lists, so the train and validation sets closely resemble each other. This model is a poetic and cultural tribute and a demonstration of technical skills for a portfolio; it is not a product intended for commercial use. Other models much closer to professional applications can be found on my GitHub and on my Hugging Face profile (e.g., irrigation, real estate).
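
One way to quantify this train/validation similarity, sketched under assumptions, is to compute for each validation sentence its highest token-level Jaccard similarity to any training sentence. The function names and example sentences are hypothetical, and whitespace tokenization is a deliberately crude stand-in for the real tokenizer.

```python
def tokens(sentence: str) -> frozenset:
    """Lowercased whitespace tokens; crude, but enough to spot template reuse."""
    return frozenset(sentence.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_train_similarity(train, val):
    """For each validation sentence, its best Jaccard score against the training set."""
    train_sets = [tokens(s) for s in train]
    return [max(jaccard(tokens(v), t) for t in train_sets) for v in val]


# Hypothetical sentences, for illustration only.
train = [
    "in est ravi : ce carbonade flamande !",
    "ch'est pas mal, la ducasse de lille.",
]
val = [
    "in est ravi : ce ducasse de lille !",
    "ch'est pas mal, la ducasse de lille.",
]

scores = nearest_train_similarity(train, val)
token_identical = sum(score == 1.0 for score in scores)
print(scores, token_identical)
```

Scores close to 1.0 for most validation sentences would support the warning above: a perfect validation score then says little about generalization to unseen sentences.
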

Publication: the model and tokenizer were pushed to the Hugging Face Hub with trainer.push_to_hub() and tokenizer.push_to_hub().

🚀 Usage

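from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and its tokenizer from the Hub
repo = "jeromex1/NorBERT_Chti"
tok = AutoTokenizer.from_pretrained(repo)
mdl = AutoModelForSequenceClassification.from_pretrained(repo)

txt = "In est fier de le marché du dimanche"
# Use the same truncation settings as in training
enc = tok(txt, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    # Softmax over the three class logits gives one probability per label
    probs = mdl(**enc).logits.softmax(-1).squeeze()

# Print each label with its predicted probability
print({mdl.config.id2label[i]: float(probs[i]) for i in range(len(probs))})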

👉 Try the Chat Console (Space) with sample phrases such as:

"Ch’est l’baraki, les embouteillages sur la Grand-Place !"
(It’s a mess, traffic jams on the Grand-Place! “Baraki” is a slang term meaning chaotic or low-class.)
"Ch’est pas mal, la ducasse de Lille."
(Not bad at all, the Lille fair. A “ducasse” is a traditional fair in Northern France.)
"In est ravi : ce carbonade flamande !"
(We’re delighted: this Flemish stew is amazing! “Carbonade flamande” is a regional beef stew braised in beer.)

🎯 Why this project matters

NorBERT_Chti is not just a technical demo: it is a tribute to regional language, cultural identity, and the expressive power of NLP.
It shows how synthetic data generation, fine-tuning, and deployment can be combined into a playful, end-to-end sentiment classifier.
It was built to demonstrate end-to-end AI skills in a portfolio.

📘 Discover my 40 AI and STEM projects here:

👉 github.com/Jerome-openclassroom

Model size: 0.1B params (Safetensors, tensor type F32)
