---
language: en
license: mit
tags:
- mistral
- lora
- peft
- transcript-chunking
- text-segmentation
- topic-detection
- transformers
model_type: mistral
base_model: mistralai/Mistral-7B-v0.2
datasets:
- custom-transcript-chunking
metrics:
- loss
- accuracy
---

# 🧠 Mistral LoRA Transcript Chunking Model

## Model Overview

This LoRA adapter was trained on a custom dataset of **1,000 English transcript examples** to teach a **Mistral-7B-v0.2** model how to segment long transcripts into topic-based chunks, using `section #:` as the delimiter. It enables automated **topic boundary detection** in conversation, meeting, and podcast transcripts, making it well suited as a preprocessing step before summarization, classification, or retrieval.

---

## 🧩 Training Objective

The model learns to:

- Detect topic changes in unstructured transcripts
- Insert `section #:` delimiters where those shifts occur
- Preserve the original flow of speech

**Example (illustrative):**

Input: Let's start with last week's performance metrics. Next, we'll review upcoming campaign deadlines.

Output: section 1: Let's start with last week's performance metrics. section 2: Next, we'll review upcoming campaign deadlines.

---

## ⚙️ Training Configuration

- **Base Model:** `mistralai/Mistral-7B-v0.2`
- **Adapter Type:** LoRA
- **PEFT Library:** `peft==0.10.0`
- **Training Framework:** Hugging Face Transformers
- **Epochs:** 2
- **Optimizer:** AdamW
- **Learning Rate:** 2e-4
- **Batch Size:** 8
- **Sequence Length:** 512

An illustrative sketch of a comparable LoRA training setup is included at the end of this card.

---

## 📊 Training Metrics

| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|------|---------------|-----------------|---------|------------|---------------------|
| 100  | 0.2961        | 0.1603          | 0.1644  | 204,800    | 0.9594              |
| 200  | 0.1362        | 0.1502          | 0.1609  | 409,600    | 0.9603              |
| 300  | 0.1360        | 0.1451          | 0.1391  | 612,864    | 0.9572              |
| 400  | 0.0951        | 0.1351          | 0.1279  | 817,664    | 0.9635              |
| 500  | 0.0947        | 0.1297          | 0.0892  | 1,022,464  | 0.9657              |

**Summary:** Loss decreased steadily during training, and mean token accuracy stayed consistently above **95%**, indicating that the model learned to reconstruct transcripts and place delimiters accurately.

---

## 🧰 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "Dc-4nderson/transcript_summarizer_model"

# Load the tokenizer and base model, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
model = PeftModel.from_pretrained(model, adapter)

# Prompt format: instruction followed by the raw transcript to segment.
text = (
    "Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.\n"
    "Transcript: Let's start with last week's performance metrics. "
    "Next, we'll review upcoming campaign deadlines."
)

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)  # raise the cap for longer transcripts
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

An illustrative sketch for splitting the generated output into chunks is included at the end of this card.

---

## 🧾 License

Released under the MIT License: free for research and commercial use with attribution.

---

## 🙌 Credits

Developed by Dequan Anderson for automated transcript segmentation and chunked text preprocessing tasks.
Built using Hugging Face Transformers, PEFT, and Mistral 7B for efficient LoRA fine-tuning.
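
---

## 🛠️ Illustrative LoRA Training Setup

The exact training script is not part of this repository. The sketch below shows how a comparable setup could be assembled with `peft==0.10.0` and Hugging Face Transformers using the hyperparameters from the Training Configuration section. The LoRA rank, target modules, dataset file name (`transcript_chunking.jsonl`), and data format are assumptions for illustration, not the recipe actually used for this adapter.

```python
# Illustrative sketch only; LoRA rank, target modules, and dataset path are assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "mistralai/Mistral-7B-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships without a pad token

model = AutoModelForCausalLM.from_pretrained(base)

# LoRA adapter configuration; these target modules are a common choice for Mistral.
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)

# Hypothetical JSONL file with a "text" field holding prompt + segmented transcript.
dataset = load_dataset("json", data_files="transcript_chunking.jsonl", split="train")

def tokenize(batch):
    # Truncate to the 512-token sequence length listed in the card.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Hyperparameters from the Training Configuration section (AdamW is the Trainer default).
args = TrainingArguments(
    output_dir="mistral-lora-transcript-chunking",
    num_train_epochs=2,
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("mistral-lora-transcript-chunking")
```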
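
---

## ✂️ Splitting the Output into Chunks

Downstream steps such as summarization or retrieval usually want one string per topic. The helper below is an illustrative way to split the generated text on the `section #:` delimiters; the function name and regular expression are not part of the model or its API.

```python
import re

def split_into_chunks(generated_text: str) -> list[str]:
    """Split model output into topic chunks on 'section N:' markers (illustrative helper)."""
    # Matches markers such as "section 1:" or a literal "section #:".
    pieces = re.split(r"section\s*(?:\d+|#)\s*:", generated_text, flags=re.IGNORECASE)
    return [piece.strip() for piece in pieces if piece.strip()]

segmented = (
    "section 1: Let's start with last week's performance metrics. "
    "section 2: Next, we'll review upcoming campaign deadlines."
)
print(split_into_chunks(segmented))
# ["Let's start with last week's performance metrics.",
#  "Next, we'll review upcoming campaign deadlines."]
```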