# Mistral LoRA Transcript Chunking Model
## Model Overview
This LoRA adapter was trained on a custom dataset of 1,000 English transcript examples to teach `mistralai/Mistral-7B-Instruct-v0.2` to segment long transcripts into topic-based chunks, using `section #:` as the delimiter.
It enables automated topic-boundary detection in conversation, meeting, and podcast transcripts, making it ideal for preprocessing before summarization, classification, or retrieval.
## Training Objective
The model learns to:
- Detect topic changes in unstructured transcripts
- Insert `section #:` delimiters where those shifts occur
- Preserve the original flow of speech
Example (illustrative; it mirrors the prompt format in the Usage Example below rather than an actual training sample):
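```text
Input:
Let's start with last week's performance metrics. Next, we'll review
upcoming campaign deadlines.

Output:
section 1: Let's start with last week's performance metrics.
section 2: Next, we'll review upcoming campaign deadlines.
```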
## Training Configuration
- Base Model: `mistralai/Mistral-7B-Instruct-v0.2`
- Adapter Type: LoRA
- PEFT Library: `peft==0.10.0`
- Training Framework: Hugging Face Transformers
- Epochs: 2
- Optimizer: AdamW
- Learning Rate: 2e-4
- Batch Size: 8
- Sequence Length: 512
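The training script itself is not included in this card; the sketch below shows how the settings above could be wired together with `peft` and `transformers`. The LoRA rank, alpha, and target modules are assumptions (they are not documented here), the dataset is left as a placeholder, and the 512-token sequence length would be applied when tokenizing the examples.

```python
# Illustrative sketch of the configuration above; r, lora_alpha, and
# target_modules are assumed values, not the card's documented settings.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Wrap the base model with a LoRA adapter.
lora_config = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,                        # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

args = TrainingArguments(
    output_dir="transcript-chunker-lora",
    num_train_epochs=2,              # from the card
    learning_rate=2e-4,              # from the card
    per_device_train_batch_size=8,   # from the card
    optim="adamw_torch",             # AdamW, from the card
)

# trainer = Trainer(model=model, args=args, train_dataset=..., data_collator=...)
# trainer.train()
```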
## Training Metrics
| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|---|---|---|---|---|---|
| 100 | 0.2961 | 0.1603 | 0.1644 | 204,800 | 0.9594 |
| 200 | 0.1362 | 0.1502 | 0.1609 | 409,600 | 0.9603 |
| 300 | 0.1360 | 0.1451 | 0.1391 | 612,864 | 0.9572 |
| 400 | 0.0951 | 0.1351 | 0.1279 | 817,664 | 0.9635 |
| 500 | 0.0947 | 0.1297 | 0.0892 | 1,022,464 | 0.9657 |
Summary:
Training and validation loss decreased steadily, and mean token accuracy stayed above 95% throughout, indicating the model effectively learned transcript reconstruction and accurate delimiter placement.
## Usage Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "Dc-4nderson/transcript_summarizer_model"

# Load the base model, then attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
model = PeftModel.from_pretrained(model, adapter)

# Prompt format: the segmentation instruction first, then the transcript.
text = (
    "Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.\n"
    "Transcript: Let's start with last week's performance metrics. "
    "Next, we'll review upcoming campaign deadlines."
)

inputs = tokenizer(text, return_tensors="pt")
# Cap generation at a length proportionate to the input transcript.
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
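For repeated inference, the adapter can optionally be folded into the base weights using PEFT's standard `merge_and_unload()`, which removes the adapter indirection at generation time:

```python
# Optional: merge the LoRA weights into the base model for faster inference.
model = model.merge_and_unload()
```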
## License
Released under the MIT License: free for research and commercial use with attribution.
## Credits
Developed by Dequan Anderson for automated transcript segmentation and chunked text preprocessing tasks. Built using Hugging Face Transformers, PEFT, and Mistral 7B for efficient LoRA fine-tuning.