# Mistral LoRA Transcript Chunking Model
## Model Overview
This LoRA adapter was trained on a custom dataset of 1,000 English transcript examples to teach `mistralai/Mistral-7B-Instruct-v0.2` to segment long transcripts into topic-based chunks, using `section #:` as the delimiter.
It enables automated topic-boundary detection in conversation, meeting, and podcast transcripts, making it ideal for preprocessing before summarization, classification, or retrieval.
## Training Objective
The model learns to:
- Detect topic changes in unstructured transcripts
- Insert `section #:` delimiters where those shifts occur
- Preserve the original flow of speech
Example (illustrative; it mirrors the prompt format in the Usage Example below rather than an actual training sample):
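```text
Input:
Let's start with last week's performance metrics. Next, we'll review
upcoming campaign deadlines.

Output:
section 1: Let's start with last week's performance metrics.
section 2: Next, we'll review upcoming campaign deadlines.
```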
## Training Configuration
- Base Model: `mistralai/Mistral-7B-Instruct-v0.2`
- Adapter Type: LoRA
- PEFT Library: `peft==0.10.0`
- Training Framework: Hugging Face Transformers
- Epochs: 2
- Optimizer: AdamW
- Learning Rate: 2e-4
- Batch Size: 8
- Sequence Length: 512
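The training script itself is not included in this card; the sketch below shows how the settings above could be wired together with `peft` and `transformers`. The LoRA rank, alpha, and target modules are assumptions (they are not documented here), the dataset is left as a placeholder, and the 512-token sequence length would be applied when tokenizing the examples.

```python
# Illustrative sketch of the configuration above; r, lora_alpha, and
# target_modules are assumed values, not the card's documented settings.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Wrap the base model with a LoRA adapter.
lora_config = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,                        # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

args = TrainingArguments(
    output_dir="transcript-chunker-lora",
    num_train_epochs=2,              # from the card
    learning_rate=2e-4,              # from the card
    per_device_train_batch_size=8,   # from the card
    optim="adamw_torch",             # AdamW, from the card
)

# trainer = Trainer(model=model, args=args, train_dataset=..., data_collator=...)
# trainer.train()
```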
## Training Metrics
| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|---|---|---|---|---|---|
| 100 | 0.2961 | 0.1603 | 0.1644 | 204,800 | 0.9594 |
| 200 | 0.1362 | 0.1502 | 0.1609 | 409,600 | 0.9603 |
| 300 | 0.1360 | 0.1451 | 0.1391 | 612,864 | 0.9572 |
| 400 | 0.0951 | 0.1351 | 0.1279 | 817,664 | 0.9635 |
| 500 | 0.0947 | 0.1297 | 0.0892 | 1,022,464 | 0.9657 |
Summary:
Training and validation loss decreased steadily, and mean token accuracy stayed above 95% throughout, indicating the model effectively learned transcript reconstruction and accurate delimiter placement.
## Usage Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "Dc-4nderson/transcript_summarizer_model"

# Load the base model, then attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
model = PeftModel.from_pretrained(model, adapter)

# Prompt format: the segmentation instruction first, then the transcript.
text = (
    "Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.\n"
    "Transcript: Let's start with last week's performance metrics. "
    "Next, we'll review upcoming campaign deadlines."
)

inputs = tokenizer(text, return_tensors="pt")
# Cap generation at a length proportionate to the input transcript.
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
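For repeated inference, the adapter can optionally be folded into the base weights using PEFT's standard `merge_and_unload()`, which removes the adapter indirection at generation time:

```python
# Optional: merge the LoRA weights into the base model for faster inference.
model = model.merge_and_unload()
```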
## License
Released under the MIT License: free for research and commercial use with attribution.
## Credits
Developed by Dequan Anderson for automated transcript segmentation and chunked text preprocessing tasks. Built using Hugging Face Transformers, PEFT, and Mistral 7B for efficient LoRA fine-tuning.