🧠 Mistral LoRA Transcript Chunking Model

Model Overview

This LoRA adapter was trained on a custom dataset of 1,000 English transcript examples to teach mistralai/Mistral-7B-Instruct-v0.2 to segment long transcripts into topic-based chunks, using 'section #:' as the delimiter.
It enables automated topic-boundary detection in conversation, meeting, and podcast transcripts, making it well suited as a preprocessing step before summarization, classification, or retrieval.


🧩 Training Objective

The model learns to:

  • Detect topic changes in unstructured transcripts
  • Insert 'section #:' delimiters where those shifts occur
  • Preserve the original flow of speech

Example:
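A hypothetical input/output pair (illustrative only; not drawn from the training data), based on the prompt format shown in the usage example below:

Input:
Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.
Transcript: Let's start with last week's performance metrics. Next, we'll review upcoming campaign deadlines.

Output:
section 1: Let's start with last week's performance metrics.
section 2: Next, we'll review upcoming campaign deadlines.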


⚙️ Training Configuration

  • Base Model: mistralai/Mistral-7B-Instruct-v0.2
  • Adapter Type: LoRA
  • PEFT Library: peft==0.10.0
  • Training Framework: Hugging Face Transformers
  • Epochs: 2
  • Optimizer: AdamW
  • Learning Rate: 2e-4
  • Batch Size: 8
  • Sequence Length: 512
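The card does not list the LoRA hyperparameters themselves (rank, alpha, dropout, target modules), so the sketch below fills them in with common defaults for Mistral-style models; every value marked "assumed" is an assumption, not a documented setting.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
lora_config = LoraConfig(
    r=16,                                 # assumed rank; not stated in the card
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed dropout
    target_modules=["q_proj", "v_proj"],  # assumed; a common choice for Mistral attention
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # confirm only the LoRA weights are trainable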

📊 Training Metrics

Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy
100  | 0.2961        | 0.1603          | 0.1644  | 204,800    | 0.9594
200  | 0.1362        | 0.1502          | 0.1609  | 409,600    | 0.9603
300  | 0.1360        | 0.1451          | 0.1391  | 612,864    | 0.9572
400  | 0.0951        | 0.1351          | 0.1279  | 817,664    | 0.9635
500  | 0.0947        | 0.1297          | 0.0892  | 1,022,464  | 0.9657

Summary:
Both training and validation loss decreased steadily, and mean token accuracy stayed above 95% throughout, indicating the model effectively learned transcript reconstruction and delimiter placement.
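For reference, "mean token accuracy" in causal-LM training is typically the fraction of next-token predictions that match the labels. A minimal sketch, assuming the standard shift-by-one setup and -100 as the ignore index (the card does not show the exact metric code):

import torch

def mean_token_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    # Shift so each position predicts the *next* token.
    preds = logits[:, :-1, :].argmax(dim=-1)
    targets = labels[:, 1:]
    mask = targets != -100          # ignore padded/masked positions
    correct = (preds == targets) & mask
    return (correct.sum() / mask.sum()).item()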


🧰 Usage Example

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "Dc-4nderson/transcript_summarizer_model"

# Load the base model, then attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
model = PeftModel.from_pretrained(model, adapter)

# Instruction-style prompt; 'section #:' is the delimiter the adapter
# was trained to emit.
text = (
    "Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.\n"
    "Transcript: Let's start with last week's performance metrics. "
    "Next, we'll review upcoming campaign deadlines."
)

inputs = tokenizer(text, return_tensors="pt")
# Cap generation at a reasonable length; the segmented transcript is
# roughly as long as the input, so a few hundred new tokens suffice.
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
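For downstream chunking, the generated text can be split on the delimiters. A minimal post-processing sketch; the regex assumes the model emits delimiters exactly in the 'section <number>:' form:

import re

def split_sections(generated: str) -> list[str]:
    # Split on 'section <number>:' delimiters and drop empty fragments.
    parts = re.split(r"section\s*\d+\s*:", generated, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]

print(split_sections("section 1: metrics recap section 2: campaign deadlines"))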

🧾 License

Released under the MIT License; free for research and commercial use with attribution.

🙌 Credits

Developed by Dequan Anderson for automated transcript segmentation and chunked text preprocessing tasks. Built using Hugging Face Transformers, PEFT, and Mistral 7B for efficient LoRA fine-tuning.
