---
language: en
license: mit
tags:
- mistral
- lora
- peft
- transcript-chunking
- text-segmentation
- topic-detection
- transformers
model_type: mistral
base_model: mistralai/Mistral-7B-v0.2
datasets:
- custom-transcript-chunking
metrics:
- loss
- accuracy
---

# 🧠 Mistral LoRA Transcript Chunking Model

## Model Overview

This LoRA adapter was trained on a custom dataset of **1,000 English transcript examples** to teach a **Mistral-7B-v0.2** model how to segment long transcripts into topic-based chunks, using `section #:` as the delimiter. It enables automated **topic boundary detection** in conversation, meeting, and podcast transcripts, making it well suited as a preprocessing step before summarization, classification, or retrieval.

---

## 🧩 Training Objective

The model learns to:

- Detect topic changes in unstructured transcripts
- Insert `section #:` delimiters where those shifts occur
- Preserve the original flow of speech

**Example (illustrative):**

Input: Let's start with last week's performance metrics. Next, we'll review upcoming campaign deadlines.

Output: section 1: Let's start with last week's performance metrics. section 2: Next, we'll review upcoming campaign deadlines.

---

## ⚙️ Training Configuration

- **Base Model:** `mistralai/Mistral-7B-v0.2`
- **Adapter Type:** LoRA
- **PEFT Library:** `peft==0.10.0`
- **Training Framework:** Hugging Face Transformers
- **Epochs:** 2
- **Optimizer:** AdamW
- **Learning Rate:** 2e-4
- **Batch Size:** 8
- **Sequence Length:** 512

An illustrative sketch of a comparable LoRA training setup is included at the end of this card.

---

## 📊 Training Metrics

| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|------|---------------|-----------------|---------|------------|---------------------|
| 100  | 0.2961        | 0.1603          | 0.1644  | 204,800    | 0.9594              |
| 200  | 0.1362        | 0.1502          | 0.1609  | 409,600    | 0.9603              |
| 300  | 0.1360        | 0.1451          | 0.1391  | 612,864    | 0.9572              |
| 400  | 0.0951        | 0.1351          | 0.1279  | 817,664    | 0.9635              |
| 500  | 0.0947        | 0.1297          | 0.0892  | 1,022,464  | 0.9657              |

**Summary:** Loss decreased steadily during training, and mean token accuracy stayed consistently above **95%**, indicating that the model learned to reconstruct transcripts and place delimiters accurately.

---

## 🧰 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "Dc-4nderson/transcript_summarizer_model"

# Load the tokenizer and base model, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
model = PeftModel.from_pretrained(model, adapter)

# Prompt format: instruction followed by the raw transcript to segment.
text = (
    "Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.\n"
    "Transcript: Let's start with last week's performance metrics. "
    "Next, we'll review upcoming campaign deadlines."
)

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)  # raise the cap for longer transcripts
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

An illustrative sketch for splitting the generated output into chunks is included at the end of this card.

---

## 🧾 License

Released under the MIT License: free for research and commercial use with attribution.

---

## 🙌 Credits

Developed by Dequan Anderson for automated transcript segmentation and chunked text preprocessing tasks.
Built using Hugging Face Transformers, PEFT, and Mistral 7B for efficient LoRA fine-tuning.
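
---

## 🛠️ Illustrative LoRA Training Setup

The exact training script is not part of this repository. The sketch below shows how a comparable setup could be assembled with `peft==0.10.0` and Hugging Face Transformers using the hyperparameters from the Training Configuration section. The LoRA rank, target modules, dataset file name (`transcript_chunking.jsonl`), and data format are assumptions for illustration, not the recipe actually used for this adapter.

```python
# Illustrative sketch only; LoRA rank, target modules, and dataset path are assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "mistralai/Mistral-7B-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships without a pad token

model = AutoModelForCausalLM.from_pretrained(base)

# LoRA adapter configuration; these target modules are a common choice for Mistral.
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)

# Hypothetical JSONL file with a "text" field holding prompt + segmented transcript.
dataset = load_dataset("json", data_files="transcript_chunking.jsonl", split="train")

def tokenize(batch):
    # Truncate to the 512-token sequence length listed in the card.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Hyperparameters from the Training Configuration section (AdamW is the Trainer default).
args = TrainingArguments(
    output_dir="mistral-lora-transcript-chunking",
    num_train_epochs=2,
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("mistral-lora-transcript-chunking")
```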
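
---

## ✂️ Splitting the Output into Chunks

Downstream steps such as summarization or retrieval usually want one string per topic. The helper below is an illustrative way to split the generated text on the `section #:` delimiters; the function name and regular expression are not part of the model or its API.

```python
import re

def split_into_chunks(generated_text: str) -> list[str]:
    """Split model output into topic chunks on 'section N:' markers (illustrative helper)."""
    # Matches markers such as "section 1:" or a literal "section #:".
    pieces = re.split(r"section\s*(?:\d+|#)\s*:", generated_text, flags=re.IGNORECASE)
    return [piece.strip() for piece in pieces if piece.strip()]

segmented = (
    "section 1: Let's start with last week's performance metrics. "
    "section 2: Next, we'll review upcoming campaign deadlines."
)
print(split_into_chunks(segmented))
# ["Let's start with last week's performance metrics.",
#  "Next, we'll review upcoming campaign deadlines."]
```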