---
library_name: transformers
license: apache-2.0
datasets:
- alakxender/voice-synthetic
language:
- dv
tags:
- dhivehi-tts
---

# CSM-1B Dhivehi Multispeaker

Dhivehi speech generation model based on [`sesame/csm-1b`](https://huggingface.co/sesame/csm-1b), fine-tuned on synthetic male and female Dhivehi voice data.

- **Base Model:** [`sesame/csm-1b`](https://huggingface.co/sesame/csm-1b)
- **Dataset:** [`alakxender/voice-synthetic`](https://huggingface.co/datasets/alakxender/voice-synthetic)
  - Female speaker: `role = "0"`
  - Male speaker: `role = "1"`

## Usage

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "alakxender/csm-1b-dhivehi-2-speakers"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# Set speaker and input Dhivehi text
role = "0"  # "0" for female, "1" for male
content = "މެލޭޝިއާގައި އިތުރުކުރާ ޓެކްސް، ދިވެހި ދަރިވަރުންނަށް ބުރައަކަށް ނުވާނެ ގޮތެއް ހޯދައިދޭނަން: ހައިދަރު"

conversation = [
    {"role": role, "content": [{"type": "text", "text": content}]}
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True
).to(device)

# Generate audio
audio = model.generate(**inputs, output_audio=True)

# Save to file
processor.save_audio(audio, f"output_{role}.wav")
```

More usage info at: [`sesame/csm-1b`](https://huggingface.co/sesame/csm-1b)

## Training Details

- **Epochs:** 3
- **Global Steps:** 24,408
- **Training Loss:** 0.89
- **Final Loss:** 3.35
- **Gradient Norm:** 3.31
- **Learning Rate:** ~8.38e-7
- **FLOPs:** 436,376,769,022,130,240
- **Runtime:** 4.59 hours
- **Samples/sec:** 11.83
- **Steps/sec:** 1.48

## Dataset Overview

[`alakxender/voice-synthetic`](https://huggingface.co/datasets/alakxender/voice-synthetic):

- Synthetic TTS dataset with aligned Dhivehi text and audio
- Two distinct speaker IDs:
  - `"0"`: Female synthetic voice
  - `"1"`: Male synthetic voice

## Notes

- The model is suitable for Dhivehi TTS tasks with a controllable speaker voice.
- Speaker identity is selected via the `role` field in the chat input template.
- This setup allows simple voice switching without changing the architecture.

## Disclaimer

This fine-tuned checkpoint was created for Dhivehi speech synthesis and is intended for research and educational use only.

All voice outputs generated by this model are entirely synthetic. Any resemblance to real persons, living or deceased, is purely coincidental and unintentional.

The creators of this model do not endorse or condone the use of this system for:

- Impersonation or deepfake purposes
- Deceptive content generation
- Harassment, misinformation, or manipulation
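
## Appendix: Building the Conversation Payload

The chat-template input shown in Usage is a one-message conversation whose `role` selects the speaker. A small helper (hypothetical, not part of this model's API) makes that structure explicit and validates the two supported speaker IDs:

```python
def build_conversation(role: str, text: str) -> list:
    """Build the one-message conversation passed to processor.apply_chat_template.

    role: "0" for the female voice, "1" for the male voice (per the dataset).
    """
    if role not in ("0", "1"):
        raise ValueError("role must be '0' (female) or '1' (male)")
    return [{"role": role, "content": [{"type": "text", "text": text}]}]


# Rendering the same sentence with both speakers only changes `role`;
# each conversation is then fed through the processor and model.generate
# exactly as in the Usage section.
for role in ("0", "1"):
    conversation = build_conversation(role, "ހެލޯ")
    print(conversation[0]["role"])
```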